Scenario-based interview questions on Big Data

1. There are 50 columns in a Spark DataFrame, say df, and all of them need to be cast to string. To keep the code generic, it is not recommended to cast each column by writing out its name. How would you achieve this in Spark using Scala?

Answer : Given the DataFrame df, the columns function returns an array of all its column names.

Using map, we can cast each column dynamically, and the resulting array can be passed to select.

The syntax is:

import org.apache.spark.sql.functions.col

val castedList = df.columns.map(x => col(x).cast("string"))

val castedDf = df.select(castedList: _*)

You can verify the schema of castedDf using

castedDf.printSchema

 

2. Suppose a Spark job runs 3 to 4 times every day and loads data into a Hive table. What would be the best approach to distinguish the data based on the time it was loaded?

Answer : We can create a Hive table partitioned on, say, batchtime, which is simply a column generated while inserting data into the table.

The command below gives the current time, which acts as the batchtime in the Hive table:

val batchtime = System.currentTimeMillis()

The DataFrame that writes to the partitioned table then carries a batchtime column, which acts as the partition column:

df.withColumn("batchtime", lit(batchtime))
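For completeness, here is a minimal sketch of the full write path, assuming a SparkSession named spark with Hive support and an existing Hive table db.events partitioned by batchtime (the database and table names are illustrative):

import org.apache.spark.sql.functions.lit

// Capture the load time once per run
val batchtime = System.currentTimeMillis()

// Add the load time as the partition column
val dfWithBatch = df.withColumn("batchtime", lit(batchtime))

// Allow dynamic partition inserts, then append into the partitioned Hive table;
// insertInto matches columns by position, with the partition column last.
spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
dfWithBatch.write.mode("append").insertInto("db.events")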

3. Assume you want to assign a unique id to each record of a DataFrame. How would you achieve it?

Answer : We can use monotonically_increasing_id() inside withColumn.
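A minimal sketch (the column name record_id is illustrative); note that the generated ids are unique and increasing, but not consecutive across partitions:

import org.apache.spark.sql.functions.monotonically_increasing_id

// Adds a unique 64-bit id to every row
val dfWithId = df.withColumn("record_id", monotonically_increasing_id())

dfWithId.show(5)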

4. How will you copy an HDFS file into a local directory?

Answer : Using the command

hadoop fs -get <hdfs_path> <local_dir>

 

5. How would you see the running applications in YARN from the command line? And how would you kill an application?

Answer :

yarn application -list

yarn application -kill <application_id>

 

6. Say you have website data containing information about logged-in users, where each user may have multiple fields, but the number of fields per user varies based on their actions. Which component of the Hadoop ecosystem would you use to store this data?

Answer : HBase, a NoSQL database, since it allows each row to have a different set of columns.

 

7. Say you have an HBase table. Is it possible to create a Hive table on top of it without any manual data movement, so that any change in the HBase table is reflected in the Hive table?

Answer : Yes, we can achieve this by creating a Hive table that points to HBase as its data source. In that case, any change in the HBase data is reflected in Hive as well.

We need to use the HBase storage handler (org.apache.hadoop.hive.hbase.HBaseStorageHandler), along with a column mapping to the HBase column families, while creating the table.

 

8. Assume data from external sources lands in HDFS in CSV format on a daily basis. How would you handle it efficiently so that it can be processed by other applications while also reducing the storage footprint?

Answer : Store the data in ORC or Parquet format in Hive, delete the old CSV files from HDFS, and create the business date (partdate) as a partition column in Hive.
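A minimal sketch of the conversion, assuming a SparkSession named spark, an incoming path /data/incoming/, and a business date column partdate already present in the CSV (all names and paths are illustrative):

// Read the day's CSV files
val csvDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("/data/incoming/")

// Rewrite as Parquet (or use .format("orc")), partitioned by the business date
csvDf.write
  .mode("append")
  .partitionBy("partdate")
  .parquet("/data/warehouse/events")

Once the converted data is verified, the old CSV files can be removed with hadoop fs -rm -r.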

 

9. There are 5,000,000 records in a Hive table and you have loaded it into spark-shell for development purposes. What would be the best practice for writing code? Would you process all 5,000,000 records with each line of code?

Answer : In that case we can use the limit function (say, 1,000 records), cache the result, and develop against it. Once the complete code is ready, we can process all the data.
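A minimal sketch, assuming a SparkSession named spark and a table db.big_table (the name is illustrative):

// Develop against a small cached sample
val sample = spark.table("db.big_table").limit(1000).cache()
sample.count()   // materializes the cache

// ... build and test the transformations on `sample` ...

// When the code is ready, point it at the full table instead:
// val full = spark.table("db.big_table")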

 

10. Assume that you want to load a file containing timestamp values (yyyy-MM-dd HH:mm:ss) into Apache Pig and, after loading, add one day to each timestamp value. How would you achieve this?

Answer : For a detailed explanation, please read

https://bigdataprogrammers.com/load-timestamp-values-file-pig/

 

Tip : Don’t miss the tutorial on Top Big Data Courses on Udemy You Should Buy.
