Interview Q&A

Big Data Engineering Interview Questions

1. A CSV file with a header row is stored at an HDFS location. Which property needs to be set when reading it into Spark? Answer: When reading it into a DataFrame, set the header option to true, as below: var df1=spark.read.option("header",true).csv("path") …
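A minimal sketch of the header option in action, written as a script for spark-shell or a similar Scala script runner with Spark on the classpath. The temporary local file, its contents, and the local[*] master are assumptions made so the example is self-contained; in the question the file would sit at an HDFS path instead.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Local session for illustration; in spark-shell this returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-header-sketch")
  .getOrCreate()

// Hypothetical sample data standing in for the CSV file on HDFS.
val csvPath = Files.createTempFile("people", ".csv")
Files.write(csvPath, "id,name\n1,alice\n2,bob\n".getBytes("UTF-8"))

// With header=true the first line supplies the column names instead of becoming a data row.
val df1 = spark.read.option("header", "true").csv(csvPath.toUri.toString)

df1.printSchema()
df1.show()
```

Without the option, Spark names the columns _c0, _c1 and treats the header line as an ordinary record.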

Top 35 data engineer interview questions and answers – All in one

Q 1. What is the retention period of a Kafka topic? Answer: Kafka events (messages) older than the retention period are deleted, so consumers can no longer consume them. Kafka retains data only for the retention period. Say the retention period is 7 days; then…

Spark Scenario based Interview Questions with Answers – 2

Q.1 There is a JSON file with the following content: {"dept_id":101,"e_id":[10101,10102,10103]} {"dept_id":102,"e_id":[10201,10202]} The data is loaded into a Spark DataFrame, say mydf, with the dtypes below: dept_id: bigint, e_id: array<bigint>. What is the best way to get each e_id individually along with its dept_id? Answer: We can use the explode function…
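The explode approach named in the answer might look like the sketch below. To keep it self-contained, mydf is built in memory from the two records in the question rather than read from the JSON file; the session setup and the output name exploded are assumptions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("explode-sketch")
  .getOrCreate()
import spark.implicits._

// Same shape as the question: dept_id: bigint, e_id: array<bigint>.
val mydf = Seq(
  (101L, Seq(10101L, 10102L, 10103L)),
  (102L, Seq(10201L, 10202L))
).toDF("dept_id", "e_id")

// explode emits one output row per array element, repeating dept_id alongside it.
val exploded = mydf.select($"dept_id", explode($"e_id").as("e_id"))

exploded.show()
```

Each department row becomes as many rows as it has employee IDs, so the two input records yield five rows in total.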

Spark Interview Questions – Part 2

Q1: What is the use of query.awaitTermination() in Structured Streaming? Answer: In batch processing, we generally load the whole dataset and store it at once. In real-time streaming, however, data mostly arrives in micro-batches based on the trigger processing time, hence the streaming query should…
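As a rough illustration of where awaitTermination() sits in a streaming job, the sketch below uses Spark's built-in rate source and console sink. The source, the 2-second trigger, and the 10-second timeout are all assumptions chosen so the example finishes on its own; a real job would typically call awaitTermination() with no timeout.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("await-termination-sketch")
  .getOrCreate()

// Built-in test source that continuously emits (timestamp, value) rows.
val stream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "1")
  .load()

// start() returns immediately; the micro-batches run on background threads.
val query = stream.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()

// Block the calling thread so the application does not exit while batches are still arriving.
// The timed variant returns false if the query is still running when the timeout expires.
val finished = query.awaitTermination(10000L)

query.stop()
```

Without the blocking call, the main thread of a standalone application would reach the end of the program right after start() and the query would be shut down with it.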

Spark Scenario based Interview Questions

A piece of Scala code is written in a file myApp.scala. Is it possible to run the complete code in the Spark shell without manually copying it? Answer: Yes, it is possible to run it without copying; we just need to put the file in a directory from…

Scenario based interview questions on Big Data

1. A Spark DataFrame, say df, has 50 columns, and all of them need to be cast to string. To keep the code generic, casting columns individually by name is not recommended. How would you achieve this in Spark using Scala? Answer: As…
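The answer in this excerpt is cut off, so the following is only one common way to do it and may differ from the post's own solution: map over df.columns and cast each column generically. The three-column sample DataFrame and the name allStrings are assumptions standing in for the 50-column df.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cast-all-columns-sketch")
  .getOrCreate()
import spark.implicits._

// Small stand-in for the 50-column DataFrame in the question.
val df = Seq((1, 2.5, true), (2, 3.5, false)).toDF("id", "score", "active")

// Build one cast expression per column from df.columns, so no column name is hard-coded.
val allStrings = df.select(df.columns.map(c => col(c).cast("string").as(c)): _*)

allStrings.printSchema()
```

Because the expression list is derived from df.columns, the same line works unchanged whether the DataFrame has 3 columns or 50.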

Hive Scenario Based Interview Questions with Answers

1. Let’s say a Hive table is created as an external table. If we drop the table, will the data be accessible? Answer: The data will be accessible even if the table gets dropped. We can get the data from the table’s HDFS location. 2. A Hive table is created…

Hive Most Asked Interview Questions With Answers – Part II

What is bucketing and what is it used for? Answer: Bucketing is an optimization technique that clusters datasets into more manageable parts, which helps to optimize query performance. See the post below for an example. https://bigdataprogrammers.com/bucketing-in-hive/ What is over-partitioning and what is the…

Spark Interview Questions Part-1

Suppose you have a Spark DataFrame that contains millions of records, and you need to perform multiple actions on it. How will you minimize the execution time? Answer: You can use cache or persist. For example, say you have a DataFrame df; if you use df1=df.cache(), then df1 will be…
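A small sketch of the caching pattern the answer describes, assuming a generated one-million-row DataFrame in place of real data; the names total and evens are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("cache-sketch")
  .getOrCreate()

// Generated stand-in for a DataFrame with millions of records.
val df = spark.range(0L, 1000000L).toDF("id")

// cache() is lazy: it only marks the DataFrame for caching.
val df1 = df.cache()

// The first action computes the data and stores it; later actions can reuse the stored copy.
val total = df1.count()
val evens = df1.filter(col("id") % 2 === 0).count()

// Release the cached data once the repeated actions are done.
df1.unpersist()
```

persist() works the same way but accepts an explicit storage level when the default is not suitable.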

Hive Most Asked Interview Questions With Answers – Part I

What is Hive and why is it useful? Hive is a data warehouse application where data is stored in a structured format. It is used for querying and managing large datasets. It provides a SQL-like interface to access the data, through a query language called HiveQL (HQL). What is the advantage of…