Top 35 Data Engineer Interview Questions and Answers – All in One

Q 1. What is the retention period of a Kafka topic?
Answer: Kafka retains data only for the configured retention period. Events or messages older than the retention period are deleted and are no longer available for consumers to consume. For example, if the retention period is 7 days, you cannot read data older than 7 days.

Q 2. A job which reads data from a Kafka topic and then processes it is failing because the offsets it has to read are older than the retention period. How do you handle it?

Answer: The job should have the property failOnDataLoss set to false. In that case it will not fail with an error, but you will lose the expired data.

Below is the description of the property from the Spark Structured Streaming + Kafka integration guide:

failOnDataLoss: Whether to fail the query when it’s possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn’t work as you expected. Batch queries will always fail if it fails to read any data from the provided offsets due to lost data. 
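As a hedged sketch of where the option goes (assuming a Structured Streaming read from Kafka; the topic name and bootstrap server are placeholders):

val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
  .option("subscribe", "sampleTopic1")                    // placeholder topic
  .option("failOnDataLoss", "false")                      // do not fail the query when offsets have expired
  .load()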

Q 3. Which mechanism [push or pull] is used by a Kafka consumer to read data from brokers?

Answer: Data is pulled from the broker by the consumer. If it were pushed by the broker to the consumers, the transfer rate would be controlled by the broker rather than the consumer. With the pull model, each consumer reads from the broker at its own maximum rate, independent of other consumers which might be slower or faster. That is why the pull mechanism is used.

Q 4. Is it possible to read data from a Kafka topic from a fixed offset using the command line?

Answer: Yes, it is possible. Each partition of the topic has its own offsets, so both the partition and the offset have to be specified. To read data from a fixed offset, use the command below and change the --partition and --offset values as per your requirement:

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic sampleTopic1 --property print.key=true --partition 0 --offset 12

 


Q 5. There is a JSON file with the following content:

{"dept_id":101,"e_id":[10101,10102,10103]}

{"dept_id":102,"e_id":[10201,10202]}

The data is loaded into a Spark dataframe, say mydf, with the following dtypes:

dept_id: bigint, e_id: array<bigint>

What is the best way to get each e_id individually along with its dept_id?

Answer:

We can use the explode function, which produces one row per item in the e_id array.

The code would be:

mydf.withColumn("e_id", explode($"e_id"))

Here the new column has been given the same name as the old column, so the dtypes of the resulting dataframe will be:

dept_id: bigint, e_id: bigint

So the output would look like:

+-------+-----+
|dept_id| e_id|
+-------+-----+
|    101|10101|
|    101|10102|
|    101|10103|
|    102|10201|
|    102|10202|
+-------+-----+
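A minimal end-to-end sketch, assuming the JSON lines are saved at a hypothetical path /tmp/depts.json:

import org.apache.spark.sql.functions.explode
import spark.implicits._   // for the $"..." column syntax (already in scope in spark-shell)

val mydf = spark.read.json("/tmp/depts.json")          // hypothetical path
val opdf = mydf.withColumn("e_id", explode($"e_id"))   // one row per element of the array
opdf.show()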

 

Q 6. How many columns will be present in df2, if df1 has three columns a1, a2, a3 and:

var df2 = df1.withColumn("b1", lit("a1")).withColumn("a1", lit("a2")).withColumn("a2", $"a2").withColumn("b2", $"a3").withColumn("a3", lit("b1"))

Answer:

5 in total, as below:

df1 // a1, a2, a3

df1.withColumn("b1", lit("a1")) // a1, a2, a3, b1

.withColumn("a1", lit("a2")) // a1, a2, a3, b1 (a1 is overwritten, no new column)

.withColumn("a2", $"a2") // a1, a2, a3, b1

.withColumn("b2", $"a3") // a1, a2, a3, b1, b2

.withColumn("a3", lit("b1")) // a1, a2, a3, b1, b2

 

Q 7. How do you get an RDD together with its element indices?

Say myrdd = (a1, b1, c1, s2, s5)

The output should be:

((a1,0), (b1,1), (c1,2), (s2,3), (s5,4))

Answer:

We can use the zipWithIndex function:

var myrdd_windx = myrdd.zipWithIndex()
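For example, in spark-shell (string elements stand in for a1 … s5):

val myrdd = sc.parallelize(Seq("a1", "b1", "c1", "s2", "s5"))
val myrdd_windx = myrdd.zipWithIndex()   // RDD[(String, Long)]
myrdd_windx.collect()                    // Array((a1,0), (b1,1), (c1,2), (s2,3), (s5,4))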


Q 8. What is the use of query.awaitTermination() in Structured Streaming?

Answer: In batch processing we generally load and store the whole dataset at once, and the job ends when it finishes. In real-time streaming, data arrives in micro-batches (usually driven by the trigger processing time), so the streaming query must keep running until it is terminated by a failure or an explicit stop. query.awaitTermination() blocks the main thread so the application stays alive and keeps processing the real-time data.
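A minimal sketch of where the call sits (the Kafka source options and the console sink are assumptions for illustration):

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // placeholder broker
  .option("subscribe", "sampleTopic1")                    // placeholder topic
  .load()
  .writeStream
  .format("console")
  .start()

query.awaitTermination()   // block the main thread so the query keeps processing micro-batches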

Q 9 . Spark automatically monitors cache usage on each node and drops out old data partitions. What is the manual way of doing it ?

Answer: The RDD.unpersist() method removes the cached data manually.

Q 10. In Spark SQL, what would be the output of the below?

SELECT true <=> NULL;

Answer: false. (<=> is the null-safe equality operator; it returns false rather than NULL when only one side is NULL.)

Q 11. Is it possible to load images into a Spark dataframe?

Answer: Yes, using the command below:

spark.read.format("image").load("path of image")

It will have a fixed set of columns, and the actual image data is stored in a binary field.
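For reference, a quick sketch of what the schema looks like (the directory path is a placeholder; exact nullability flags omitted):

val imgDf = spark.read.format("image").load("/tmp/images/")   // placeholder directory
imgDf.printSchema()
// root
//  |-- image: struct
//  |    |-- origin: string       (file path)
//  |    |-- height: integer
//  |    |-- width: integer
//  |    |-- nChannels: integer
//  |    |-- mode: integer
//  |    |-- data: binary         (the image bytes)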

Q 12. In your Spark application there is a dependency on one class which is present in a jar file (abcd.jar) somewhere on your cluster. You don't have a fat jar for your application. How would you use it?

Answer: While submitting a Spark job using spark-submit, we can pass abcd.jar with --jars. That way it is shipped to the driver and executors, and our main application can use it.

And if we want to use it in spark-shell, we can either start the shell with --jars abcd.jar or add it to the classpath from inside the shell with:

:require abcd.jar


Q 13. There is Scala code written in a file myApp.scala. Is it possible to run the complete code in spark-shell without manually copying the code?

Answer: Yes, it is possible to run it without copying. Put the file in the directory from which you started spark-shell, and in the shell use the command below:

:load myApp.scala

You can mention the complete path if the file is present somewhere else. This is useful when testing application code before packaging it into a jar.

 

Q 14. You have a dataframe mydf with three columns a1, a2, a3, but column a2 needs to have the new name b2. How would you do it?

Answer: Spark dataframes have a function to rename a column: withColumnRenamed. It takes two arguments: the first is the existing column name and the second is the new column name.

So the syntax would be:

var newdf = mydf.withColumnRenamed("a2", "b2")

 

Q 15. Suppose you have two dataframes df1 and df2 with the below columns:

df1 => id, name, mobno

df2 => id, pincode, address, city

After joining both dataframes on the key id, while selecting id, name, mobno, pincode, address, city you get an "ambiguous column id" error. How would you resolve it?

Answer: How to select the id column depends on the type of join we are performing.

  • If it is an inner join, both ids of df1 and df2 hold the same values, so before selecting we can drop either one:

var joined_df = df1.join(df2, df1("id") === df2("id")).drop(df2("id"))

OR

var joined_df = df1.join(df2, df1("id") === df2("id")).drop(df1("id"))

  • If it is a left join, we drop the id that can contain null values, i.e. df2("id"):

var joined_df = df1.join(df2, df1("id") === df2("id"), "left").drop(df2("id"))

  • If it is a right join, we drop df1("id") for the same reason:

var joined_df = df1.join(df2, df1("id") === df2("id"), "right").drop(df1("id"))

  • If it is a full join, we can rename both ids, df1("id") and df2("id"), before joining and use them as needed, as sketched below.
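A hedged sketch of the full-join case (the renamed columns id1 and id2 are illustrative):

var joined_df = df1.withColumnRenamed("id", "id1")
  .join(df2.withColumnRenamed("id", "id2"), $"id1" === $"id2", "full")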

 

Q 16. You have a list of columns which you need to select from a dataframe. The list gets updated every time you run the application, but the base dataframe (say bsdf) remains the same. How would you select only the columns that are in the given list for that run?

Answer: Let's say the list mycols holds all the required column names; we can use the command below:

var newdf = bsdf.select(mycols.map(col): _*)

(If mycols already holds Column objects rather than name strings, bsdf.select(mycols: _*) works directly.)

Here newdf will have a different schema on every run, depending on mycols.

 

Q 17. You have a dataframe df1 and a list of qualified cities where you need to run offers, but df1 has all the cities where your business runs. How would you get the records only for the qualified cities?

Answer: We can use the filter function: if a record's city is present in the qualified list it is kept, otherwise it is dropped.

var qualified_records = df1.filter($"city".isin(qualified_cities: _*))

 

Q 18. There are 50 columns in a Spark dataframe, say df, and all of them need to be cast to string. To keep the code generic, it is not recommended to cast individual columns by writing out each column name. How would you achieve this in Spark using Scala?

Answer: Using the columns function we can get all the column names of df. Then, using map, we can cast them dynamically, and the resulting list can be used in a select.

The syntax will be:

var casted_list = df.columns.map(x => col(x).cast("string"))

var castedDf = df.select(casted_list: _*)

You can verify the schema of castedDf using:

castedDf.printSchema

 

Q 19. Suppose you run a Spark job 3 to 4 times every day and it loads data into a Hive table. What is the best approach to distinguish the data by the time at which it was loaded?

Answer: We can create a Hive table partitioned on, say, batchtime, which is simply a column generated while inserting data into the Hive table.

We can use the command below to get the current time, which acts as the batchtime for that load:

var batchtime = System.currentTimeMillis()

The dataframe that writes to the partitioned table then gets a batchtime column, which acts as the partition column:

df.withColumn("batchtime", lit(batchtime))
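A hedged sketch of the write (the table name mydb.mytable is hypothetical; insertInto assumes the partitioned Hive table already exists with batchtime as its last, partition column):

import org.apache.spark.sql.functions.lit

val batchtime = System.currentTimeMillis()
df.withColumn("batchtime", lit(batchtime))
  .write
  .mode("append")
  .insertInto("mydb.mytable")   // hypothetical pre-created table partitioned by batchtime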

Q 20. Assume you want to generate a unique id for each record of a dataframe. How would you achieve it?

Answer: We can use monotonically_increasing_id() inside withColumn.
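For example (the column name uid is illustrative):

import org.apache.spark.sql.functions.monotonically_increasing_id

val dfWithId = df.withColumn("uid", monotonically_increasing_id())   // unique 64-bit ids, increasing but not consecutive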

Q 21. How will you get an HDFS file into a local directory?

Answer:

Using the command:

hadoop fs -get <hdfs path> <local dir>

 

Q 22. How would you see the running applications in YARN from the command line? And how will you kill an application?

Answer:

yarn application -list

yarn application -kill <application-id>

 

Q 23. Say you have data from a website containing information about logged-in users. One user may have multiple fields, but the number of fields per user varies based on the user's actions. In that case, which Hadoop component would you use to store the data?

Answer: HBase, a NoSQL database, since it handles a variable number of columns per row well.

 

Q 24. Say you have one HBase table. Is it possible to create a Hive table on top of it, without any manual data movement, such that any change in the HBase table is reflected in the Hive table?

Answer: Yes, we can achieve it by creating a Hive table that points to HBase as its data source. In that case, any change in the HBase data is reflected in Hive as well.

We need to use the HBase storage handler (org.apache.hadoop.hive.hbase.HBaseStorageHandler) while creating the table.

 

Q 25. Assume data from external sources is being loaded into HDFS in CSV format on a daily basis. How would you handle it efficiently so that it can be processed by other applications while also reducing the storage used?

Answer: Convert the data into a Hive table stored as ORC or Parquet (compressed columnar formats), delete the old raw CSV data from HDFS, and partition the Hive table on a business date column (say partdate).

 

Q 26. There are 5,000,000 records in a Hive table and you have loaded it into spark-shell for development purposes. What is the best practice for writing the code? Would you process all 5,000,000 records with each line of code?

Answer: In that case we can take a small sample using the limit function (say 1000 records), cache it, and develop against that. When the complete code is ready, we process all the data.
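A small sketch of that workflow (the table name and sample size are illustrative):

val fullDf = spark.table("mydb.mytable")   // hypothetical Hive table with 5,000,000 records
val devDf = fullDf.limit(1000).cache()     // small cached sample for development
devDf.count()                              // first action materialises the cache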

 

Q 27. Assume you want to load a file with timestamp values (yyyy-MM-dd HH:mm:ss) into Apache Pig and, after loading, add one day to each timestamp value. How will you achieve this?

Answer: In short, the timestamp strings can be converted with Pig's ToDate function and one day added with AddDuration (using an ISO-8601 duration such as 'P1D'). For a full walkthrough please read:

https://bigdataprogrammers.com/load-timestamp-values-file-pig/


 

Q 28. Suppose you have a Spark dataframe which contains millions of records and you need to perform multiple actions on it. How will you minimise the execution time?

Answer: You can use cache or persist. For example, if you have a dataframe df and use df1 = df.cache(), then df1 is kept in storage once it is materialised. After that, multiple actions can reuse it; only the first action takes longer than the others, because that first action is what actually caches the data. You can check the storage size of df1 in the Spark application UI.

You can pass a different storage level to persist(). Common storage levels are:

MEMORY_ONLY

DISK_ONLY

MEMORY_AND_DISK

MEMORY_ONLY_SER

MEMORY_AND_DISK_SER
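A short sketch of persisting with an explicit storage level:

import org.apache.spark.storage.StorageLevel

val df1 = df.persist(StorageLevel.MEMORY_AND_DISK)   // keep in memory, spill to disk if it does not fit
df1.count()   // first action materialises the cache
df1.show()    // subsequent actions reuse the cached data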

Q 29. What happens to a Spark application which you have run using spark-submit when you press Ctrl+C? Consider two cases:

A. If the deploy mode is client

B. If the deploy mode is cluster

Answer: When submitted in client mode, the driver runs on the machine from which you submitted the Spark application, so pressing Ctrl+C kills the whole application because the driver's execution is killed. In cluster mode, by contrast, the driver runs on one of the cluster's worker nodes, so even if you press Ctrl+C the application keeps running.

Q 30. How would you load the data of a Hive table into a Spark dataframe?

Answer: You can use spark.table("table name") to get the dataframe.


Q 31. How would you name the Spark application in order to track it?

Answer: You can use the --name "appName" parameter while submitting the application with spark-submit.

Q 32. If your cluster has limited resources and there are many applications which need to run, how would you ensure that your Spark application takes a fixed amount of resources and hence does not impact the execution of other applications?

Answer: While submitting the Spark application, pass these two parameters:

--num-executors 10

--conf spark.dynamicAllocation.enabled=false

Note: you can change the number of executors if you need.

Q 33. How would you limit the number of records (say 1000) in a Spark dataframe?

Answer: You can use the df.limit(1000) function to limit the number of rows.

Q 34. Give an example to describe map and flatMap on an RDD.

Answer: Let's say below is an RDD of strings:

scala> val rdd = sc.parallelize(Seq("java python scala", "sql C C++ Kotlin"))

If you use flatMap with an inline function that splits on spaces, the output is a flattened Array[String], and the output count may or may not match the input count:

scala> rdd.flatMap(x => x.split(" ")).collect()
res3: Array[String] = Array(java, python, scala, sql, C, C++, Kotlin)

But if you use map with the same split function, each element produces an array, resulting in Array[Array[String]], and the count stays the same as the count of the rdd:

scala> rdd.map(x => x.split(" ")).collect()
res4: Array[Array[String]] = Array(Array(java, python, scala), Array(sql, C, C++, Kotlin))

 

Q 35. How will you convert an RDD to a dataframe?

Answer: Using the function below, where myrdd is an RDD of Row objects and schema is the corresponding StructType:

var mydf = spark.createDataFrame(myrdd, schema)
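A self-contained sketch (the column names id and name are illustrative):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val myrdd = spark.sparkContext.parallelize(Seq(Row(1, "a"), Row(2, "b")))
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("name", StringType, nullable = true)
))
var mydf = spark.createDataFrame(myrdd, schema)
mydf.show()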

