Spark Scenario based Interview Questions with Answers – 2

Q.1 There is a JSON file with the following content:

{"dept_id":101,"e_id":[10101,10102,10103]}

{"dept_id":102,"e_id":[10201,10202]}

The data is loaded into a Spark DataFrame, say mydf, with the following dtypes:

dept_id: bigint, e_id: array<bigint>

What is the best way to get each e_id individually along with its dept_id?

Answer :

We can use the explode function, which produces one output row per item in the e_id array.

The code would be:

mydf.withColumn("e_id", explode($"e_id"))

Here the new column reuses the old column name, so the dtypes of the resulting DataFrame will be

dept_id: bigint, e_id: bigint

So output would look like

+-------+-----+
|dept_id| e_id|
+-------+-----+
|    101|10101|
|    101|10102|
|    101|10103|
|    102|10201|
|    102|10202|
+-------+-----+
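The same flattening can be sketched in plain Scala without a Spark session: flatMap plays the role of explode, turning each (dept_id, array) row into one tuple per array element. The names rows and exploded are illustrative.

```scala
// Plain-Scala sketch of what explode does to each row:
// one output tuple per element of the e_id array.
val rows = Seq(
  (101L, Seq(10101L, 10102L, 10103L)),
  (102L, Seq(10201L, 10202L))
)

// flatMap stands in for explode here
val exploded = rows.flatMap { case (deptId, eIds) =>
  eIds.map(eId => (deptId, eId))
}

exploded.foreach(println)
// (101,10101)
// (101,10102)
// (101,10103)
// (102,10201)
// (102,10202)
```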

 

Q2. How many columns will be present in df2, if df1 has three columns a1, a2, a3?

var df2 = df1.withColumn("b1", lit("a1")).withColumn("a1", lit("a2")).withColumn("a2", $"a2").withColumn("b2", $"a3").withColumn("a3", lit("b1"))

Answer :

Five in total. withColumn replaces a column when the name already exists and appends a new one otherwise:

df1 // a1, a2, a3

.withColumn("b1", lit("a1")) // a1, a2, a3, b1

.withColumn("a1", lit("a2")) // a1, a2, a3, b1 (a1 replaced)

.withColumn("a2", $"a2") // a1, a2, a3, b1 (a2 replaced)

.withColumn("b2", $"a3") // a1, a2, a3, b1, b2

.withColumn("a3", lit("b1")) // a1, a2, a3, b1, b2 (a3 replaced)
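As a sanity check, withColumn's replace-or-append rule can be simulated on just the column names in plain Scala. The helper name withColumnNames is illustrative, not a Spark API:

```scala
// Toy model of DataFrame.withColumn's effect on the column list:
// an existing name is replaced in place, a new name is appended.
def withColumnNames(cols: Vector[String], name: String): Vector[String] =
  if (cols.contains(name)) cols else cols :+ name

val start = Vector("a1", "a2", "a3")
val result = Seq("b1", "a1", "a2", "b2", "a3").foldLeft(start)(withColumnNames)

println(result)        // Vector(a1, a2, a3, b1, b2)
println(result.length) // 5
```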

 

Q3. How to get an RDD along with its element indices?

Say myrdd = (a1, b1, c1, s2, s5)

The output should be

((a1,0), (b1,1), (c1,2), (s2,3), (s5,4))

Answer : 

We can use the zipWithIndex function:

var myrdd_windx = myrdd.zipWithIndex()
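RDD.zipWithIndex mirrors the zipWithIndex available on ordinary Scala collections, so the shape of the result can be tried without a Spark cluster (note that the RDD version produces Long indices):

```scala
// Plain Scala collection version of zipWithIndex;
// RDD.zipWithIndex pairs elements the same way.
val myseq = Seq("a1", "b1", "c1", "s2", "s5")
val withIdx = myseq.zipWithIndex

println(withIdx) // List((a1,0), (b1,1), (c1,2), (s2,3), (s5,4))
```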
