1. Suppose you have a Spark DataFrame that contains millions of records, and you need to perform multiple actions on it. How will you minimize the execution time?
Answer: You can use cache or persist. For example, if you have a DataFrame df and call df1 = df.cache(), df1 is marked for caching, and multiple actions can then be performed on it. Only the first action takes longer than the others, because that is when the data is actually computed and cached; later actions read from the cache. You can check the storage size of df1 on the Storage tab of the Spark application UI.
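You can pass a different storage level to persist. For DataFrames, cache() is shorthand for persist() with the default level, MEMORY_AND_DISK.
The storage levels include: MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, and OFF_HEAP, along with replicated variants such as MEMORY_ONLY_2 and MEMORY_AND_DISK_2.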
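As a minimal sketch of the caching pattern from this answer, the snippet below persists a DataFrame with an explicit storage level and then runs two actions on it. The local session, the app name, and the synthetic id column are assumptions made for illustration; the code can be pasted into spark-shell in :paste mode, where getOrCreate() simply reuses the existing session.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// Assumption: a local session for experimentation.
val spark = SparkSession.builder()
  .appName("cache-sketch")
  .master("local[*]")
  .getOrCreate()

// Synthetic stand-in for a DataFrame with many records.
val df = spark.range(0L, 1000000L).toDF("id")

// persist() only marks the DataFrame for caching; nothing is stored yet.
val cached = df.persist(StorageLevel.MEMORY_AND_DISK)

// The first action computes the data and fills the cache.
val total = cached.count()

// Later actions can reuse the cached data instead of recomputing it.
val evens = cached.filter("id % 2 = 0").count()

println(s"total=$total, evens=$evens") // total=1000000, evens=500000

// Release the cached blocks once they are no longer needed.
cached.unpersist()
```

Calling unpersist() at the end is optional, but it frees storage for other jobs when the cached data is no longer useful.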
2. What happens to a Spark application that you launched with spark-submit if you then press Ctrl+C? Consider two cases:
A. Deploy mode is client.
B. Deploy mode is cluster.
Answer: In client mode, the driver runs on the machine from which you submitted the Spark application. Pressing Ctrl+C therefore kills the whole application, because the driver process is killed. In cluster mode, the driver runs on one of the cluster’s worker nodes, so the application keeps running even if you press Ctrl+C; only the local spark-submit process stops.
3. How would you load the data of a Hive table into a Spark DataFrame?
Answer: You can use spark.table("table name") to get the DataFrame, provided the SparkSession was created with Hive support enabled.
Note: If you want to practice this, you can refer to this post.
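The following is a small sketch of spark.table in action. Because a real Hive metastore may not be available, it registers a temporary view under the hypothetical name demo_languages and reads it back by name; the data, the column names, and the local session are all assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session without Hive. For real Hive tables, add
// .enableHiveSupport() to the builder (this needs Hive classes on the classpath).
val spark = SparkSession.builder()
  .appName("table-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical data registered under a hypothetical name, standing in for a Hive table.
Seq((1, "java"), (2, "scala")).toDF("id", "language")
  .createOrReplaceTempView("demo_languages")

// spark.table resolves the name through the session catalog and returns a DataFrame.
val tableDf = spark.table("demo_languages")
tableDf.printSchema()
tableDf.show()
```

With Hive support enabled, the same call accepts a qualified name of the form database.table.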
4. How would you name a Spark application so that you can track it?
Answer: Pass the --name "appName" parameter when submitting the application with spark-submit.
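The name can also be set inside the application itself. The sketch below assumes a standalone application (in spark-shell the session already exists, so the builder's name would not take effect); the name daily-sales-report is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a standalone application; the name "daily-sales-report" is hypothetical.
val spark = SparkSession.builder()
  .appName("daily-sales-report")
  .master("local[*]")
  .getOrCreate()

// The name and the application ID are what you would search for in the
// Spark UI or in the resource manager.
val appName = spark.sparkContext.appName
val appId = spark.sparkContext.applicationId
println(s"name=$appName, id=$appId")
```

Logging the application ID at startup makes it easier to find the application later.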
5. If your cluster has limited resources and many applications need to run, how would you ensure that your Spark application takes a fixed amount of resources and so does not affect the execution of other applications?
Answer: When submitting the Spark application, pass these two parameters:
--conf spark.dynamicAllocation.enabled=false
--num-executors <N>
Note: You can change the number of executors if you need to.
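For reference, the same idea can be sketched in Scala through the session builder. The property spark.executor.instances is the configuration counterpart of a fixed executor count; the value 4 and the local master are assumptions for illustration only, and executor settings have no real effect in local mode.

```scala
import org.apache.spark.sql.SparkSession

// Assumptions: the executor count is illustrative, and local[*] is used only
// so that the sketch runs anywhere; on a cluster the master comes from spark-submit.
val spark = SparkSession.builder()
  .appName("fixed-resources-sketch")
  .master("local[*]")
  .config("spark.dynamicAllocation.enabled", "false") // no scaling up or down
  .config("spark.executor.instances", "4")            // fixed number of executors
  .getOrCreate()

// Read the effective settings back from the application's SparkConf.
val conf = spark.sparkContext.getConf
println(conf.get("spark.dynamicAllocation.enabled"))
println(conf.get("spark.executor.instances"))
```

These settings must be in place before the SparkContext starts, which is why they go on the builder rather than being set afterwards.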
6. How would you limit the number of records (say, to 1000) in a Spark DataFrame?
Answer: Use df.limit(1000), which returns a new DataFrame with at most 1000 rows.
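A short sketch of limit on a synthetic DataFrame follows; the session setup and the 5000-row input are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder()
  .appName("limit-sketch")
  .master("local[*]")
  .getOrCreate()

// Synthetic DataFrame with 5000 rows.
val df = spark.range(0L, 5000L).toDF("id")

// limit is a transformation: it returns a new DataFrame and runs nothing by itself.
val limited = df.limit(1000)

println(limited.count()) // 1000
```

Without an explicit orderBy, which 1000 rows are kept is not guaranteed. Unlike take or head, limit does not bring the rows to the driver.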
7. Give an example that illustrates map and flatMap on an RDD.
Answer: Say we have the following RDD[String]:
- scala> val rdd = sc.parallelize(Seq("java python scala", "sql C C++ Kotlin "))
If you use flatMap with an inline function that splits on spaces, the per-element arrays are flattened into a single collection of strings, so the number of output elements may or may not match the number of input elements.
- scala> rdd.flatMap(x => x.split(" ")).collect()
- res3: Array[String] = Array(java, python, scala, sql, C, C++, Kotlin)
If you instead use map with the same inline function, each input element produces one array, giving an Array[Array[String]] whose element count is the same as that of rdd.
- scala> rdd.map(x => x.split(" ")).collect()
- res4: Array[Array[String]] = Array(Array(java, python, scala), Array(sql, C, C++, Kotlin))
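To make the difference in counts concrete, the sketch below repeats the example as a self-contained snippet and compares element counts. The session setup is an assumption; the input strings are the ones used above.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder()
  .appName("map-flatmap-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val rdd = sc.parallelize(Seq("java python scala", "sql C C++ Kotlin "))

// flatMap flattens the per-line arrays into one RDD[String].
val words = rdd.flatMap(x => x.split(" "))

// map keeps one array per input line, giving RDD[Array[String]].
val arrays = rdd.map(x => x.split(" "))

println(rdd.count())    // 2
println(words.count())  // 7
println(arrays.count()) // 2
```

The trailing space in the second string adds no extra element, because split drops trailing empty strings.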
8. How would you kill a Spark application that is running?
Answer: Find the application by name and get its application ID (on YARN, for example, from yarn application -list), then use the command below:
- yarn application -kill appID
9. How would you check whether an RDD is empty without using collect?
Answer: Use rdd.isEmpty(), which returns true if the RDD contains no elements.
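A minimal sketch of isEmpty on an empty and a non-empty RDD follows; the session setup and the sample values are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation.
val spark = SparkSession.builder()
  .appName("isempty-sketch")
  .master("local[*]")
  .getOrCreate()
val sc = spark.sparkContext

val emptyRdd = sc.emptyRDD[String]
val nonEmptyRdd = sc.parallelize(Seq("java", "scala"))

// isEmpty avoids pulling the whole dataset to the driver, unlike collect.
val emptyResult = emptyRdd.isEmpty()
val nonEmptyResult = nonEmptyRdd.isEmpty()

println(emptyResult)    // true
println(nonEmptyResult) // false
```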
10. How will you convert an RDD to a DataFrame?
Answer: Use the function below, where myrdd is an RDD[Row] and schema is a StructType describing its columns:
- val mydf = spark.createDataFrame(myrdd, schema)
You can refer to this post.
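As a sketch of how the pieces fit together, the snippet below builds an RDD[Row] and a matching schema, then converts them to a DataFrame. The column names, sample rows, and local session are assumptions for illustration.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Assumption: a local session for experimentation.
val spark = SparkSession.builder()
  .appName("rdd-to-df-sketch")
  .master("local[*]")
  .getOrCreate()

// Hypothetical rows; each Row must match the schema in order and type.
val myrdd = spark.sparkContext.parallelize(Seq(Row(1, "java"), Row(2, "scala")))

// The schema names and types the columns of the resulting DataFrame.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("language", StringType, nullable = true)
))

val mydf = spark.createDataFrame(myrdd, schema)
mydf.printSchema()
mydf.show()
```

For an RDD of tuples or case classes, importing spark.implicits._ and calling rdd.toDF("id", "language") is a shorter alternative that infers the column types.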