Spark Scenario based Interview Questions

1. There is Scala code written in a file myApp.scala. Is it possible to run the complete code in the Spark shell without manually copying the code?

Answer : Yes, it is possible to run it without copying. We just need to put the file in the directory from which we started the Spark shell, and then use the command below in the Spark shell:

 :load myApp.scala

You can give the complete path if the file is located somewhere else. This is useful when we are testing our application code before building a jar.

2. You have a dataframe mydf which has three columns a1, a2, a3, but column a2 is required to have the new name b2. How would you do it?

Answer : Spark dataframes have a function to rename a column, withColumnRenamed(). It takes two arguments: the first is the name of the existing column and the second is the new name. It returns a new dataframe and leaves mydf unchanged.

So the syntax would be:

 val newdf = mydf.withColumnRenamed("a2", "b2")
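A fuller sketch of the rename is given below. It assumes a local SparkSession and a made-up one-row mydf, neither of which appears in the question. It also shows that the original dataframe keeps its column names.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("rename-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data with the three columns from the question.
val mydf = Seq((1, 2, 3)).toDF("a1", "a2", "a3")

// Returns a new dataframe; mydf itself is not modified.
val newdf = mydf.withColumnRenamed("a2", "b2")

println(newdf.columns.mkString(", "))
println(mydf.columns.mkString(", "))
```

If the existing column name is not present in the schema, withColumnRenamed does nothing and raises no error, so a typo in the first argument can go unnoticed.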

3. Suppose you have two dataframes df1 and df2, with the columns below:

df1 => id, name, mobno

df2 => id, pincode, address, city

After joining the two dataframes on the key id, selecting id, name, mobno, pincode, address, city gives an error saying that column id is ambiguous. How would you resolve it?

Answer : Which id column to keep depends on the type of join we are performing.

If it is an inner join, the ids of df1 and df2 have the same values, so we can drop either one before selecting:

 val joined_df = df1.join(df2, df1("id") === df2("id")).drop(df2("id"))

or

 val joined_df = df1.join(df2, df1("id") === df2("id")).drop(df1("id"))

If it is a left join, we drop the id that can have null values, i.e. df2("id"):

 val joined_df = df1.join(df2, df1("id") === df2("id"), "left").drop(df2("id"))

If it is a right join, we drop the id that can have null values, i.e. df1("id"):

 val joined_df = df1.join(df2, df1("id") === df2("id"), "right").drop(df1("id"))

If it is a full join, either id can be null, so we can rename both ids, df1("id") and df2("id"), and use them as needed.
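The full join case can be sketched as follows. The sample rows, the intermediate names id1 and id2, and the use of coalesce to merge the two ids are assumptions for illustration, not part of the original answer. The second option joins on the column name instead, in which case Spark keeps a single id column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.coalesce

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("join-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data matching the column lists in the question.
val df1 = Seq((1, "Asha", "9000000001"), (2, "Ravi", "9000000002"))
  .toDF("id", "name", "mobno")
val df2 = Seq((2, "560001", "MG Road", "Bengaluru"), (3, "110001", "CP", "Delhi"))
  .toDF("id", "pincode", "address", "city")

// Option 1: rename each id before the full join, then merge them.
// coalesce picks the first non-null value, so unmatched rows still get an id.
val left = df1.withColumnRenamed("id", "id1")
val right = df2.withColumnRenamed("id", "id2")
val fullDf = left.join(right, $"id1" === $"id2", "full")
  .withColumn("id", coalesce($"id1", $"id2"))
  .drop("id1", "id2")
  .select("id", "name", "mobno", "pincode", "address", "city")

// Option 2: join on the column name; the result has only one id column.
val usingDf = df1.join(df2, Seq("id"), "full")
```

Joining with Seq("id") also avoids the ambiguity for inner, left and right joins, so it is often the shortest fix when the key has the same name on both sides.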

4. You have a list of columns which you need to select from a dataframe. The list gets updated every time you run the application, but the base dataframe (say bsdf) remains the same. How would you select only the columns that are in the given list for that run?

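Answer : Let's say the list is mycols, which holds all the required columns as Column objects. We can use the command below:

 val newdf = bsdf.select(mycols: _*)

Here newdf will have a different schema in every run, depending on mycols. If mycols holds the column names as strings instead, convert them first with mycols.map(col), where col comes from org.apache.spark.sql.functions.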
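As a minimal sketch, assuming mycols arrives as a list of column names (strings) and using a made-up bsdf, the selection could look like this. Both forms below are standard select overloads; the sample data and the contents of mycols are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("select-demo").getOrCreate()
import spark.implicits._

// Hypothetical base dataframe.
val bsdf = Seq((1, "x", 10.0), (2, "y", 20.0)).toDF("a1", "a2", "a3")

// Hypothetical list for this run; in practice it might come from config or arguments.
val mycols: List[String] = List("a1", "a3")

// select(cols: Column*) needs Column objects, so map each name through col.
val newdf = bsdf.select(mycols.map(col): _*)

// Equivalent with the select(col: String, cols: String*) overload.
// This form needs a non-empty list because it calls head.
val newdf2 = bsdf.select(mycols.head, mycols.tail: _*)
```

If the list may contain names that are not in bsdf, filtering it against bsdf.columns before the select avoids an analysis error at run time.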

5. You have a dataframe df1 and a list of qualified cities where you need to run offers, but df1 has all the cities where your business is running. How would you get the records only for the qualified cities?

Answer : We can use the filter function with isin. If a record's city is present in the qualified list, the record is kept; otherwise it is dropped.

 val qualified_records = df1.filter($"city".isin(qualified_cities: _*))
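The filter can be tried end to end with the sketch below. The order_id column, the sample rows and the contents of qualified_cities are made up for illustration. Only the city column and the isin call come from the answer.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("isin-demo").getOrCreate()
import spark.implicits._

// Hypothetical data: one row per record, with the city it belongs to.
val df1 = Seq(("o1", "Pune"), ("o2", "Delhi"), ("o3", "Mumbai")).toDF("order_id", "city")

// Hypothetical qualified list for this run.
val qualified_cities = List("Pune", "Mumbai")

// isin takes varargs, so the list is expanded with : _*
val qualified_records = df1.filter($"city".isin(qualified_cities: _*))

qualified_records.show()
```

For a very long list of cities, putting the list into its own dataframe and joining on city may be a better fit than isin, though that depends on the data sizes involved.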
