1. There is some Scala code written in a file myApp.scala. Is it possible to run the complete code in spark-shell without manually copying the code?
Answer: Yes, it is possible to run it without copying. We just need to put the file in the directory from which we started spark-shell, and inside the shell run the command below:
:load myApp.scala
You can give the complete path if the file is present somewhere else. This is useful when we are testing application code before packaging it into a jar.
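For instance, if myApp.scala held a small test snippet like the one below (the file contents here are purely hypothetical), :load would execute every line exactly as if it had been typed at the prompt:

// myApp.scala -- hypothetical contents for illustration
// spark and spark.implicits._ are already in scope inside spark-shell
val df = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
df.show()

// then, from a spark-shell started in the file's directory:
// scala> :load myApp.scala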
2. You have a dataframe mydf with three columns a1, a2, a3, but it is required that column a2 get the new name b2. How would you do it?
Answer: Spark dataframes have a function for renaming a column: withColumnRenamed. It takes two arguments: the first is the existing column name and the second is the new column name.
So the syntax would be:
val newdf = mydf.withColumnRenamed("a2", "b2")
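A minimal end-to-end sketch you can paste into spark-shell (the sample rows are invented):

// build a small dataframe with columns a1, a2, a3
// (spark.implicits._ is pre-imported in spark-shell)
val mydf = Seq((1, "x", 10), (2, "y", 20)).toDF("a1", "a2", "a3")
val newdf = mydf.withColumnRenamed("a2", "b2")
newdf.printSchema()   // columns are now a1, b2, a3

Note that withColumnRenamed is a no-op when the given column does not exist: it returns the dataframe unchanged rather than failing.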
3. Suppose you have two dataframes df1 and df2, with the columns below:
df1 => id, name, mobno
df2 => id, pincode, address, city
After joining the two dataframes on the key id, you get an "ambiguous column id" error while selecting id, name, mobno, pincode, address, city. How would you resolve it?
Answer: Which id column to keep depends on the type of join being performed (a runnable sketch follows this list).
- If it is an inner join, the ids of df1 and df2 carry the same values in every surviving row, so we can drop either one before selecting:
val joined_df = df1.join(df2, df1("id") === df2("id")).drop(df2("id"))
OR
val joined_df = df1.join(df2, df1("id") === df2("id")).drop(df1("id"))
- If it is a left join, df2's id is null for unmatched rows, so we drop that side's id:
val joined_df = df1.join(df2, df1("id") === df2("id"), "left").drop(df2("id"))
- If it is a right join, df1's id is null for unmatched rows, so we drop that side's id:
val joined_df = df1.join(df2, df1("id") === df2("id"), "right").drop(df1("id"))
- If it is a full join, both ids can be null, so we rename df1("id") and df2("id") to distinct names and use whichever suits the need.
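A short sketch of the inner-join and full-join cases, with invented sample rows:

// spark.implicits._ is pre-imported in spark-shell
val df1 = Seq((1, "alice", "9999999999"), (2, "bob", "8888888888"))
  .toDF("id", "name", "mobno")
val df2 = Seq((1, "110001", "street 1", "delhi"))
  .toDF("id", "pincode", "address", "city")

// inner join: surviving rows have matching ids, so either copy can go
val joined_df = df1.join(df2, df1("id") === df2("id")).drop(df2("id"))
joined_df.select("id", "name", "mobno", "pincode", "address", "city").show()

// full join: rename both ids first so neither side's nulls are lost
val full_df = df1.withColumnRenamed("id", "df1_id")
  .join(df2.withColumnRenamed("id", "df2_id"), $"df1_id" === $"df2_id", "full")

When the key has the same name on both sides, an equi-join on the column name, df1.join(df2, Seq("id")), also sidesteps the ambiguity, because Spark keeps a single id column in the result.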
4. You have a list of columns that you need to select from a dataframe. The list changes on every run of the application, but the base dataframe (say bsdf) remains the same. How would you select only the columns present in the list for that run?
Answer: Let's say the list mycols holds all the required columns. If mycols is a Seq[Column], we can pass it directly through varargs expansion:
val newdf = bsdf.select(mycols: _*)
If the list holds the column names as strings instead, map them through col first: bsdf.select(mycols.map(col): _*). Either way, newdf will have a different schema on every run, depending on mycols.
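A sketch assuming mycols arrives as a plain list of column-name strings (the column names are made up):

import org.apache.spark.sql.functions.col   // pre-imported in spark-shell

val bsdf = Seq((1, "a", true)).toDF("c1", "c2", "c3")
val mycols = List("c1", "c3")        // imagine this changes on every run
// turn the name strings into Columns, then expand through varargs
val newdf = bsdf.select(mycols.map(col): _*)
newdf.printSchema()                  // only c1 and c3 remain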
5. You have a dataframe df1 and a list of qualified cities where you need to run offers, but df1 has records for all the cities where your business operates. How would you get the records only for the qualified cities?
Answer: We can use the filter function: a record is kept if its city is present in the qualified list, and dropped otherwise.
val qualified_records = df1.filter($"city".isin(qualified_cities: _*))
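A runnable sketch with invented sample cities:

// spark.implicits._ is pre-imported in spark-shell
val df1 = Seq(("alice", "delhi"), ("bob", "pune"), ("carol", "goa"))
  .toDF("name", "city")
val qualified_cities = List("delhi", "pune")
// isin takes varargs, so expand the list with : _*
val qualified_records = df1.filter($"city".isin(qualified_cities: _*))
qualified_records.show()   // only the delhi and pune rows remain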