1. There is a CSV file in an HDFS location that has a header. Which property needs to be set while reading it into Spark? Answer: While reading it into a DataFrame we need to set the header option to true, like below: val df1 = spark.read.option("header", true).csv("path")
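A slightly fuller sketch of the same read, assuming a `SparkSession` named `spark` and a hypothetical HDFS path; `inferSchema` is optional but commonly paired with `header`:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("csv-header-read")
  .master("local[*]") // local run, for illustration only
  .getOrCreate()

// header=true tells Spark to use the first line as column names;
// inferSchema=true asks Spark to guess column types (costs an extra pass)
val df1 = spark.read
  .option("header", true)
  .option("inferSchema", true)
  .csv("hdfs:///data/input/sample.csv") // hypothetical path

df1.printSchema()
```

Without `header=true`, the header row would be read as a data row and columns would be named `_c0`, `_c1`, and so on.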
Q 1. What is the retention period of a Kafka topic? Answer: Kafka retains data only for the retention period; events or messages older than that get deleted and are no longer available for consumers to consume. Let's say the retention period is 7 days; then any message older than 7 days is removed.
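As a sketch, retention is controlled per topic by the `retention.ms` config (brokers also have a default such as `log.retention.hours`); a hypothetical command setting a 7-day retention on a topic named `events`:

```shell
# 7 days = 604800000 ms; topic name and broker address are illustrative
kafka-configs.sh --bootstrap-server localhost:9092 \
  --alter --entity-type topics --entity-name events \
  --add-config retention.ms=604800000
```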
Q.1 There is a JSON file with the following content: {"dept_id":101,"e_id":[10101,10102,10103]} {"dept_id":102,"e_id":[10201,10202]} The data is loaded into a Spark DataFrame, say mydf, with the dtypes dept_id: bigint, e_id: array<bigint>. What is the best way to get each e_id individually along with its dept_id? Answer: We can use the explode function, which produces one output row per element of the array.
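A minimal sketch of the explode step, assuming `mydf` has been loaded as described and a `SparkSession` named `spark` is in scope:

```scala
import org.apache.spark.sql.functions.explode
import spark.implicits._ // enables the $"column" syntax

// mydf schema: dept_id bigint, e_id array<bigint>
// explode emits one row per array element, paired with its dept_id
val exploded = mydf.select($"dept_id", explode($"e_id").as("e_id"))
exploded.show()
// dept_id 101 yields three rows (10101, 10102, 10103); 102 yields two
```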
Q1: What is the use of query.awaitTermination() in Structured Streaming? Answer: In batch processing we generally load all the data and store the result at once, but in real-time streaming we get data in micro-batches, mostly based on the trigger processing time, hence the streaming query should keep running; awaitTermination() blocks the driver until the query is stopped or fails, so the application does not exit while the stream is active.
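A minimal Structured Streaming sketch showing where the call sits; the socket source, host, and port are illustrative:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("stream-demo").getOrCreate()

// hypothetical socket source producing lines of text
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val query = lines.writeStream
  .format("console")
  .start()

// without this, main() would return and the driver could exit
// while micro-batches are still being triggered
query.awaitTermination()
```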
There is Scala code written in a file myApp.scala. Is it possible to run the complete code in the spark shell without manually copying the code? Answer: Yes, it is possible to run it without copying; we just need to put the file in a directory from which the shell can access it and load it with the :load command (or pass it at startup with spark-shell -i).
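Both approaches, sketched with a hypothetical file path:

```shell
# run the file automatically when the shell starts
spark-shell -i /path/to/myApp.scala

# or, from inside an already-running spark-shell session:
#   :load /path/to/myApp.scala
```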
1. There are 50 columns in a Spark DataFrame, say df, and all of them need to be cast to string. To keep the code generic, it is not recommended to cast individual columns by writing out each column name. How would you achieve this in Spark using Scala? Answer: As the code must stay generic, we can iterate over df.columns and build the cast expressions programmatically.
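A sketch of the generic cast, assuming an existing DataFrame `df`; no column name is hard-coded, so it works for 50 columns or 500:

```scala
import org.apache.spark.sql.functions.col

// map every column name to a cast expression, then select them all
val stringDf = df.select(df.columns.map(c => col(c).cast("string")): _*)

stringDf.printSchema() // every column now reports type string
```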
1. Let’s say a Hive table is created as an external table. If we drop the table, will the data be accessible? Answer: The data will be accessible even if the table gets dropped; we can get the data from the table’s HDFS location. 2. A Hive table is created…
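A sketch of the external-table behavior, run through spark.sql (it works the same from the Hive CLI); the table name, columns, and HDFS location are illustrative:

```scala
// dropping an EXTERNAL table removes only the metadata;
// the files under LOCATION stay in HDFS
spark.sql("""
  CREATE EXTERNAL TABLE IF NOT EXISTS emp (id INT, name STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  LOCATION 'hdfs:///data/emp'
""")

spark.sql("DROP TABLE emp")

// the underlying files are still readable directly from HDFS
val recovered = spark.read.csv("hdfs:///data/emp")
```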
What is bucketing and what is the use of it? Answer: Bucketing is an optimization technique that clusters a dataset into more manageable parts, which helps to optimize query performance. Check the post below for an example: https://bigdataprogrammers.com/bucketing-in-hive/ What is over-partitioning and what is the…
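A bucketed-table sketch via spark.sql; the table name, columns, bucket column, and bucket count are illustrative. Rows are hashed on the bucket column into a fixed number of files, which helps joins and sampling on that column:

```scala
// hash user_id into 8 buckets; equi-joins on user_id can then
// match bucket-to-bucket instead of shuffling the whole table
spark.sql("""
  CREATE TABLE IF NOT EXISTS orders_bucketed (order_id BIGINT, user_id BIGINT)
  CLUSTERED BY (user_id) INTO 8 BUCKETS
  STORED AS ORC
""")

// the DataFrame writer offers the same idea for an existing df:
// df.write.bucketBy(8, "user_id").saveAsTable("orders_bucketed")
```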
Suppose you have a Spark DataFrame that contains millions of records and you need to perform multiple actions on it. How will you minimize the execution time? Answer: You can use cache or persist. For example, say you have a DataFrame df; if you use df1 = df.cache(), then df1 will be cached in memory on the first action, and subsequent actions reuse the cached data instead of recomputing it.
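A small sketch of the pattern, assuming an existing DataFrame `df`:

```scala
val df1 = df.cache()    // marks df for caching; nothing is computed yet

df1.count()             // first action materializes the cache in memory
df1.distinct().count()  // reuses the cached partitions, no recompute of df

df1.unpersist()         // release the memory when finished
```

`persist()` is the general form: `cache()` is shorthand for `persist(StorageLevel.MEMORY_AND_DISK)` on a DataFrame, and other storage levels trade memory for recomputation or disk.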
What is Hive and why is it useful? Hive is a data warehouse application where data gets stored in a structured format. It is used for querying and managing large datasets, and it provides a SQL-like interface to access the data, called HiveQL (HQL). What is the advantage of…
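HiveQL reads like standard SQL; a hypothetical aggregate over a `sales` table, run here through spark.sql (the same statement works in the hive CLI or beeline):

```scala
// assumes a Hive table named sales with dept and amount columns
spark.sql("""
  SELECT dept, COUNT(*) AS cnt, SUM(amount) AS total
  FROM sales
  GROUP BY dept
""").show()
```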