Hive Most Asked Interview Questions With Answers – Part II

  1. What is bucketing and what is the use of it?

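Answer: Bucketing is an optimisation technique that splits the data of a table into a fixed number of more manageable parts (buckets), based on the hash of a chosen column. Because rows with the same column value always land in the same bucket, it helps to optimise query performance, for example for joins and sampling.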

See the post below for an example.

http://bigdataprogrammers.com/bucketing-in-hive/
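As a rough illustration of the idea, the sketch below assigns rows to a fixed number of buckets by hashing the bucketing column and taking the remainder. The column name `user_id`, the bucket count of 4, and the use of the integer value itself as its hash are assumptions made for this example; it is not a reproduction of Hive's internal code.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumed value; the bucket count is fixed when the table is created


def bucket_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Pick a bucket as hash(column) mod number_of_buckets.

    Assumption for this sketch: the hash of an integer is the integer itself.
    The mask keeps the hash non-negative before taking the remainder.
    """
    return (user_id & 0x7FFFFFFF) % num_buckets


# Hypothetical rows: (user_id, value)
rows = [(101, "a"), (102, "b"), (105, "c"), (108, "d"), (109, "e")]

buckets = defaultdict(list)
for user_id, value in rows:
    buckets[bucket_for(user_id)].append((user_id, value))

for bucket_id in sorted(buckets):
    print(bucket_id, buckets[bucket_id])
```

The same `user_id` always maps to the same bucket, which is why two tables bucketed on the same column can be joined bucket by bucket.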

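  1. What is over-partitioning and what is the solution to overcome it?

Answer: Over-partitioning happens when a table is partitioned on a column that contains unique (or nearly unique) values, so a new partition gets created for almost every record. The table ends up with a very large number of tiny partitions. To overcome it, partition on a column with few distinct values and use bucketing for the column with unique values.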
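The effect can be seen by counting how many partitions a table would get for different partition columns. This is a minimal sketch over made-up records; the column names `order_id` and `country` are hypothetical.

```python
# Hypothetical records: order_id is unique per row, country has few distinct values.
records = [
    {"order_id": 1, "country": "IN"},
    {"order_id": 2, "country": "US"},
    {"order_id": 3, "country": "IN"},
    {"order_id": 4, "country": "UK"},
    {"order_id": 5, "country": "US"},
    {"order_id": 6, "country": "IN"},
]


def partition_count(rows, column):
    """One partition is created per distinct value of the partition column."""
    return len({row[column] for row in rows})


print("partitions by order_id:", partition_count(records, "order_id"))
print("partitions by country:", partition_count(records, "country"))
```

Partitioning on `order_id` produces one partition per record, while partitioning on `country` keeps the partition count small.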

  1. What is the difference between Partition and Bucketing?

Answer:

| Partition | Bucketing |
| --- | --- |
| Distributes the data horizontally for better performance. | Decomposes the data into more manageable parts. |
| The number of partitions in a table is not fixed. | The number of buckets is fixed for the table. |
| Not suitable if the partition column holds unique values. | Well suited to a column that holds unique values. |
| Data can be loaded statically as well as dynamically. | Data is only assigned to buckets dynamically. |
| Execution is faster with a low volume of data. | Map-side joins are faster on bucketed tables. |
| Cannot store the data in sorted order on the partition column(s) within each partition. | Can store the data in sorted order on the chosen column(s) within each bucket. |
| Partitioned data can be bucketed. | Bucketed data cannot be partitioned. |

  1. What are window functions in Hive?

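Answer: Window functions compute a value for each row from a set of related rows (the window), without collapsing the rows the way GROUP BY does. The window functions in Hive are: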

  • LEAD
  • LAG
  • FIRST_VALUE
  • LAST_VALUE

Go through the below link for more details with an example:

http://bigdataprogrammers.com/windowing-functions-in-hive/
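To make the four functions concrete, here is a small sketch that mimics them over a single, already ordered window. The salary values are made up, and the sketch assumes the window frame spans the whole partition (unbounded preceding to unbounded following), which is what makes LAST_VALUE return the final row.

```python
def lag(values, i, offset=1, default=None):
    """Value `offset` rows before row i, or `default` if there is none.

    `offset` is assumed to be non-negative.
    """
    j = i - offset
    return values[j] if j >= 0 else default


def lead(values, i, offset=1, default=None):
    """Value `offset` rows after row i, or `default` if there is none.

    `offset` is assumed to be non-negative.
    """
    j = i + offset
    return values[j] if j < len(values) else default


# Hypothetical window, already sorted by the ORDER BY column.
salaries = [3000, 4000, 4500, 6000]

result = [
    {
        "value": value,
        "lag": lag(salaries, i),
        "lead": lead(salaries, i),
        "first_value": salaries[0],   # first row of the frame
        "last_value": salaries[-1],   # last row of the frame (whole partition assumed)
    }
    for i, value in enumerate(salaries)
]

for row in result:
    print(row)
```

The first row has no previous row and the last row has no next row, so LAG and LEAD fall back to the default there.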

  1. What are analytics functions in Hive?

Answer: The analytics functions in Hive are:

  • ROW_NUMBER
  • RANK
  • DENSE_RANK etc.

Go through the below link for more details with an example:

http://bigdataprogrammers.com/analytics-functions-in-hive/

  1. What is the difference between Row Number, Rank, Dense Rank?

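Answer: ROW_NUMBER assigns a unique sequential number to every row, RANK gives tied rows the same rank and leaves gaps after the ties, and DENSE_RANK gives tied rows the same rank without leaving gaps. Go through the below post for an example: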

http://bigdataprogrammers.com/analytics-functions-in-hive/
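The three numbering schemes only differ when there are ties. The following sketch computes all three over a small, made-up list of scores that is assumed to be sorted already in the window's ORDER BY order.

```python
def rank_rows(scores):
    """Return (score, row_number, rank, dense_rank) for each score.

    `scores` must already be sorted in the desired order.
    """
    out = []
    rank = 0
    dense_rank = 0
    previous = object()  # sentinel that never equals a real score
    for row_number, score in enumerate(scores, start=1):
        if score != previous:
            rank = row_number   # RANK jumps to the current position after ties
            dense_rank += 1     # DENSE_RANK just moves to the next integer
            previous = score
        out.append((score, row_number, rank, dense_rank))
    return out


scores = [90, 85, 85, 70]  # hypothetical values, sorted descending
ranked = rank_rows(scores)

for row in ranked:
    print(row)
```

For the two rows tied at 85, ROW_NUMBER gives 2 and 3, while RANK and DENSE_RANK both give 2. The row after the tie gets RANK 4 but DENSE_RANK 3.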

  1. What are the different storage formats in Hive?

Answer: Hive can store data in different file formats, such as:

  • TextFile
  • RCFile
  • ORC
  • Parquet

  1. What are the advantages of ORC over Parquet?

Answer:

  • ORC keeps block-level indexes (such as minimum and maximum values) for each column. A reader can skip an entire block if the index shows that the values matching the predicate are not present there.
  • The ORC column metadata is used by Cost-Based Optimization (CBO) to generate the most efficient execution graph.
  • Full ACID transactions (UPDATE and DELETE) are only possible when using the ORC storage format.
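The block skipping described in the first point can be modelled with per-block minimum and maximum values. This is a simplified sketch of the idea, not ORC's actual reader; the block contents are made up.

```python
# Hypothetical column data, already split into blocks.
blocks = [
    [1, 5, 9],
    [12, 15, 18],
    [21, 27, 30],
]

# The "index": one (min, max) pair per block, built when the data is written.
index = [(min(block), max(block)) for block in blocks]


def find_equal(target):
    """Return (matching values, number of blocks actually scanned)."""
    hits = []
    scanned = 0
    for (low, high), block in zip(index, blocks):
        if target < low or target > high:
            continue  # the index proves the value cannot be in this block
        scanned += 1
        hits.extend(value for value in block if value == target)
    return hits, scanned


print(find_equal(15))  # only the middle block is read
print(find_equal(10))  # no block can contain 10, so nothing is read
```

A value that falls inside a block's range but is not present (such as 13 here) still forces that block to be scanned, since minimum and maximum values alone cannot rule it out.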

  1. What is the difference between SNAPPY and GZIP?

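Answer: Both of these are compression codecs available in Hive. A GZIP-compressed file is not splittable, which means that even if the file is stored across multiple blocks, only a single task can read it at a time. A plain Snappy-compressed text file is not splittable on its own either; Snappy becomes splittable in practice when it is used inside a container format such as ORC, Parquet, Avro or SequenceFile, where each block is compressed independently.

Snappy usually compresses and decompresses faster than GZIP, at the cost of a lower compression ratio.

GZIP uses more CPU for compression compared to Snappy, but produces smaller files.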
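What "not splittable" means can be shown with Python's standard `gzip` module. The sketch below first tries to start reading a single GZIP stream from the middle, then compresses the same data as independent blocks, which is the approach block-based container formats take. The data, the 100,000-byte block size and the helper name `can_read_from` are assumptions for the example, and Snappy itself is not used because it is not part of the standard library.

```python
import gzip
import zlib

# Hypothetical data: 20,000 rows of 24 bytes each.
data = b"".join(b"row-%06d,some,payload\n" % i for i in range(20000))

# Case 1: one GZIP stream for the whole file.
whole = gzip.compress(data)


def can_read_from(blob, offset):
    """Try to decompress starting at `offset`; report whether it worked."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False


print("read from start: ", can_read_from(whole, 0))
print("read from middle:", can_read_from(whole, len(whole) // 2))

# Case 2: the same data compressed as independent blocks.
BLOCK_SIZE = 100_000  # assumed block size in bytes
blocks = [
    gzip.compress(data[start:start + BLOCK_SIZE])
    for start in range(0, len(data), BLOCK_SIZE)
]

# Any block can be handed to a separate task and read on its own.
third_block = gzip.decompress(blocks[2])
print("third block readable alone:", third_block == data[200_000:300_000])
```

A reader that is handed only the second half of the single stream has no valid starting point, whereas each independently compressed block can be processed by a different task.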

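  1. What is the difference between SORT BY and ORDER BY?

Answer: With SORT BY, the data gets sorted within each reducer, so the combined output is only partially ordered. With ORDER BY, all the data is combined and sorted as a whole through a single reducer, which guarantees a total order but is slower on large data.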
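The difference shows up as soon as there is more than one reducer. In this minimal sketch the rows are spread over two reducers in round-robin fashion, which is an assumption made for the example; the function names simply mirror the two clauses.

```python
rows = [7, 2, 9, 4, 1, 8]  # hypothetical values of the sort column
NUM_REDUCERS = 2           # assumed number of reducers


def sort_by(rows, num_reducers):
    """Each reducer sorts only the rows it received."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        reducers[i % num_reducers].append(row)  # assumed round-robin distribution
    return [sorted(part) for part in reducers]


def order_by(rows):
    """All rows go through one sort, giving a total order."""
    return sorted(rows)


per_reducer = sort_by(rows, NUM_REDUCERS)
combined = [row for part in per_reducer for row in part]

print("SORT BY per reducer:", per_reducer)
print("SORT BY combined:   ", combined)
print("ORDER BY:           ", order_by(rows))
```

Each reducer's output is sorted, but concatenating the outputs does not give a globally sorted result, which is exactly what ORDER BY adds.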

 
