- What is bucketing and what is the use of it?
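Answer: Bucketing is an optimization technique that splits the data of a table (or of each partition) into a fixed number of buckets, based on the hash of one or more columns. Rows with the same column value always land in the same bucket, which keeps the data in more manageable parts and helps to optimize query performance.
See the post below for an example: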
https://bigdataprogrammers.com/bucketing-in-hive/
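As a rough illustration of how rows land in buckets, here is a minimal sketch. It assumes an integer bucketing column whose hash is the value itself and a table declared with 4 buckets. The names `bucket_for` and `bucketize` and the sample ids are made up for this example, and the real hashing in Hive depends on the column type and the Hive version.

```python
# Simplified model of bucket assignment: hash the bucketing column,
# then take the result modulo the fixed number of buckets.
# Assumption: an integer column whose hash is the value itself.

NUM_BUCKETS = 4  # hypothetical bucket count, fixed when the table is created


def bucket_for(user_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Return the bucket index for an integer key."""
    # Mask to a non-negative 31-bit value so the modulo is never negative.
    return (user_id & 0x7FFFFFFF) % num_buckets


def bucketize(user_ids):
    """Group keys by bucket index, like rows written to one file per bucket."""
    buckets = {b: [] for b in range(NUM_BUCKETS)}
    for uid in user_ids:
        buckets[bucket_for(uid)].append(uid)
    return buckets


if __name__ == "__main__":
    print(bucketize([101, 102, 103, 104, 105, 106]))
```

The same key always maps to the same bucket, which is why two tables bucketed on the join column can be joined bucket by bucket.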
- What is Over Partitioning and what is the solution to overcome it?
Answer: If a table is partitioned on a column that contains unique (or nearly unique) values, a new partition gets created for almost every record, so the number of partitions becomes very high. Partitioning is effective only when the number of partitions is limited and the partitions are of comparable size.
We can use bucketing to overcome this issue, because the number of buckets is fixed regardless of how many distinct values the column has.
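The effect can be sketched with a small count. The example below assumes one partition per distinct column value and at most one file per bucket; the ids and the bucket count of 32 are hypothetical.

```python
# Compare how many storage units each layout creates for a unique-valued column.
# Assumption: one partition per distinct value, one file per bucket.

def partition_count(values):
    """Partitioning creates one partition per distinct column value."""
    return len(set(values))


def bucket_count(values, num_buckets):
    """Bucketing never creates more than the declared number of buckets."""
    return len({v % num_buckets for v in values})


if __name__ == "__main__":
    order_ids = range(1, 100_001)  # hypothetical unique ids
    print(partition_count(order_ids))   # grows with the number of distinct values
    print(bucket_count(order_ids, 32))  # capped at the declared bucket count
```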
- What is the difference between Partitioning and Bucketing?
| Partitioning | Bucketing |
| --- | --- |
| It distributes the data horizontally, by partition column value, for better performance. | It decomposes the data into a fixed number of more manageable parts, based on a hash of the bucketing column(s). |
| There is no fixed number of partitions in a table. | The number of buckets is fixed for the table. |
| It is not suitable if the partition column holds unique values. | It is well suited to a column that holds unique values. |
| Data can be loaded with static as well as dynamic partitioning. | Data is always assigned to buckets dynamically. |
| Queries run faster when a filter on the partition column limits the scan to a low volume of data. | Map-side joins are faster on bucketed tables. |
| It does not provide a way to store the data in sorted order on the partition column(s) within each partition. | It provides a way to store the data in sorted order on the chosen column(s) within each bucket. |
| Partitioned data can be bucketed. | Bucketed data cannot be partitioned. |
- What are the window functions in Hive?
Answer: The window functions are:
- LEAD
- LAG
- FIRST_VALUE
- LAST_VALUE
See the link below for more details and an example:
https://bigdataprogrammers.com/windowing-functions-in-hive/
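To get a feel for what these four functions return, here is a small emulation. It assumes a single partition that is already ordered, an offset of 1 for LAG and LEAD, and a frame covering the whole partition for FIRST_VALUE and LAST_VALUE. The function name and sample values are made up for illustration.

```python
# Emulate LAG, LEAD, FIRST_VALUE and LAST_VALUE over one ordered partition.
# Assumption: the frame is the whole partition and the offset is 1.

def window_columns(values):
    """Return one dict per row with the four window results."""
    rows = []
    for i, current in enumerate(values):
        rows.append({
            "value": current,
            "lag": values[i - 1] if i > 0 else None,                   # previous row
            "lead": values[i + 1] if i + 1 < len(values) else None,    # next row
            "first_value": values[0],
            "last_value": values[-1],
        })
    return rows


if __name__ == "__main__":
    daily_sales = [100, 120, 90]  # hypothetical values, already ordered by date
    for row in window_columns(daily_sales):
        print(row)
```

LAG on the first row and LEAD on the last row come back empty (NULL in Hive) because there is no neighbouring row to read.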
- What are the analytic functions in Hive?
Answer: The analytic functions are:
- ROW_NUMBER
- RANK
- DENSE_RANK
- CUME_DIST
- PERCENT_RANK
- NTILE
See the link below for more details and an example:
https://bigdataprogrammers.com/analytics-functions-in-hive/
- What is the difference between Row Number, Rank and Dense Rank?
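Answer: ROW_NUMBER gives every row a unique sequential number, even when values tie. RANK gives tied rows the same rank and leaves a gap after them. DENSE_RANK also gives tied rows the same rank, but without a gap. For more details, see the post below: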
https://bigdataprogrammers.com/analytics-functions-in-hive/
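The difference is easiest to see on data with a tie. The sketch below emulates the three functions over one partition ordered by score in descending order; the function name and the sample scores are hypothetical.

```python
# Emulate ROW_NUMBER, RANK and DENSE_RANK over scores ordered descending.

def rank_rows(scores):
    """Return (score, row_number, rank, dense_rank) for each row."""
    ordered = sorted(scores, reverse=True)
    result = []
    rank = dense_rank = 0
    previous = None
    for row_number, score in enumerate(ordered, start=1):
        if score != previous:
            rank = row_number   # RANK jumps to the row position after a tie
            dense_rank += 1     # DENSE_RANK advances by exactly one
            previous = score
        result.append((score, row_number, rank, dense_rank))
    return result


if __name__ == "__main__":
    for row in rank_rows([90, 85, 85, 70]):
        print(row)
```

For the two rows with score 85, the row numbers are 2 and 3, while rank and dense rank are both 2. The next row then gets rank 4 but dense rank 3.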
- What are the different storage formats in Hive?
Answer: Hive can store data in different formats, such as:
- TextFile
- SequenceFile
- RCFile
- ORC
- Parquet
- Avro
- JSONFILE
- What is the advantage of ORC over Parquet?
Answer:
- ORC keeps lightweight indexes with min/max statistics for each column at the block level. They let the reader skip an entire block if the statistics show that the value in the predicate cannot be present there.
- The ORC column metadata is used by Cost-Based Optimization (CBO) to generate the most efficient execution plan.
- Full ACID transactions (tables that support UPDATE and DELETE) are only possible when using the ORC storage format.
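The block skipping in the first point can be sketched with a toy model. It assumes each block simply records the minimum and maximum of one column; the dict layout, the function names and the block size are made up for this example and are not the real ORC structures.

```python
# Toy model of min/max index skipping: each block keeps column statistics,
# and a reader skips blocks whose range cannot contain the searched value.
# Assumption: blocks and statistics are plain dicts, not real ORC structures.

def build_blocks(values, block_size):
    """Split values into blocks and record min/max statistics for each."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append({"min": min(chunk), "max": max(chunk), "rows": chunk})
    return blocks


def scan_equals(blocks, target):
    """Return matching rows and the number of blocks actually read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if target < block["min"] or target > block["max"]:
            continue  # statistics prove the value is absent, so skip the block
        blocks_read += 1
        matches.extend(v for v in block["rows"] if v == target)
    return matches, blocks_read


if __name__ == "__main__":
    blocks = build_blocks(list(range(1, 101)), 10)
    print(scan_equals(blocks, 42))
```

With sorted data, only one of the ten blocks has to be read. On unsorted data the min/max ranges overlap more, so fewer blocks can be skipped.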
- What is the difference between SNAPPY and GZIP?
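Answer: Both are compression codecs that can be used in Hive. Neither is splittable when applied to a plain text file: a file compressed as a single GZIP or Snappy stream has to be read by a single task from start to end, even if it is stored in multiple blocks. Inside container formats such as SequenceFile, ORC, Parquet or Avro, the data is compressed block by block, so the files remain splittable with either codec.
Snappy compresses and decompresses faster than GZIP, but produces larger files.
GZIP achieves a higher compression ratio, but uses more resources for compression than Snappy.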
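To see why a single GZIP stream cannot be split, here is a small sketch using only Python's standard `gzip` module. It tries to start reading in the middle of a compressed stream, and then compresses the same data block by block, which is roughly what container file formats do. The helper names and sample rows are hypothetical, and Snappy is left out because it is not in the standard library.

```python
import gzip
import zlib


def readable_from_offset(compressed: bytes, offset: int) -> bool:
    """Try to decompress starting at an arbitrary byte offset."""
    try:
        gzip.decompress(compressed[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        return False  # no valid GZIP header at this position


def compress_in_blocks(lines, lines_per_block):
    """Compress each block of lines as its own independent GZIP stream."""
    blocks = []
    for start in range(0, len(lines), lines_per_block):
        chunk = "\n".join(lines[start:start + lines_per_block]).encode()
        blocks.append(gzip.compress(chunk))
    return blocks


if __name__ == "__main__":
    lines = [f"row-{i}" for i in range(1000)]
    whole = gzip.compress("\n".join(lines).encode())
    print(readable_from_offset(whole, 0))                # start of the stream
    print(readable_from_offset(whole, len(whole) // 2))  # middle of the stream
    blocks = compress_in_blocks(lines, 250)
    print(gzip.decompress(blocks[2]).decode().splitlines()[0])
```

Each independently compressed block can be handed to a different task, while the single stream can only be read from its first byte.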
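- What is the difference between SORT BY and ORDER BY?
Answer: With SORT BY, the data gets sorted within each reducer, so the overall output is only partially ordered when there is more than one reducer. With ORDER BY, all the data gets combined and sorted as a whole, which guarantees a total order.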
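A minimal emulation of the two behaviours is shown below. It assumes two reducers and a round-robin distribution of rows, which is only a stand-in for however the rows actually reach the reducers; the function names and sample rows are made up.

```python
# Emulate per-reducer sorting versus a total sort.
# Assumption: rows are spread over reducers round-robin.

def sort_by(rows, num_reducers):
    """Per-reducer sort: each reducer sorts only the rows it received."""
    reducers = [[] for _ in range(num_reducers)]
    for i, row in enumerate(rows):
        reducers[i % num_reducers].append(row)
    return [sorted(part) for part in reducers]


def order_by(rows):
    """Total sort: all rows go through one final sort."""
    return sorted(rows)


if __name__ == "__main__":
    rows = [5, 1, 4, 2, 3, 6]
    print(sort_by(rows, 2))  # each inner list is sorted, the whole is not
    print(order_by(rows))    # one globally sorted list
```

Each reducer's output is sorted on its own, but concatenating the outputs does not give a globally sorted result.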