Pre-Splitting an HBase Table

Requirement 

To distribute load evenly across the cluster, an HBase table should be pre-split at the time it is created.

Solution

Pre-splitting lets you keep roughly the same amount of data in each region of an HBase table. It is helpful when you know the row keys in advance and want to distribute the same number of records to every region, which makes reads from the table much faster.

This practice is especially useful when you read an HBase table in Spark and then process it.

When every region holds the same number of records, no task takes much longer to finish than the others, so you get the full advantage of parallelism.
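To see why balance matters: a Spark stage finishes only when its slowest task does, so stage time is bounded by the largest partition. A toy sketch (the sizes are made up for illustration):

```scala
// Illustration: with the same total number of records, the largest
// partition determines how long the whole stage takes.
object StragglerDemo {
  // Stage time is the time of the slowest task, i.e. the biggest partition.
  def stageTime(partitionSizes: Seq[Int]): Int = partitionSizes.max

  def main(args: Array[String]): Unit = {
    val skewed   = Seq(70, 10, 10, 10) // one hot region
    val balanced = Seq(25, 25, 25, 25) // pre-split evenly
    println(s"skewed stage time:   ${stageTime(skewed)}")   // 70
    println(s"balanced stage time: ${stageTime(balanced)}") // 25
  }
}
```

Both layouts hold 100 records in total, but the balanced one finishes in roughly a third of the time.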

Below is Scala code that identifies the split points for the given data.

Let’s have a look.

Step 1 : Imports

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._ // for the $"..." column syntax used below

Step 2 : The Regions 

You need to define the number of regions you want for the HBase table.

val regions = 200

Step 3 : Loading the row keys and identification of split points 

Now load the row keys into a DataFrame to identify the split points. Once that is done, use the row_number window function to find the exact rows where the table should be split.

Let’s say X1, X2 … Xn are the keys printed by the code below.

val data_df = spark.table("dev.sample_dta")
  .select($"id".cast("string").as("id"))
  .withColumn("rank", row_number().over(Window.orderBy("id")))
  .cache()
val records_per_region = (data_df.count() / regions).toInt

data_df.filter($"rank" === 1 || $"rank" % records_per_region === 0).show(regions, false)
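The selection logic itself is simple: with the keys sorted, every Nth key becomes a region boundary. Here is a minimal plain-Scala sketch of the same idea, using made-up row keys and no cluster. Unlike the show() above, it keeps only interior boundaries, since the first and last keys are the table edges rather than split points:

```scala
object SplitPoints {
  // Given sorted row keys, take a boundary key every `recordsPerRegion`
  // rows; these become the SPLITS passed to the HBase shell.
  def splitPoints(sortedKeys: Seq[String], regions: Int): Seq[String] = {
    require(regions > 0 && sortedKeys.length >= regions)
    val recordsPerRegion = sortedKeys.length / regions
    sortedKeys.zipWithIndex
      .collect { case (k, i) if (i + 1) % recordsPerRegion == 0 => k }
      .dropRight(1) // the last boundary is the end of the table, not a split
  }

  def main(args: Array[String]): Unit = {
    val keys = (1 to 100).map(i => f"row$i%03d").toList // row001 .. row100
    println(splitPoints(keys, 4)) // List(row025, row050, row075)
  }
}
```

Note that 4 regions need only 3 split points, which is why the final boundary is dropped.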

 

Step 4 : Creating the HBase Table

Now open the HBase shell and run the command below after replacing the placeholders with the actual values.

 create 'sample_table',{NAME => 'CF', VERSIONS => 2, COMPRESSION => 'SNAPPY'},{SPLITS => ['X1','X2','X3','Xn']}
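If you script the table creation, the shell statement can be rendered directly from the computed split points. A hypothetical helper (the names are illustrative, not part of any HBase API):

```scala
object CreateStatement {
  // Render the hbase-shell `create` command from a list of split-point
  // keys, quoting each key for the SPLITS array.
  def createCommand(table: String, cf: String, splits: Seq[String]): String = {
    val quoted = splits.map(s => s"'$s'").mkString(",")
    s"create '$table',{NAME => '$cf', VERSIONS => 2, COMPRESSION => 'SNAPPY'},{SPLITS => [$quoted]}"
  }

  def main(args: Array[String]): Unit =
    println(createCommand("sample_table", "CF", Seq("X1", "X2", "X3")))
}
```

The printed line can be pasted into the HBase shell, or the split points could equally be written to a file and passed with the shell's SPLITS_FILE option.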

 

Step 5 : Load and Truncate 

Now you can load data into the HBase table, and the writes will be spread across all the regions as long as row keys from every range are present. If you later need to delete the data, use truncate_preserve instead of truncate: truncate drops the region boundaries, while truncate_preserve keeps the pre-split regions intact.

 
