Requirement
To distribute load evenly across the cluster, the HBase table needs to be pre-split at the time it is created.
Solution :
Pre-splitting keeps roughly the same amount of data in each region of an HBase table. It is helpful when you know the row keys in advance and want to distribute the same number of records to every region, which makes reads from the table much faster.
This practice is particularly useful when you read an HBase table in Spark and then process it.
When every region holds about the same number of records, no single task takes much longer to finish than the others, so you get the full benefit of parallelism.
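As a rough illustration of that point: when the table is read from Spark through the standard TableInputFormat, each HBase region becomes one input partition, so evenly sized regions translate directly into evenly sized tasks. A minimal sketch, assuming the table sample_table created in the steps below already exists with data and that the HBase mapreduce jars are on the Spark classpath:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Point TableInputFormat at the pre-split table.
val readConf = HBaseConfiguration.create()
readConf.set(TableInputFormat.INPUT_TABLE, "sample_table")

// One input partition is created per region of the table.
val hbaseRDD = spark.sparkContext.newAPIHadoopRDD(
  readConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(hbaseRDD.getNumPartitions)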
Below is the Scala code that identifies the split points based on the given data.
Let’s have a look.
Step 1 : Imports
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._   // for the $"col" syntax; already available in spark-shell
Step 2 : The Regions
Define the number of regions you want for the HBase table.
var regions = 200
Step 3 : Loading the row keys and identification of split points
Now load the row keys into a DataFrame to identify the split points. Once that is done, you can use the row_number window function to find the exact points where the HBase table should be split.
var data_df = spark.table("dev.sample_dta")
  .select($"id".cast("string").as("id"))
  .withColumn("rank", row_number().over(Window.orderBy("id")))
  .cache()

var records_per_region = (data_df.count() / regions).toInt

data_df.filter($"rank" === 1 || $"rank" % records_per_region === 0).show(regions, false)

Let's say X1, X2, …, Xn are the split points printed by the code above.
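If you would rather not copy the split points from the console output, the same filter can be collected into a local array. A minimal sketch; the name splits is only illustrative:

// Collect the split points into a local Array[String] (same filter as above).
val splits: Array[String] = data_df
  .filter($"rank" === 1 || $"rank" % records_per_region === 0)
  .orderBy("rank")
  .select("id")
  .collect()
  .map(_.getString(0))

splits.foreach(println)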
Step 4 : Creation of HBase Table
Now open the HBase shell and run the command below, after replacing X1, X2, …, Xn with the actual split points.
create 'sample_table', {NAME => 'CF', VERSIONS => 2, COMPRESSION => 'SNAPPY'}, {SPLITS => ['X1','X2','X3','Xn']}
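If you prefer to script the table creation instead of typing it in the shell, a rough sketch with the HBase 2.x client API could look like this. It reuses the splits array collected in Step 3 and assumes hbase-client is on the classpath; the table and column family definitions mirror the shell command above:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ColumnFamilyDescriptorBuilder, ConnectionFactory, TableDescriptorBuilder}
import org.apache.hadoop.hbase.io.compress.Compression
import org.apache.hadoop.hbase.util.Bytes

val hbaseConf = HBaseConfiguration.create()
val connection = ConnectionFactory.createConnection(hbaseConf)
val admin = connection.getAdmin

// Convert the collected split points into byte arrays for createTable.
val splitKeys: Array[Array[Byte]] = splits.map(s => Bytes.toBytes(s))

val columnFamily = ColumnFamilyDescriptorBuilder
  .newBuilder(Bytes.toBytes("CF"))
  .setMaxVersions(2)
  .setCompressionType(Compression.Algorithm.SNAPPY)
  .build()

val tableDescriptor = TableDescriptorBuilder
  .newBuilder(TableName.valueOf("sample_table"))
  .setColumnFamily(columnFamily)
  .build()

// Creates the table already pre-split at the given keys.
admin.createTable(tableDescriptor, splitKeys)

admin.close()
connection.close()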
Step 5 : Load and Truncate
Now you can load data into the HBase table; as long as the row keys cover all the split ranges, writes will be spread across all the regions. If you later want to delete the data, use truncate_preserve instead of truncate, since truncate_preserve keeps the region boundaries while truncate recreates the table with a single region.
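For example, the following shell command removes the data while keeping the region boundaries created above:

truncate_preserve 'sample_table'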