Join in spark using scala with example

Requirement

You have two tables named A and B, and you want to perform all types of joins in Spark using Scala. This will help you understand how joins work in Spark with Scala.

Solution

Step 1: Input Files

Download files A and B from here, and place them in a local directory.

Files A and B are comma-delimited files; please refer below.

I am placing these files into the local directory ‘sample_files’.

to see local files

  cd sample_files
  ls -R *

Step 2: Loading the files into Hive.

To load the files into Hive, let’s first put them into an HDFS location using the below commands.

put into hdfs

  hadoop fs -mkdir -p bdps/sample_files/A
  hadoop fs -mkdir -p bdps/sample_files/B
  hadoop fs -put A/A.txt bdps/sample_files/A/
  hadoop fs -put B/B.txt bdps/sample_files/B/

 

You can check the files in HDFS using the below command.

check files in hdfs

  hadoop fs -ls -R hdfs://sandbox-hdp.hortonworks.com:8020/user/root/bdps/sample_files/

Now let’s create two Hive tables, A and B, for the two files using the below commands:

hive table creation

  CREATE SCHEMA IF NOT EXISTS bdp;

  CREATE EXTERNAL TABLE IF NOT EXISTS bdp.A
  (id INT,
   TYPE STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs://sandbox-hdp.hortonworks.com:8020/user/root/bdps/sample_files/A';

  CREATE EXTERNAL TABLE IF NOT EXISTS bdp.B
  (id INT,
   TYPE STRING)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE
  LOCATION 'hdfs://sandbox-hdp.hortonworks.com:8020/user/root/bdps/sample_files/B';

Let’s check whether the data populated correctly using the below commands:

check data of hive tables

  SELECT * FROM bdp.A;
  SELECT * FROM bdp.B;

Step 3: Loading the Tables in Spark Scala

Now enter the Spark shell using the below command:

spark shell

  spark-shell

Note: I am using Spark version 2.3.

Use the below commands to load the Hive tables into DataFrames:

load table into dataframe

  val A = spark.table("bdp.A")
  val B = spark.table("bdp.B")

and check the data using the below commands:

check dataframe data

  A.show()
  B.show()

 

Let’s understand the joins one by one.

A. Inner Join:

Sometimes it is required to have only the common records out of two datasets. We have two tables, A and B, and we are joining them on the key column id.
In the output, only those records whose id matches a record in the other dataset will appear; the rest will be discarded.

Use the below command to perform the inner join in Scala.

 
 
  val inner_df = A.join(B, A("id") === B("id"))

 

Expected output:


Use the below command to see the output set.

 
 
  inner_df.show()



As you can see, only records that share an id (1, 3, and 4) are present in the output; the rest have been discarded.
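If you want to see the same inner-join semantics without a cluster, here is a minimal plain-Scala sketch. The sample values are assumptions for illustration (the real contents of files A and B are in the downloaded files); only the matching ids 1, 3, and 4 survive, as above.

```scala
// Assumed sample rows; the real contents of A and B come from the input files.
val a = Seq((1, "t1"), (2, "t2"), (3, "t3"), (4, "t4"))
val b = Seq((1, "s1"), (3, "s3"), (4, "s4"), (5, "s5"))

// Inner join: keep only the (A row, B row) pairs whose ids are equal,
// like A.join(B, A("id") === B("id")).
val inner = for {
  (aId, aType) <- a
  (bId, bType) <- b
  if aId == bId
} yield (aId, aType, bType)

println(inner) // List((1,t1,s1), (3,t3,s3), (4,t4,s4))
```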

B. Left Join

This type of join is performed when we want to look up something in another dataset; the best example is fetching an employee’s phone number from another dataset based on the employee code.
Use the below command to perform the left join.

 
 
  val left_df = A.join(B, A("id") === B("id"), "left")

Expected output

Use the below command to see the output set.

 
 
  left_df.show()

Now we have all the records of the left table A and the matched records of table B.
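The left-join behavior can also be sketched in plain Scala on assumed sample rows (the values below are illustrative, not the actual file contents): every row of A survives, and an id with no match in B gets null in B’s column.

```scala
// Assumed sample rows; id 2 exists only in A.
val a = Seq((1, "t1"), (2, "t2"), (3, "t3"), (4, "t4"))
val bById = Map(1 -> "s1", 3 -> "s3", 4 -> "s4", 5 -> "s5")

// Left join: every A row survives; an unmatched id gets null for B's column,
// like A.join(B, A("id") === B("id"), "left").
val left = a.map { case (id, aType) => (id, aType, bById.get(id).orNull) }

println(left) // List((1,t1,s1), (2,t2,null), (3,t3,s3), (4,t4,s4))
```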

C. Right Join

This type of join is performed when we want all the data of the look-up table along with only the matching records of the left table.

Use the below command to perform the right join.

 
 
  val right_df = A.join(B, A("id") === B("id"), "right")

Expected output

Use the below command to see the output set.

 
 
  right_df.show()

Now we have all the records of the right table B and the matched records of table A.
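A right join is simply the mirror of a left join. A plain-Scala sketch with assumed sample rows (illustrative values, not the real file contents): every row of B survives, and an id with no match in A gets null in A’s column.

```scala
// Assumed sample rows; id 5 exists only in B.
val aById = Map(1 -> "t1", 2 -> "t2", 3 -> "t3", 4 -> "t4")
val b = Seq((1, "s1"), (3, "s3"), (4, "s4"), (5, "s5"))

// Right join: every B row survives; an unmatched id gets null for A's column,
// like A.join(B, A("id") === B("id"), "right").
val right = b.map { case (id, bType) => (id, aById.get(id).orNull, bType) }

println(right) // List((1,t1,s1), (3,t3,s3), (4,t4,s4), (5,null,s5))
```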

D. Full Join

When we need all the matched and unmatched records out of two datasets, we can use a full join. All data from both the left and the right dataset will appear in the result set; non-matching records will have null values in the respective columns.
Use the below command to perform the full join.

 
 
  val full_df = A.join(B, A("id") === B("id"), "full")

Expected output

Use the below command to see the output set.

 
 
  full_df.show()

 

Now we have all the matched and unmatched records in the output.
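The full outer join combines both behaviors. A plain-Scala sketch on assumed sample rows (illustrative values only): every id from either side appears once, and whichever side is missing contributes a null.

```scala
// Assumed sample rows; ids 2 and 5 are each unmatched on one side.
val aById = Map(1 -> "t1", 2 -> "t2", 3 -> "t3", 4 -> "t4")
val bById = Map(1 -> "s1", 3 -> "s3", 4 -> "s4", 5 -> "s5")

// Full outer join: take the union of ids from both sides; a missing side is null,
// like A.join(B, A("id") === B("id"), "full").
val ids = (aById.keySet ++ bById.keySet).toSeq.sorted
val full = ids.map(id => (id, aById.get(id).orNull, bById.get(id).orNull))

println(full) // List((1,t1,s1), (2,t2,null), (3,t3,s3), (4,t4,s4), (5,null,s5))
```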

 

Wrapping Up

Joins are important when you have to deal with data that lives in more than one table. In real life we receive files from many sources that are related to each other, so to get meaningful information from these datasets we need to perform joins and work with the combined result.

We are happy to help you, don’t forget to subscribe us. Keep learning.


