How to execute a Scala script in Spark without creating a Jar

Requirement

The spark-shell is an environment where we can run Spark Scala code and see the output on the console after every line of code is executed. It is an interactive environment. But when we have many lines of code, we prefer to write the code in a file and execute that file. One way is to write the code in a file, create a jar, and then use that jar package with spark-submit.

Here, we will see how to run Spark code written in Scala without creating a jar package.

Sample Data

Here is the sample data (you can also download it):

empno,ename,designation,manager,hire_date,sal,deptno
7369,SMITH,CLERK,7902,12/17/1980,800,20
7499,ALLEN,SALESMAN,7698,2/20/1981,1600,30
7521,WARD,SALESMAN,7698,2/22/1981,1250,30
7566,TURNER,MANAGER,7839,4/2/1981,2975,20
7654,MARTIN,SALESMAN,7698,9/28/1981,1250,30
7698,MILLER,MANAGER,7839,5/1/1981,2850,30
7782,CLARK,MANAGER,7839,6/9/1981,2450,10
7788,SCOTT,ANALYST,7566,12/9/1982,3000,20
7839,KING,PRESIDENT,NULL,11/17/1981,5000,10
7844,TURNER,SALESMAN,7698,9/8/1981,1500,30
7876,ADAMS,CLERK,7788,1/12/1983,1100,20
7900,JAMES,CLERK,7698,12/3/1981,950,30
7902,FORD,ANALYST,7566,12/3/1981,3000,20

Solution

Step 1: Setup

We will use the given sample data in the code. You can download the data from here and keep it at any location. In my case, I have kept the file at ‘/home/bdp/data/employee.txt’.
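If you want, you can quickly sanity-check from spark-shell that the file is readable at that location before writing the script. This is just an optional check using the same path; the res0 line shows the expected header row from the sample data:

  scala> sc.textFile("file:///home/bdp/data/employee.txt").first()
  res0: String = empno,ename,designation,manager,hire_date,sal,deptno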

Step 2: Write code

  import org.apache.spark.sql.SQLContext
  import org.apache.spark.{SparkConf, SparkContext}

  object ReadTextFile {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("Read Text File in Spark").setMaster("local[*]")
      // In spark-shell, the SparkContext is already available as sc,
      // so there is no need to create a new one here.
      //val sc = new SparkContext(conf)
      // Read the input file (path passed as the first argument) into an RDD
      val textRDD = sc.textFile(args(0))
      // Print the RDD
      textRDD.collect().foreach(println)
      // Get the header of the file
      val header = textRDD.first()
      // Remove the header
      val filterRDD = textRDD.filter(row => row != header)
      // Print the RDD without the header
      filterRDD.collect().foreach(println)
      // Data count
      println(filterRDD.count)
    }
  }

In the code, we pass the path of the input file as an argument (args(0)).

Step 3: Execution

We have written the code in a file. Now, let's execute it in spark-shell. This can be done in a few ways:

  1. Script Execution Directly
  2. Open spark-shell and load the file
  3. cat file_name.scala | spark-shell

Approach 1: Script Execution Directly

In this approach, start the spark-shell with the script. It will compile the file. Once the spark-shell is open, we just need to call the main method.

  [root@sandbox-hdp ~]# /usr/hdp/2.6.5.0-292/spark/bin/spark-shell -i /home/bdp/codebase/ReadTextFile.scala

Here, I am running Spark version 1.6. You will see output on the console like below:

  SQL context available as sqlContext.
  Loading /home/bdp/codebase/ReadTextFile.scala...
  import org.apache.spark.sql.SQLContext
  import org.apache.spark.{SparkConf, SparkContext}
  defined module ReadTextFile

Now, call the main method with an argument:

  scala> ReadTextFile.main(Array("file:///home/bdp/data/employee.txt"))

Here, if your code doesn’t require any argument, then you can pass null instead of the file path, as sketched below.
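A minimal sketch of such a call, assuming a hypothetical object SomeJob whose main ignores its arguments (the ReadTextFile example above does need a path); passing an empty array also works and is the safer choice if main ever touches args:

  scala> SomeJob.main(null)              // SomeJob is a hypothetical no-argument job
  scala> SomeJob.main(Array[String]())   // an empty array also works and avoids a NullPointerException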

Approach 2: Loading the Script

In this approach, once the spark-shell is open, load the script using the command below:

  scala> :load /home/bdp/codebase/ReadTextFile.scala

Call the Main Method:

The code requires an input file path as an argument.

  scala> ReadTextFile.main(Array("file:///home/bdp/data/employee.txt"))

In addition to this, if you need to add any jar file in spark-shell, then you can do it using the below commands:

In Spark 1.6.x:

  :cp <local path of jar>

In Spark 2.x:

  :require <local path of jar>
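For instance, a sketch of adding a jar in Spark 2.x and then importing a class from it; the jar path and class name here are purely hypothetical placeholders:

  scala> :require /home/bdp/jars/my-udfs.jar        // hypothetical local jar path
  scala> import com.example.udfs.CleanStrings       // hypothetical class inside that jar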

Approach 3: By Reading the Script

Read the script and pipe it to spark-shell using the below command, and call the main method just like before (see the note after the command).

  [root@sandbox-hdp codebase]# cat ReadTextFile.scala | /usr/hdp/2.6.5.0-292/spark/bin/spark-shell
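Note that when the script is piped in this way, spark-shell exits once it reaches the end of the input, so there is no prompt left to call the main method interactively. One convenient option (a sketch, assuming the same file path used earlier) is to append the call to the end of ReadTextFile.scala itself, after the closing brace of the object:

  // Appended at the very end of ReadTextFile.scala, after the object definition,
  // so the job runs as soon as the piped script finishes loading
  ReadTextFile.main(Array("file:///home/bdp/data/employee.txt"))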

Wrapping Up

In this post, we have seen how to run a Scala script in spark-shell without creating a jar. It is very useful when you are testing code. With this, you can easily modify the Scala code and reload the file to test the new changes.

It is not recommended for production jobs. It is good when you are testing a job or doing a quick POC.
