How to execute Scala script in Spark without creating Jar

Requirement

The spark-shell is an interactive environment where we can run Spark Scala code and see the output on the console for every line of code we execute. But when the code grows beyond a few lines, we prefer to write it in a file and execute the file. The usual way is to write the code in a file, build a jar, and then submit that jar with spark-submit.

Here, we will check how to run Spark code written in Scala without creating a jar package.

Sample Data

Below is the sample data used in this post. You can also download it:

empno,ename,designation,manager,hire_date,sal,deptno
7369,SMITH,CLERK,7902,12/17/1980,800,20
7499,ALLEN,SALESMAN,7698,2/20/1981,1600,30
7521,WARD,SALESMAN,7698,2/22/1981,1250,30
7566,TURNER,MANAGER,7839,4/2/1981,2975,20
7654,MARTIN,SALESMAN,7698,9/28/1981,1250,30
7698,MILLER,MANAGER,7839,5/1/1981,2850,30
7782,CLARK,MANAGER,7839,6/9/1981,2450,10
7788,SCOTT,ANALYST,7566,12/9/1982,3000,20
7839,KING,PRESIDENT,NULL,11/17/1981,5000,10
7844,TURNER,SALESMAN,7698,9/8/1981,1500,30
7876,ADAMS,CLERK,7788,1/12/1983,1100,20
7900,JAMES,CLERK,7698,12/3/1981,950,30
7902,FORD,ANALYST,7566,12/3/1981,3000,20

Solution

Step 1: Setup

We will use the sample data given above in the code. You can download the data from here and keep it at any location. In my case, I have kept the file at ‘/home/bdp/data/employee.txt’.

Step 2: Write code

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object ReadTextFile {

  def main(args : Array[String]) : Unit = {
    // spark-shell already provides a SparkContext as `sc`,
    // so the conf/context creation is kept here only for reference.
    val conf = new SparkConf().setAppName("Read Text File in Spark").setMaster("local[*]")
    //val sc = new SparkContext(conf)

    // Load the text file whose path is passed as the first argument
    val textRDD = sc.textFile(args(0))

    // Print the full RDD (including the header row)
    textRDD.collect().foreach(println)

    // Get the header of the file
    val header = textRDD.first()

    // Remove the header row
    val filterRDD = textRDD.filter(row => row != header)

    // Print the RDD without the header
    filterRDD.collect().foreach(println)

    // Count of data rows
    println(filterRDD.count)
  }
}

In the code, we are passing the path of the input file as an argument. With the sample employee.txt above, the script prints all the rows, then the rows with the header removed, and the final count is 13.
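
If you also want to work with the individual columns, the filtered RDD can be split on commas. This is a minimal sketch, not part of the original script; the field positions follow the header of the sample file:

 // Field order: empno,ename,designation,manager,hire_date,sal,deptno
 val empRDD = filterRDD.map { row =>
   val cols = row.split(",")
   (cols(0).toInt, cols(1), cols(6).toInt)   // (empno, ename, deptno)
 }
 empRDD.collect().foreach(println)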

Step 3: Execution

We have written the code in a file. Now, let's execute it in spark-shell. This can be done in several ways:

  1. Script Execution Directly
  2. Open spark-shell and load the file
  3. cat file_name.scala | spark-shell

Approach 1: Script Execution Directly

In this approach, start spark-shell with the script. It will compile the file as it loads. Once spark-shell is open, you just need to call the main method.

 [root@sandbox-hdp ~]# /usr/hdp/2.6.5.0-292/spark/bin/spark-shell -i /home/bdp/codebase/ReadTextFile.scala

Here, I am running with Spark version 1.6. You will see output on the console like below:

 SQL context available as sqlContext. 
Loading /home/bdp/codebase/ReadTextFile.scala... 
import org.apache.spark.sql.SQLContext 
import org.apache.spark.{SparkConf, SparkContext} 
defined module ReadTextFile

Now, call the main method with an argument:

 scala> ReadTextFile.main(Array("file:///home/bdp/data/employee.txt"))

Here, if your code doesn’t require any arguments, you can pass null instead of the file path.
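
For example, a hypothetical object whose main method ignores its arguments (not part of this post's script) can be called with null:

 object HelloSpark {
   def main(args: Array[String]): Unit = {
     // args is not used, so passing null is safe here
     println("Running on Spark " + sc.version)   // sc comes from spark-shell
   }
 }

 scala> HelloSpark.main(null)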

Approach 2: Loading the Script

In this approach, once the spark-shell is open, load the script using the below command:

 scala> :load /home/bdp/codebase/ReadTextFile.scala

Call the Main Method:

The code requires an input file path as an argument.

 scala> ReadTextFile.main(Array("file:///home/bdp/data/employee.txt"))

In addition to this, if you need to add any jar file to the spark-shell session, you can do so using the below command:

In Spark 1.6.x:

 :cp <local path of jar>

In Spark 2.x:

 :require <local path of jar>
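
For example, after adding a jar you can import classes from it in the same session. The jar path below is just a hypothetical local path; use whatever library you actually need:

 scala> :require /home/bdp/jars/commons-csv-1.8.jar
 scala> import org.apache.commons.csv.CSVFormat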

Approach 3: By Reading the Script

Pipe the script into spark-shell using the below command; the shell compiles the script as it reads it. To call the main method in the same run, append the call to the piped input, as shown in the sketch after the command.

 [root@sandbox-hdp codebase]# cat ReadTextFile.scala | /usr/hdp/2.6.5.0-292/spark/bin/spark-shell
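
Since spark-shell exits once the piped input ends, one way to invoke main in the same run is to append the call to the piped stream, for example (a sketch using shell command grouping, with the same sample file path as above):

 [root@sandbox-hdp codebase]# { cat ReadTextFile.scala; echo 'ReadTextFile.main(Array("file:///home/bdp/data/employee.txt"))'; } | /usr/hdp/2.6.5.0-292/spark/bin/spark-shell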

Wrapping Up

In this post, we have seen how to run a Scala script in spark-shell without creating a jar. It is very useful when you are testing code: you can easily modify the Scala code and reload the file to test the changes.

It is not recommended for production jobs. It is good when you are testing the job or doing a sort of POC.
