Requirement
The spark-shell is an interactive environment where we can run Spark Scala code and see the output on the console after each line of code is executed. But when we have many lines of code, we prefer to write the code in a file and execute that file. The usual way is to write the code in a file, build a jar, and then run the jar with spark-submit, as sketched below.
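For contrast, the jar-based route looks roughly like this sketch (the jar name readtextfile.jar and the build step are placeholders for whatever build tool you use; code run this way must create its own SparkContext rather than rely on the shell's sc):

# Build the jar first, e.g. with "sbt package", then submit it with the input path as an argument
[root@sandbox-hdp ~]# /usr/hdp/2.6.5.0-292/spark/bin/spark-submit --class ReadTextFile --master "local[*]" readtextfile.jar file:///home/bdp/data/employee.txt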
Here, we will check how to run Spark code written in Scala without creating a jar package.
Sample Data
You can also download the sample data:
empno,ename,designation,manager,hire_date,sal,deptno
7369,SMITH,CLERK,7902,12/17/1980,800,20
7499,ALLEN,SALESMAN,7698,2/20/1981,1600,30
7521,WARD,SALESMAN,7698,2/22/1981,1250,30
7566,TURNER,MANAGER,7839,4/2/1981,2975,20
7654,MARTIN,SALESMAN,7698,9/28/1981,1250,30
7698,MILLER,MANAGER,7839,5/1/1981,2850,30
7782,CLARK,MANAGER,7839,6/9/1981,2450,10
7788,SCOTT,ANALYST,7566,12/9/1982,3000,20
7839,KING,PRESIDENT,NULL,11/17/1981,5000,10
7844,TURNER,SALESMAN,7698,9/8/1981,1500,30
7876,ADAMS,CLERK,7788,1/12/1983,1100,20
7900,JAMES,CLERK,7698,12/3/1981,950,30
7902,FORD,ANALYST,7566,12/3/1981,3000,20
Solution
Step 1: Setup
We will use the given sample data in the code. You can download the data from here and keep it at any location. In my case, I have kept the file at '/home/bdp/data/employee.txt'.
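You can quickly check that the file is in place before running the script, for example:

[root@sandbox-hdp ~]# head -3 /home/bdp/data/employee.txt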
Step 2: Write code
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}

object ReadTextFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Read Text File in Spark").setMaster("local[*]")
    // In spark-shell a SparkContext is already available as sc,
    // so we do not create a new one here.
    //val sc = new SparkContext(conf)

    // Read the file as an RDD of lines
    val textRDD = sc.textFile(args(0))
    textRDD.collect().foreach(println)

    // Get the header of the file
    val header = textRDD.first()

    // Remove the header row
    val filterRDD = textRDD.filter(row => row != header)

    // Print the remaining rows
    filterRDD.collect().foreach(println)

    // Data count (excluding the header)
    println(filterRDD.count())
  }
}
In the code, we are passing the path of the input file as an argument.
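The script only prints raw lines. If you later want typed access to the columns in the sample data, a minimal sketch might look like this (the Employee case class and the parse helper are my own illustration, not part of the original script):

// Hypothetical record type matching the columns of employee.txt
case class Employee(empno: Int, ename: String, designation: String,
                    manager: String, hireDate: String, sal: Double, deptno: Int)

// Hypothetical helper: split one CSV row into a typed record
def parse(row: String): Employee = {
  val f = row.split(",")
  Employee(f(0).toInt, f(1), f(2), f(3), f(4), f(5).toDouble, f(6).toInt)
}

// e.g. filterRDD.map(parse).filter(_.deptno == 20).foreach(println)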
Step 3: Execution
We have written the code in a file. Now, let's execute it in spark-shell. This can be done in several ways:
- Script Execution Directly
- Open spark-shell and load the file
- cat file_name.scala | spark-shell
Approach 1: Script Execution Directly
In this approach, start spark-shell with the script. It will compile the file. Once spark-shell is open, you just need to call the main method.
[root@sandbox-hdp ~]# /usr/hdp/2.6.5.0-292/spark/bin/spark-shell -i /home/bdp/codebase/ReadTextFile.scala
Here, I am running Spark version 1.6. You will see output on the console like below:
SQL context available as sqlContext.
Loading /home/bdp/codebase/ReadTextFile.scala...
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
defined module ReadTextFile
Now, call the main method with an argument:
scala> ReadTextFile.main(Array("file:///home/bdp/data/employee.txt"))
Here, if your code doesn't require any argument, you can pass null instead of the file path.
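For example, assuming a hypothetical object NoArgJob whose main method ignores its arguments:

scala> NoArgJob.main(null)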
Approach 2: Loading the Script
In this approach, once spark-shell is open, load the script using the :load command:
scala> :load /home/bdp/codebase/ReadTextFile.scala
Call the Main Method:
The code requires an input file path as an argument.
scala> ReadTextFile.main(Array("file:///home/bdp/data/employee.txt"))
In addition to this, if you need to add any jar file to spark-shell, you can do so using the below commands:
In Spark 1.6.x: :cp <local path of jar>
In Spark 2.x: :require <local path of jar>
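For example, in Spark 2.x (the jar path here is just a placeholder):

scala> :require /home/bdp/jars/mylib.jar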
Approach 3: By Reading the Script
Read the script and pipe it to spark-shell using the below command, and call the main method just like before.
[root@sandbox-hdp codebase]# cat ReadTextFile.scala | /usr/hdp/2.6.5.0-292/spark/bin/spark-shell
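Note that a piped spark-shell session exits once the input stream ends, so to invoke main in the same run you can append the call to the stream; a sketch, assuming a bash-like shell:

[root@sandbox-hdp codebase]# (cat ReadTextFile.scala; echo 'ReadTextFile.main(Array("file:///home/bdp/data/employee.txt"))') | /usr/hdp/2.6.5.0-292/spark/bin/spark-shell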
Wrapping Up
In this post, we have seen how to run a Scala script in spark-shell without creating a jar. It is very useful when you are testing code: you can easily modify the Scala script and reload the file to test the new changes.
It is not recommended for production jobs, but it is good when you are testing a job or doing a quick POC.