How to create a Spark application in IntelliJ

Requirement

In spark-shell, an instance of the Spark context is already created for you as sc, and you don't have to resolve dependencies yourself. Both of these become your responsibility once you move from spark-shell to an IDE. So how do you create a Spark application in IntelliJ? In this post, we are going to create a Spark application using the IDE, execute the job locally, and see the output in the IDE console.

Prerequisite

Before jumping into the steps, make sure you have the following prerequisites:

  • JDK 1.7 or above (Spark 1.6 requires Java 7 or later)
  • IntelliJ IDEA IDE
  • SBT

Solution

Follow the below steps:

Step 1: Create SBT Project

Go to File -> New -> Project. A window will appear on your screen:

Choose SBT and click Next.

Here, fill in the following entries:

Name: Give any project name. In my case, I used SparkJob.

Location: Your workspace location.

JDK: If you see nothing here, click on New and provide the JDK location.

SBT: Keep it as it is.

Scala: Here you can change the Scala version from the dropdown. I have kept 2.10.4.
Once everything is filled in, click Finish.

You will be able to see your project structure like the screenshot above. If you get a prompt to enable SBT auto-import, enable it; it will pull in all the required plugins for the project.

Step 2: Resolve Dependency

In this step, we will update build.sbt by adding library dependencies. This will download all the dependencies.

This file already contains the project name, version, and scalaVersion configuration. Let's add entries for the spark-core and scala-library dependencies.
Add the lines below to the file:

libraryDependencies ++= Seq(
  "org.scala-lang" % "scala-library" % "2.10.4",
  "org.apache.spark" % "spark-core_2.10" % "1.6.0"
)
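
For reference, the complete build.sbt after this change would look roughly like the sketch below. The name and version values here are assumptions based on the project settings chosen in Step 1; yours may differ.

name := "SparkJob"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.scala-lang" % "scala-library" % "2.10.4",
  "org.apache.spark" % "spark-core_2.10" % "1.6.0"
)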


Once you save the file, IntelliJ will start downloading the dependencies.

Step 3: Input Data

We will use the sample data below for this Spark application. I have created a resources directory under src/main and kept the data in a file called emp_data.txt.

 empno,ename,designation,manager,hire_date,sal,deptno
7369,SMITH,CLERK,9902,2010-12-17,800.00,20
7499,ALLEN,SALESMAN,9698,2011-02-20,1600.00,30
7521,WARD,SALESMAN,9698,2011-02-22,1250.00,30
9566,JONES,MANAGER,9839,2011-04-02,2975.00,20
7654,MARTIN,SALESMAN,9698,2011-09-28,1250.00,30
9698,BLAKE,MANAGER,9839,2011-05-01,2850.00,30
9782,CLARK,MANAGER,9839,2011-06-09,2450.00,10
9788,SCOTT,ANALYST,9566,2012-12-09,3000.00,20
9839,KING,PRESIDENT,NULL,2011-11-17,5000.00,10
7844,TURNER,SALESMAN,9698,2011-09-08,1500.00,30
7876,ADAMS,CLERK,9788,2012-01-12,1100.00,20
7900,JAMES,CLERK,9698,2011-12-03,950.00,30
9902,FORD,ANALYST,9566,2011-12-03,3000.00,20
7934,MILLER,CLERK,9782,2012-01-23,1300.00,10

Download the sample data here: emp_data

Step 4: Write Spark Job

All the setup is done. Now create a Scala object and write a small piece of code that loads the file and reads the records from it.

Right-click on the scala directory -> New -> Scala Class


Give the script a name and choose Object as the Kind.

Write the below code:

import org.apache.spark.{SparkConf, SparkContext}

object LoadData {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Job for Loading Data").setMaster("local[*]") // local[*] will use all cores of your machine
    val sc = new SparkContext(conf) // Create Spark context

    // Load the local file; this returns an RDD of lines
    val emp_data = sc.textFile("src\\main\\resources\\emp_data.txt")

    // Read and print the records
    emp_data.foreach(println)
  }
}

Step 5: Execution

The Spark job is ready for execution. Right-click and choose Run 'LoadData'. Once you click Run, you will be able to see the Spark execution in the console, with the records of the file as the output (the records may print in a different order than in the input file, because the file is split into partitions that are processed in parallel):

 17/09/17 19:51:26 INFO HadoopRDD: Input split: file:/C:/Test/SparkJob/src/main/resources/emp_data.txt:353+353
17/09/17 19:51:26 INFO HadoopRDD: Input split: file:/C:/Test/SparkJob/src/main/resources/emp_data.txt:0+353
17/09/17 19:51:26 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
17/09/17 19:51:26 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
17/09/17 19:51:26 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
17/09/17 19:51:26 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
17/09/17 19:51:26 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
empno,ename,designation,manager,hire_date,sal,deptno
9788,SCOTT,ANALYST,9566,2012-12-09,3000.00,20
7369,SMITH,CLERK,9902,2010-12-17,800.00,20
9839,KING,PRESIDENT,NULL,2011-11-17,5000.00,10
7499,ALLEN,SALESMAN,9698,2011-02-20,1600.00,30
7844,TURNER,SALESMAN,9698,2011-09-08,1500.00,30
7876,ADAMS,CLERK,9788,2012-01-12,1100.00,20
7900,JAMES,CLERK,9698,2011-12-03,950.00,30
9902,FORD,ANALYST,9566,2011-12-03,3000.00,20
7934,MILLER,CLERK,9782,2012-01-23,1300.00,10
7521,WARD,SALESMAN,9698,2011-02-22,1250.00,30
9566,JONES,MANAGER,9839,2011-04-02,2975.00,20
7654,MARTIN,SALESMAN,9698,2011-09-28,1250.00,30
9698,BLAKE,MANAGER,9839,2011-05-01,2850.00,30
9782,CLARK,MANAGER,9839,2011-06-09,2450.00,10
17/09/17 19:51:26 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2044 bytes result sent to driver
17/09/17 19:51:26 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2044 bytes result sent to driver
17/09/17 19:51:26 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 509 ms on localhost (1/2)
17/09/17 19:51:26 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 461 ms on localhost (2/2)
17/09/17 19:51:26 INFO DAGScheduler: ResultStage 0 (foreach at LoadData.scala:18) finished in 0.636 s
17/09/17 19:51:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
17/09/17 19:51:26 INFO DAGScheduler: Job 0 finished: foreach at LoadData.scala:18, took 5.461476 s

Wrapping Up

In this post, we have learned how to create a Spark application in the IntelliJ IDE and run it locally. You can extend the Spark job by adding code for some transformation and action on the created RDD, as sketched below. We will cover this part in more detail in another post.
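
As a small taste of what that extension might look like, here is a sketch (not part of the tutorial code above) that parses each line into a hypothetical Employee case class and runs a simple transformation and action on the RDD. The object and case class names are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical schema matching the columns of emp_data.txt
case class Employee(empno: Int, ename: String, designation: String,
                    manager: String, hireDate: String, sal: Double, deptno: Int)

object LoadDataTransform {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Spark Job with Transformations").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("src\\main\\resources\\emp_data.txt")
    val header = lines.first() // the first line holds the column names

    val employees = lines
      .filter(_ != header)   // transformation: drop the header row
      .map(_.split(","))     // transformation: split each record into fields
      .map(f => Employee(f(0).toInt, f(1), f(2), f(3), f(4), f(5).toDouble, f(6).toInt))

    // action: count and print only the salesmen
    val salesmen = employees.filter(_.designation == "SALESMAN")
    println(s"Number of salesmen: ${salesmen.count()}")
    salesmen.collect().foreach(println)

    sc.stop()
  }
}

Running this prints the number of salesman records followed by the matching Employee rows.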

Please like, share, and subscribe if you like the post.
