How to create spark application in IntelliJ

Requirement

In spark-shell, Spark creates an instance of the Spark context for you as sc, and you don't need to resolve dependencies yourself. Both of these change once you move from spark-shell to an IDE. So how do you create a Spark application in IntelliJ? In this post, we are going to create a Spark application using the IDE. We will also execute the job locally and see the output in the IDE console.

Prerequisite

Before jumping into the steps, make sure you have the following:

  • JDK 1.6 or above
  • IntelliJ IDEA IDE
  • SBT

Solution

Follow the steps below:

Step 1: Create SBT Project

Go to File->New->Project. A window will appear on your screen:

Choose SBT and click Next.

Here, fill in the following entries:

Name: Give any project name. In my case, I gave SparkJob.

Location: Workspace location

JDK: If nothing is listed here, click the New option and provide the JDK location.

SBT: Keep it as it is.

Scala: Here you can change the Scala version from the dropdown. I have kept 2.10.4.

Once everything is filled in, click Finish.

You will be able to see your project like this (above screenshot). If you get a prompt about enabling auto-import for SBT, enable it. It will import all the required dependencies for the project.

Step 2: Resolve Dependency

In this step, we will update build.sbt by adding library dependencies. SBT will then download all the dependencies.

This file contains the project name, version, and scalaVersion configuration. Let's add entries for the spark-core dependency and the scala-library.
Add the below lines in the file:

build.sbt
 
  libraryDependencies ++= Seq(
    "org.scala-lang" % "scala-library" % "2.10.4",
    "org.apache.spark" % "spark-core_2.10" % "1.6.0"
  )


Once you save the file, IntelliJ will start downloading the dependencies.
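For reference, the whole build.sbt might look like the following sketch. The name and version values are assumptions based on the project name chosen in Step 1; yours may differ:

```scala
// build.sbt (sketch; name and version are assumed placeholders)
name := "SparkJob"

version := "1.0"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.scala-lang" % "scala-library" % "2.10.4",
  "org.apache.spark" % "spark-core_2.10" % "1.6.0"
)
```

Note that the spark-core artifact name carries the Scala binary version suffix (_2.10), so it must match the scalaVersion setting.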

Step 3: Input Data

We will use the below sample data for this Spark application. I have created a directory resources under main and kept the data in a file called emp_data.txt.

sample_data
 
  empno,ename,designation,manager,hire_date,sal,deptno
  7369,SMITH,CLERK,9902,2010-12-17,800.00,20
  7499,ALLEN,SALESMAN,9698,2011-02-20,1600.00,30
  7521,WARD,SALESMAN,9698,2011-02-22,1250.00,30
  9566,JONES,MANAGER,9839,2011-04-02,2975.00,20
  7654,MARTIN,SALESMAN,9698,2011-09-28,1250.00,30
  9698,BLAKE,MANAGER,9839,2011-05-01,2850.00,30
  9782,CLARK,MANAGER,9839,2011-06-09,2450.00,10
  9788,SCOTT,ANALYST,9566,2012-12-09,3000.00,20
  9839,KING,PRESIDENT,NULL,2011-11-17,5000.00,10
  7844,TURNER,SALESMAN,9698,2011-09-08,1500.00,30
  7876,ADAMS,CLERK,9788,2012-01-12,1100.00,20
  7900,JAMES,CLERK,9698,2011-12-03,950.00,30
  9902,FORD,ANALYST,9566,2011-12-03,3000.00,20
  7934,MILLER,CLERK,9782,2012-01-23,1300.00,10

Download the sample data from here emp_data
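Each line of emp_data.txt is a comma-separated record. Before wiring anything into Spark, the per-record parsing can be sketched in plain Scala; the EmpRecord case class and parse helper below are illustrative names, not part of the original post. Note that manager can be the literal string NULL (see the KING row), so it is kept as a String here:

```scala
// Illustrative representation of one row of emp_data.txt
case class EmpRecord(empno: Int, ename: String, designation: String,
                     manager: String, hireDate: String, sal: Double, deptno: Int)

object ParseDemo {
  // Split one CSV line into fields and build an EmpRecord
  def parse(line: String): EmpRecord = {
    val f = line.split(",")
    EmpRecord(f(0).toInt, f(1), f(2), f(3), f(4), f(5).toDouble, f(6).toInt)
  }

  def main(args: Array[String]): Unit = {
    val rec = parse("7369,SMITH,CLERK,9902,2010-12-17,800.00,20")
    println(rec.ename) // SMITH
    println(rec.sal)   // 800.0
  }
}
```

The same function can later be applied record by record inside a map over the RDD.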

Step 4: Write Spark Job

All the setup is done. Now create a Scala object and write a small piece of code that will load the file and read its records.

Right-click on the scala directory -> New -> Scala Class


Give a name for the script and choose the kind as Object.

Write the below code:

LoadData.scala
 
import org.apache.spark.{SparkConf, SparkContext}

object LoadData {
  def main(args: Array[String]): Unit = {
    // local[*] will use all cores of your machine
    val conf = new SparkConf().setAppName("Spark Job for Loading Data").setMaster("local[*]")
    // Create the Spark context
    val sc = new SparkContext(conf)
    // Load the local file; textFile returns an RDD of lines
    val emp_data = sc.textFile("src\\main\\resources\\emp_data.txt")
    // Print each record
    emp_data.foreach(println)
  }
}

Step 5: Execution

The Spark job is ready for execution. Right-click and choose Run 'LoadData'. Once you click Run, you will be able to see the Spark execution in the console, along with the records of the file as output. Note that foreach prints records per partition, so they may not appear in file order:

 
 
  17/09/17 19:51:26 INFO HadoopRDD: Input split: file:/C:/Test/SparkJob/src/main/resources/emp_data.txt:353+353
  17/09/17 19:51:26 INFO HadoopRDD: Input split: file:/C:/Test/SparkJob/src/main/resources/emp_data.txt:0+353
  17/09/17 19:51:26 INFO deprecation: mapred.tip.id is deprecated. Instead, use mapreduce.task.id
  17/09/17 19:51:26 INFO deprecation: mapred.task.id is deprecated. Instead, use mapreduce.task.attempt.id
  17/09/17 19:51:26 INFO deprecation: mapred.task.is.map is deprecated. Instead, use mapreduce.task.ismap
  17/09/17 19:51:26 INFO deprecation: mapred.task.partition is deprecated. Instead, use mapreduce.task.partition
  17/09/17 19:51:26 INFO deprecation: mapred.job.id is deprecated. Instead, use mapreduce.job.id
  empno,ename,designation,manager,hire_date,sal,deptno
  9788,SCOTT,ANALYST,9566,2012-12-09,3000.00,20
  7369,SMITH,CLERK,9902,2010-12-17,800.00,20
  9839,KING,PRESIDENT,NULL,2011-11-17,5000.00,10
  7499,ALLEN,SALESMAN,9698,2011-02-20,1600.00,30
  7844,TURNER,SALESMAN,9698,2011-09-08,1500.00,30
  7876,ADAMS,CLERK,9788,2012-01-12,1100.00,20
  7900,JAMES,CLERK,9698,2011-12-03,950.00,30
  9902,FORD,ANALYST,9566,2011-12-03,3000.00,20
  7934,MILLER,CLERK,9782,2012-01-23,1300.00,10
  7521,WARD,SALESMAN,9698,2011-02-22,1250.00,30
  9566,JONES,MANAGER,9839,2011-04-02,2975.00,20
  7654,MARTIN,SALESMAN,9698,2011-09-28,1250.00,30
  9698,BLAKE,MANAGER,9839,2011-05-01,2850.00,30
  9782,CLARK,MANAGER,9839,2011-06-09,2450.00,10
  17/09/17 19:51:26 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2044 bytes result sent to driver
  17/09/17 19:51:26 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 2044 bytes result sent to driver
  17/09/17 19:51:26 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 509 ms on localhost (1/2)
  17/09/17 19:51:26 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 461 ms on localhost (2/2)
  17/09/17 19:51:26 INFO DAGScheduler: ResultStage 0 (foreach at LoadData.scala:18) finished in 0.636 s
  17/09/17 19:51:26 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
  17/09/17 19:51:26 INFO DAGScheduler: Job 0 finished: foreach at LoadData.scala:18, took 5.461476 s

Wrapping Up

In this post, we have learned to create a Spark application in the IntelliJ IDE and run it locally. You can extend the Spark job by adding code for transformations and actions on the created RDD. We will cover that part in another post.
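As a small taste of such an extension, the core filter/map/groupBy chain works the same way on plain Scala collections as on an RDD, so the logic can be sketched without a Spark context. The DeptCountDemo object and countByDept helper below are illustrative names (not from this post); they drop the header row and count employees per department:

```scala
object DeptCountDemo {
  // Group non-header CSV lines by the deptno column (index 6) and count them
  def countByDept(lines: Seq[String]): Map[String, Int] =
    lines
      .filter(!_.startsWith("empno"))   // drop the header row
      .map(_.split(",")(6))             // keep only the deptno column
      .groupBy(identity)
      .map { case (dept, rows) => (dept, rows.size) }

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "empno,ename,designation,manager,hire_date,sal,deptno",
      "7369,SMITH,CLERK,9902,2010-12-17,800.00,20",
      "7499,ALLEN,SALESMAN,9698,2011-02-20,1600.00,30",
      "7521,WARD,SALESMAN,9698,2011-02-22,1250.00,30"
    )
    val counts = countByDept(lines)
    println(counts("20")) // 1
    println(counts("30")) // 2
  }
}
```

On the RDD from the job above, the equivalent would use the same filter and map calls followed by an action such as countByValue.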

Please like, share, and subscribe if you like the post.
