How to Add a New Column in a Spark DataFrame

Requirement

When we ingest data from a source into the Hadoop data lake, we often add some additional columns to the existing source data. These columns help us validate and analyze the data. So, in this post, we will walk through how to add such additional columns to the source data.

In addition to this, we will also check how to drop an existing column and rename a column in a Spark DataFrame.

Sample Data

 
 
[
  {
    "empno": "7369",
    "ename": "SMITH",
    "designation": "CLERK",
    "manager": "7902",
    "hire_date": "12/17/1980",
    "sal": "800",
    "deptno": "20"
  },
  {
    "empno": "7499",
    "ename": "ALLEN",
    "designation": "SALESMAN",
    "manager": "7698",
    "hire_date": "2/20/1981",
    "sal": "1600",
    "deptno": "30"
  },
  {
    "empno": "7521",
    "ename": "WARD",
    "designation": "SALESMAN",
    "manager": "7698",
    "hire_date": "2/22/1981",
    "sal": "1250",
    "deptno": "30"
  },
  {
    "empno": "7566",
    "ename": "TURNER",
    "designation": "MANAGER",
    "manager": "7839",
    "hire_date": "4/2/1981",
    "sal": "2975",
    "deptno": "20"
  },
  {
    "empno": "7654",
    "ename": "MARTIN",
    "designation": "SALESMAN",
    "manager": "7698",
    "hire_date": "9/28/1981",
    "sal": "1250",
    "deptno": "30"
  },
  {
    "empno": "7698",
    "ename": "MILLER",
    "designation": "MANAGER",
    "manager": "7839",
    "hire_date": "5/1/1981",
    "sal": "2850",
    "deptno": "30"
  },
  {
    "empno": "7782",
    "ename": "CLARK",
    "designation": "MANAGER",
    "manager": "7839",
    "hire_date": "6/9/1981",
    "sal": "2450",
    "deptno": "10"
  },
  {
    "empno": "7788",
    "ename": "SCOTT",
    "designation": "ANALYST",
    "manager": "7566",
    "hire_date": "12/9/1982",
    "sal": "3000",
    "deptno": "20"
  },
  {
    "empno": "7839",
    "ename": "KING",
    "designation": "PRESIDENT",
    "manager": "NULL",
    "hire_date": "11/17/1981",
    "sal": "5000",
    "deptno": "10"
  }
]

Solution

Step 1: Set Up

We will use the given sample data in the code. You can download the data and keep it at any location. In my case, I have kept the file at ‘/home/bdp/data/employees_multiLine.json’.

Step 2: Write Code

First, load the data into a DataFrame.

 
 
// Load the JSON data:
scala> val data = sqlContext.read.json(sc.wholeTextFiles("file:///home/bdp/data/employees_multiLine.json").values)
scala> data.printSchema
scala> data.show
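The wholeTextFiles approach above suits Spark 1.x shells where sqlContext is predefined. If you are on Spark 2.2 or later, the JSON reader's built-in multiLine option does the same job; a minimal sketch, assuming the same file path:

// Spark 2.2+ alternative: read multi-line JSON directly
scala> val data = spark.read.option("multiLine", true).json("file:///home/bdp/data/employees_multiLine.json")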

Add New Column in DataFrame

 
 
// lit lives in org.apache.spark.sql.functions
scala> import org.apache.spark.sql.functions.lit
scala> val ingestedDate = java.time.LocalDate.now
scala> val jsonDfWithDate = data.withColumn("ingestedDate", lit(ingestedDate.toString))

lit: creates a Column from a literal value. withColumn expects a Column as its second argument, so a plain Scala value must be wrapped with lit.

Here, we have added a new column, ingestedDate, to the DataFrame and populated it with the ingestion date as a literal value.
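The same pattern scales to several audit columns at once by chaining withColumn calls. Below is a minimal sketch using the built-in current_timestamp function; the column names ingestedTimestamp and sourceSystem, and the "hr_emp" tag, are just hypothetical examples:

scala> import org.apache.spark.sql.functions.{current_timestamp, lit}
scala> val jsonDfWithAudit = jsonDfWithDate.
     |   withColumn("ingestedTimestamp", current_timestamp()).  // load time
     |   withColumn("sourceSystem", lit("hr_emp"))              // hypothetical source tag
scala> jsonDfWithAudit.printSchema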

Drop Column in DataFrame

 
 
scala> val jsonDfWithoutManager = jsonDfWithDate.drop("manager")
scala> jsonDfWithoutManager.printSchema
scala> jsonDfWithoutManager.show
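Since Spark 2.0, drop also accepts multiple column names in a single call, which is handy when several columns are not needed downstream; a sketch, with hire_date chosen purely as an example:

scala> val jsonDfTrimmed = jsonDfWithDate.drop("manager", "hire_date")
scala> jsonDfTrimmed.printSchema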

Rename Column in DataFrame

 
 
scala> val jsonWithRename = jsonDfWithoutManager.withColumnRenamed("ingestedDate", "ingestedDateTime")
scala> jsonWithRename.printSchema
scala> jsonWithRename.show
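One thing to watch: withColumnRenamed silently returns the DataFrame unchanged if the old column name does not exist, so a typo in the name produces no error. An equivalent rename can also be written with select and alias; a minimal sketch:

scala> import org.apache.spark.sql.functions.col
scala> val jsonWithAlias = jsonDfWithoutManager.select(
     |   jsonDfWithoutManager.columns.map(c =>
     |     if (c == "ingestedDate") col(c).alias("ingestedDateTime") else col(c)): _*)
scala> jsonWithAlias.printSchema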

Wrapping Up

In this post, we have learned how to add, drop, and rename columns in a Spark DataFrame. These are common tasks whenever we process data using Spark DataFrames.
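For reference, the whole flow can be condensed into one chained expression; a minimal sketch assuming the same spark-shell session and file path as above:

scala> import org.apache.spark.sql.functions.lit
scala> val result = sqlContext.read.
     |   json(sc.wholeTextFiles("file:///home/bdp/data/employees_multiLine.json").values).
     |   withColumn("ingestedDate", lit(java.time.LocalDate.now.toString)).
     |   drop("manager").
     |   withColumnRenamed("ingestedDate", "ingestedDateTime")
scala> result.show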
