Requirement
In the previous post, we have learned about when and how to use SELECT in DataFrame. It is useful when we want to select a column, all columns of a DataFrames. Let’s say we want to add any expression in the query like length, case statement, etc, then SELECT will not be able to fulfill the requirement. There is am another option SELECTExpr. Here, In this post, we are going to learn SelectExpr in DataFrames.
You can check out the post related to SELECT in Spark DataFrame.
Sample Data
empno | ename | designation | manager | hire_date | sal | deptno | location |
9369 | SMITH | CLERK | 7902 | 12/17/1980 | 800 | 20 | BANGALORE |
9499 | ALLEN | SALESMAN | 7698 | 2/20/1981 | 1600 | 30 | HYDERABAD |
9521 | WARD | SALESMAN | 7698 | 2/22/1981 | 1250 | 30 | PUNE |
9566 | TURNER | MANAGER | 7839 | 4/2/1981 | 2975 | 20 | MUMBAI |
9654 | MARTIN | SALESMAN | 7698 | 9/28/1981 | 1250 | 30 | CHENNAI |
9369 | SMITH | CLERK | 7902 | 12/17/1980 | 800 | 20 | KOLKATA |
Solution
Step 1: Load CSV in DataFrame
val empDf = spark.read.option("header", "true").option("inferSchema", "true").csv("/Users/dipak_shaw/bdp/data/emp_data1.csv")
Step 2: SelectExpr in DataFrame
Use Case 1: Add default value to column value in DataFrame
First, performed the expression using SELECT in the dataframe.
scala> empDf.select("*", "sal + 1000").show
It has thrown an error
org.apache.spark.sql.AnalysisException: cannot resolve ‘`sal + 1000`’ given input columns: [manager, deptno, ename, hire_date, sal, location, empno, designation];;‘Project [empno#80, ename#81, designation#82, manager#83, hire_date#84, sal#85, deptno#86, location#87, ‘sal + 1000]+- Relation[empno#80,ename#81,designation#82,manager#83,hire_date#84,sal#85,deptno#86,location#87] csv
Now, use SelectExpr
scala> empDf.selectExpr("*", "sal + 1000").show
Use case 2: Change Column Data Type
val empCastDf = empDf.select("cast (empno as STRING) as empno_str").show
scala> val empCastDf = empDf.selectExpr("cast (empno as STRING) as empno_str") scala> empCastDf.show scala> empCastDf.dtypes
Use Case 3: CASE Statement
scala> empDf.selectExpr("location", "case when location=\"BANGALORE\" then \"TIER 1\" ELSE \"TIER 2\" end AS CityTier").show
org.apache.spark.sql.AnalysisException: cannot resolve ‘`case when location=”BANGALORE” then “TIER 1” ELSE “TIER 2” end AS CityTier`’ given input columns: [manager, deptno, ename, hire_date, sal, location, empno, designation];;
‘Project [location#87, ‘case when location=”BANGALORE” then “TIER 1” ELSE “TIER 2” end AS CityTier]
scala> empDf.selectExpr("location", "case when location=\"BANGALORE\" then \"TIER 1\" ELSE \"TIER 2\" end AS CityTier").show
Wrapping Up
In this post, we have learned when and how to use SelectExpr in Spark DataFrame. Here, we have seen 3 use cases. You can explore more using different expression scenarios.
You can check out the post related to SELECT in Spark DataFrame.