Requirement

In the previous post, we have learned about when and how to use SELECT in DataFrame. It is useful when we want to select a column, all columns of a DataFrames. Let’s say we want to add any expression in the query like length, case statement, etc, then SELECT will not be able to fulfill the requirement. There is am another option SELECTExpr. Here, In this post, we are going to learn SelectExpr in DataFrames.

You can check out the post related to SELECT in Spark DataFrame.

Sample Data

empno	ename	designation	manager	hire_date	sal	deptno	location
9369	SMITH	CLERK	7902	12/17/1980	800	20	BANGALORE
9499	ALLEN	SALESMAN	7698	2/20/1981	1600	30	HYDERABAD
9521	WARD	SALESMAN	7698	2/22/1981	1250	30	PUNE
9566	TURNER	MANAGER	7839	4/2/1981	2975	20	MUMBAI
9654	MARTIN	SALESMAN	7698	9/28/1981	1250	30	CHENNAI
9369	SMITH	CLERK	7902	12/17/1980	800	20	KOLKATA

Solution

Step 1: Load CSV in DataFrame

val empDf = spark.read.option("header", "true").option("inferSchema", "true").csv("/Users/dipak_shaw/bdp/data/emp_data1.csv")

Step 2: SelectExpr in DataFrame

Use Case 1: Add default value to column value in DataFrame

First, performed the expression using SELECT in the dataframe.

scala> empDf.select("*", "sal + 1000").show

It has thrown an error

org.apache.spark.sql.AnalysisException: cannot resolve ‘`sal + 1000`’ given input columns: [manager, deptno, ename, hire_date, sal, location, empno, designation];;‘Project [empno#80, ename#81, designation#82, manager#83, hire_date#84, sal#85, deptno#86, location#87, ‘sal + 1000]+- Relation[empno#80,ename#81,designation#82,manager#83,hire_date#84,sal#85,deptno#86,location#87] csv

Now, use SelectExpr

 scala> empDf.selectExpr("*", "sal + 1000").show

Use case 2: Change Column Data Type

val empCastDf = empDf.select("cast (empno as STRING) as empno_str").show

scala> val empCastDf = empDf.selectExpr("cast (empno as STRING) as empno_str")
scala> empCastDf.show
scala> empCastDf.dtypes

Use Case 3: CASE Statement

scala> empDf.selectExpr("location", "case when location=\"BANGALORE\" then \"TIER 1\" ELSE \"TIER 2\" end AS CityTier").show

org.apache.spark.sql.AnalysisException: cannot resolve ‘`case when location=”BANGALORE” then “TIER 1” ELSE “TIER 2” end AS CityTier`’ given input columns: [manager, deptno, ename, hire_date, sal, location, empno, designation];;

‘Project [location#87, ‘case when location=”BANGALORE” then “TIER 1” ELSE “TIER 2” end AS CityTier]

scala> empDf.selectExpr("location", "case when location=\"BANGALORE\" then \"TIER 1\" ELSE \"TIER 2\" end AS CityTier").show

Wrapping Up

In this post, we have learned when and how to use SelectExpr in Spark DataFrame. Here, we have seen 3 use cases. You can explore more using different expression scenarios.

You can check out the post related to SELECT in Spark DataFrame.

SELECTExpr in Spark DataFrame