SELECTExpr in Spark DataFrame

Requirement

In the previous post, we have learned about when and how to use SELECT in DataFrame. It is useful when we want to select a column, all columns of a DataFrames. Let’s say we want to add any expression in the query like length, case statement, etc, then SELECT will not be able to fulfill the requirement. There is am another option SELECTExpr. Here, In this post, we are going to learn SelectExpr in DataFrames.

You can check out the post related to SELECT in Spark DataFrame.

Sample Data

empno

ename

designation

manager

hire_date

sal

deptno

location

9369

SMITH

CLERK

7902

12/17/1980

800

20

BANGALORE

9499

ALLEN

SALESMAN

7698

2/20/1981

1600

30

HYDERABAD

9521

WARD

SALESMAN

7698

2/22/1981

1250

30

PUNE

9566

TURNER

MANAGER

7839

4/2/1981

2975

20

MUMBAI

9654

MARTIN

SALESMAN

7698

9/28/1981

1250

30

CHENNAI

9369

SMITH

CLERK

7902

12/17/1980

800

20

KOLKATA

Solution

Step 1: Load CSV in DataFrame

val empDf = spark.read.option("header", "true").option("inferSchema", "true").csv("/Users/dipak_shaw/bdp/data/emp_data1.csv")

Step 2: SelectExpr in DataFrame

Use Case 1: Add default value to column value in DataFrame

First, performed the expression using SELECT in the dataframe.

scala> empDf.select("*", "sal + 1000").show

It has thrown an error

org.apache.spark.sql.AnalysisException: cannot resolve ‘`sal + 1000`’ given input columns: [manager, deptno, ename, hire_date, sal, location, empno, designation];;‘Project [empno#80, ename#81, designation#82, manager#83, hire_date#84, sal#85, deptno#86, location#87, ‘sal + 1000]+- Relation[empno#80,ename#81,designation#82,manager#83,hire_date#84,sal#85,deptno#86,location#87] csv

Now, use SelectExpr

 scala> empDf.selectExpr("*", "sal + 1000").show

Use case 2: Change Column Data Type

val empCastDf = empDf.select("cast (empno as STRING) as empno_str").show

scala> val empCastDf = empDf.selectExpr("cast (empno as STRING) as empno_str")
scala> empCastDf.show
scala> empCastDf.dtypes

Use Case 3: CASE Statement

scala> empDf.selectExpr("location", "case when location=\"BANGALORE\" then \"TIER 1\" ELSE \"TIER 2\" end AS CityTier").show

org.apache.spark.sql.AnalysisException: cannot resolve ‘`case when location=”BANGALORE” then “TIER 1” ELSE “TIER 2” end AS CityTier`’ given input columns: [manager, deptno, ename, hire_date, sal, location, empno, designation];;

‘Project [location#87, ‘case when location=”BANGALORE” then “TIER 1” ELSE “TIER 2” end AS CityTier]

scala> empDf.selectExpr("location", "case when location=\"BANGALORE\" then \"TIER 1\" ELSE \"TIER 2\" end AS CityTier").show

Wrapping Up

In this post, we have learned when and how to use SelectExpr in Spark DataFrame. Here, we have seen 3 use cases.  You can explore more using different expression scenarios.

You can check out the post related to SELECT in Spark DataFrame.

Sharing is caring!

Subscribe to our newsletter
Loading

Leave a Reply