Trim Column in PySpark DataFrame

Requirement

Because data files arrive from multiple sources, data quality issues are likely. Suppose we receive a CSV file in which most columns are of String data type, and after processing it we find that some data is missing from the target table.

We identified that one column contained leading or trailing spaces in its values. As a result, logic such as filters and joins did not behave correctly. In this post, we will see how to remove those spaces from column data, i.e. trim a column in PySpark.
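
To see why padded values break such logic, here is a minimal plain-Python sketch (no Spark required) of the exact string matching that an equality filter or an equi-join key comparison relies on. The grade_by_designation lookup table and its G1 value are hypothetical and exist only for illustration; the employee rows mirror the sample data used later in this post.

```python
# Hypothetical lookup table standing in for the other side of a join.
grade_by_designation = {"CLEARK": "G1"}

# (empno, designation) pairs; the first designation is padded with spaces.
rows = [(9369, " CLEARK "), (9499, "CLEARK")]

# Exact-match lookup on the raw key: the padded value finds no partner.
matched_raw = [(empno, grade_by_designation.get(desig)) for empno, desig in rows]

# Same lookup after stripping leading and trailing spaces from the key.
matched_trimmed = [
    (empno, grade_by_designation.get(desig.strip(" "))) for empno, desig in rows
]

print(matched_raw)      # [(9369, None), (9499, 'G1')]
print(matched_trimmed)  # [(9369, 'G1'), (9499, 'G1')]
```

The row that silently drops out of the raw match is the same kind of "missing data" described in the requirement.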

Solution

Step 1: Sample Dataframe

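from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Reuse the active session (as in a notebook or the pyspark shell) or start one
spark = SparkSession.builder.getOrCreate()

# The first designation value has a leading and a trailing space
data = [(9369, "SMITH", " CLEARK ", "BANGALORE"),
    (9499, "ALLEN", "CLEARK", "HYDERABAD")]

schema = StructType([
    StructField("empno", IntegerType(), True),
    StructField("ename", StringType(), True),
    StructField("designation", StringType(), True),
    StructField("location", StringType(), True),
])

df = spark.createDataFrame(data=data, schema=schema)

df.show()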

Both records are meant to have the same designation. However, selecting the distinct values returns 2 records, which means the two values are treated as different because one of them contains spaces.

df.select(df['designation']).distinct().show()
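
When it is not obvious which column is affected, one possible approach is to count, per column, the rows whose value differs from its trimmed form. The sketch below only builds Spark SQL expression strings, so it runs without a cluster. The helper name padded_count_exprs is hypothetical, and passing the result to df.selectExpr assumes the df created above.

```python
def padded_count_exprs(column_names):
    """Build one Spark SQL expression per column that counts the rows
    whose value has leading or trailing spaces."""
    return [
        f"sum(CASE WHEN `{name}` <> trim(`{name}`) THEN 1 ELSE 0 END) AS `{name}`"
        for name in column_names
    ]


exprs = padded_count_exprs(["ename", "designation", "location"])
for expr in exprs:
    print(expr)

# With the DataFrame from Step 1 (requires a running Spark session):
# df.selectExpr(*exprs).show()
```

For the sample data, designation should be the only column with a non-zero count.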

Step 2: Trim column of DataFrame

trim is a built-in function in pyspark.sql.functions. Import the module and apply it to each column as shown below:

from pyspark.sql import functions as fun

# Replace each column with a copy whose leading and trailing spaces are removed
for colname in df.columns:
    df = df.withColumn(colname, fun.trim(fun.col(colname)))

df.select(df['designation']).distinct().show()

Here, every column's values have been trimmed. The distinct operation now returns only a single value, CLEARK.
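
Since trim produces a string result, passing every column through it also turns non-string columns such as empno into strings. A minimal sketch of a variant that trims only the string columns follows. The helper names string_column_names and trim_string_columns are hypothetical, and the sketch relies on DataFrame.dtypes reporting StringType columns as "string".

```python
def string_column_names(dtypes):
    """Return the names of string-typed columns from (name, type) pairs,
    in the shape returned by DataFrame.dtypes."""
    return [name for name, dtype in dtypes if dtype == "string"]


def trim_string_columns(df):
    """Return a DataFrame with only its string columns trimmed."""
    # Imported here so the helper above stays usable without PySpark installed.
    from pyspark.sql import functions as fun

    to_trim = set(string_column_names(df.dtypes))
    return df.select([
        fun.trim(fun.col(name)).alias(name) if name in to_trim else fun.col(name)
        for name in df.columns
    ])


# The (name, type) pairs that df.dtypes reports for the sample schema.
sample_dtypes = [("empno", "int"), ("ename", "string"),
                 ("designation", "string"), ("location", "string")]
print(string_column_names(sample_dtypes))  # ['ename', 'designation', 'location']
```

With the DataFrame from Step 1, df = trim_string_columns(df) would leave empno as an integer column while still collapsing the two designation values into one.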

Wrapping Up

In this post, we have learned how to remove leading and trailing spaces from column values in a DataFrame. This step can be applied to all string columns before processing the data.

