Trim Column in PySpark DataFrame

Requirement

As we received data/files from multiple sources, the chances are high to have issues in the data. Let’s say, we have received a CSV file, and most of the columns are of String data type in the file. We found some data missing in the target table after processing the given file.

We identified that a column having spaces in the data, as a return, it is not behaving correctly in some of the logics like a filter, joins, etc. In this post, we will see how to remove the space of the column data i.e. trim column in PySpark.

Solution

Step 1: Sample Dataframe

from pyspark.sql.types import StructType,StructField, StringType, IntegerType
data = [(9369,"SMITH"," CLEARK ","BANGALORE"),
    (9499,"ALLEN","CLEARK","HYDERABAD")]

schema = StructType([ \
    StructField("empno",IntegerType(),True), \
    StructField("ename",StringType(),True), \
    StructField("designation",StringType(),True), \
    StructField("location", StringType(), True), 
  ])
 
df = spark.createDataFrame(data=data,schema=schema)

df.show()

There are only 2 records having the same designation. But if you do the distinct value, you will see 2 records that means both the values are different as there is a space in a value.

 df.select(df['designation']).distinct().show()

Step 2: Trim column of DataFrame

The trim is an inbuild function available. We need to import it using the below command:

from pyspark.sql import functions as fun

for colname in df.columns:
  df = df.withColumn(colname, fun.trim(fun.col(colname)))

df.select(df['designation']).distinct().show()

Here, I have trimmed all the column’s values. When we performed distinct operation, it has given only a single value CLEARK.

Wrapping Up

In this post, we have learned to remove spaces in the column value in the dataframe. We can add this for all the string data types before processing the data.

Sharing is caring!

Subscribe to our newsletter
Loading

Leave a Reply