Requirement
As we received data/files from multiple sources, the chances are high to have issues in the data. Let’s say, we have received a CSV file, and most of the columns are of String data type in the file. We found some data missing in the target table after processing the given file.
We identified that a column having spaces in the data, as a return, it is not behaving correctly in some of the logics like a filter, joins, etc. In this post, we will see how to remove the space of the column data i.e. trim column in PySpark.
Solution
Step 1: Sample Dataframe
from pyspark.sql.types import StructType,StructField, StringType, IntegerType data = [(9369,"SMITH"," CLEARK ","BANGALORE"), (9499,"ALLEN","CLEARK","HYDERABAD")] schema = StructType([ \ StructField("empno",IntegerType(),True), \ StructField("ename",StringType(),True), \ StructField("designation",StringType(),True), \ StructField("location", StringType(), True), ]) df = spark.createDataFrame(data=data,schema=schema) df.show()
There are only 2 records having the same designation. But if you do the distinct value, you will see 2 records that means both the values are different as there is a space in a value.
df.select(df['designation']).distinct().show()
Step 2: Trim column of DataFrame
The trim is an inbuild function available. We need to import it using the below command:
from pyspark.sql import functions as fun for colname in df.columns: df = df.withColumn(colname, fun.trim(fun.col(colname))) df.select(df['designation']).distinct().show()
Here, I have trimmed all the column’s values. When we performed distinct operation, it has given only a single value CLEARK.
Wrapping Up
In this post, we have learned to remove spaces in the column value in the dataframe. We can add this for all the string data types before processing the data.