Convert RDD to DataFrame in PySpark

Requirement

In this post, we will convert an RDD to a DataFrame in PySpark.

Solution

Let’s create dummy data and load it into an RDD. After that, we will convert the RDD to a DataFrame with defined column names.

# Create RDD
empData = [(7389, "SMITH", "CLERK", 9902, "2010-12-17", 8000.00, 20),
           (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)]
empRDD = spark.sparkContext.parallelize(empData)

# Print the RDD content
for row in empRDD.collect():
    print(row)

Define column names:

# Define column names
cols = ["Empno", "Empname", "Designation", "Manager", "Hire_date", "Sal", "Deptno"]

Convert the RDD to a DataFrame:

empDF = empRDD.toDF(cols)

Here is the full code:

empData = [(7389, "SMITH", "CLERK", 9902, "2010-12-17", 8000.00, 20),
           (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)]
empRDD = spark.sparkContext.parallelize(empData)

# Print the RDD content
for row in empRDD.collect():
    print(row)

# Define column names
cols = ["Empno", "Empname", "Designation", "Manager", "Hire_date", "Sal", "Deptno"]

# Convert the RDD to a DataFrame
empDF = empRDD.toDF(cols)
empDF.show()

We can also convert an RDD to a DataFrame with the below command:

empDF2 = spark.createDataFrame(empRDD).toDF(*cols)

Wrapping Up

We can define the column names while converting the RDD to a DataFrame, which makes the columns easier to understand. If no column names are passed, the DataFrame is created with default names such as _1, _2, _3, and so on.

