Requirement
In this post, we will convert an RDD to a DataFrame in PySpark.
Solution
Let’s create some dummy data and load it into an RDD. After that, we will convert the RDD to a DataFrame with a defined schema.
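The snippets below assume a SparkSession named spark, which the pyspark shell creates for you automatically. If you are running this as a standalone script, a minimal setup would look something like this (the app name is just an illustrative choice):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the pyspark shell does this automatically
spark = SparkSession.builder \
    .appName("rdd-to-dataframe-demo") \
    .getOrCreate()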
# Create the RDD
empData = [(7389, "SMITH", "CLERK", 9902, "2010-12-17", 8000.00, 20),
           (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)]
empRDD = spark.sparkContext.parallelize(empData)
# Print the RDD content
for row in empRDD.collect():
    print(row)
Define column names:
# Define the column names
cols = ["Empno", "Empname", "Designation", "Manager", "Hire_date", "Sal", "Deptno"]
Convert the RDD to a DataFrame:
empDF = empRDD.toDF(cols)
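Since we only supplied column names, Spark infers the column types from the Python values. You can verify the result like this:

# Inspect the inferred schema
empDF.printSchema()

Here, Empno should come out as long and Sal as double, while Hire_date stays a string, since we stored the dates as plain strings.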
Here is the full code:
empData = [(7389, "SMITH", "CLERK", 9902, "2010-12-17", 8000.00, 20),
           (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)]
empRDD = spark.sparkContext.parallelize(empData)

# Print the RDD content
for row in empRDD.collect():
    print(row)

# Define the column names
cols = ["Empno", "Empname", "Designation", "Manager", "Hire_date", "Sal", "Deptno"]

# Convert the RDD to a DataFrame
empDF = empRDD.toDF(cols)
empDF.show()
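Running this should print output along these lines (exact column widths may vary):

+-----+-------+-----------+-------+----------+------+------+
|Empno|Empname|Designation|Manager| Hire_date|   Sal|Deptno|
+-----+-------+-----------+-------+----------+------+------+
| 7389|  SMITH|      CLERK|   9902|2010-12-17|8000.0|    20|
| 7499|  ALLEN|   SALESMAN|   9698|2011-02-20|9000.0|    30|
+-----+-------+-----------+-------+----------+------+------+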
We can also convert the RDD to a DataFrame with the createDataFrame method:
empDF2 = spark.createDataFrame(empRDD).toDF(*cols)
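Both approaches above let Spark infer the column types. If you want to control the types explicitly, you can pass a StructType schema to createDataFrame. Here is a minimal sketch; the field types are illustrative choices for this data, and empSchema and empDF3 are names introduced just for this example:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Explicit schema: a name and type for each column
empSchema = StructType([
    StructField("Empno", IntegerType(), True),
    StructField("Empname", StringType(), True),
    StructField("Designation", StringType(), True),
    StructField("Manager", IntegerType(), True),
    StructField("Hire_date", StringType(), True),
    StructField("Sal", DoubleType(), True),
    StructField("Deptno", IntegerType(), True),
])

empDF3 = spark.createDataFrame(empRDD, schema=empSchema)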
Wrapping Up
We can define the column names while converting the RDD to a DataFrame, which makes the columns much easier to understand. If no names are passed, Spark falls back to its default naming convention and produces columns such as _1, _2, _3, and so on.
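For example, converting the same RDD without column names (a quick sketch to illustrate the default naming):

# Without column names, Spark assigns defaults such as _1, _2, ...
defaultDF = empRDD.toDF()
print(defaultDF.columns)  # expected: ['_1', '_2', '_3', '_4', '_5', '_6', '_7']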