Requirement
In this post, we will convert an RDD to a DataFrame in PySpark.
Solution
Let’s create some dummy data and load it into an RDD. After that, we will convert the RDD to a DataFrame with a defined schema.
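The snippets below assume a SparkSession named spark, which the pyspark shell creates for you automatically. If you are running this as a standalone script, a minimal setup would look something like this (the app name is just an illustrative choice):

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the pyspark shell does this automatically
spark = SparkSession.builder \
    .appName("rdd-to-dataframe-demo") \
    .getOrCreate()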
# Create the RDD
empData = [(7389, "SMITH", "CLERK", 9902, "2010-12-17", 8000.00, 20),
           (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)]
empRDD = spark.sparkContext.parallelize(empData)
# Print the RDD content
for row in empRDD.collect():
    print(row)
Define column names:
# Define the column names
cols = ["Empno", "Empname", "Designation", "Manager", "Hire_date", "Sal", "Deptno"]
Convert the RDD to a DataFrame:
empDF = empRDD.toDF(cols)
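Since we only supplied column names, Spark infers the column types from the Python values. You can verify the result like this:

# Inspect the inferred schema
empDF.printSchema()

Here, Empno should come out as long and Sal as double, while Hire_date stays a string, since we stored the dates as plain strings.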
Here is the full code:
empData = [(7389, "SMITH", "CLERK", 9902, "2010-12-17", 8000.00, 20),
           (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)]
empRDD = spark.sparkContext.parallelize(empData)

# Print the RDD content
for row in empRDD.collect():
    print(row)

# Define the column names
cols = ["Empno", "Empname", "Designation", "Manager", "Hire_date", "Sal", "Deptno"]

# Convert the RDD to a DataFrame
empDF = empRDD.toDF(cols)
empDF.show()
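Running this should print output along these lines (exact column widths may vary):

+-----+-------+-----------+-------+----------+------+------+
|Empno|Empname|Designation|Manager| Hire_date|   Sal|Deptno|
+-----+-------+-----------+-------+----------+------+------+
| 7389|  SMITH|      CLERK|   9902|2010-12-17|8000.0|    20|
| 7499|  ALLEN|   SALESMAN|   9698|2011-02-20|9000.0|    30|
+-----+-------+-----------+-------+----------+------+------+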
We can also convert the RDD to a DataFrame with the createDataFrame method:
empDF2 = spark.createDataFrame(empRDD).toDF(*cols)
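Both approaches above let Spark infer the column types. If you want to control the types explicitly, you can pass a StructType schema to createDataFrame. Here is a minimal sketch; the field types are illustrative choices for this data, and empSchema and empDF3 are names introduced just for this example:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, DoubleType

# Explicit schema: a name and type for each column
empSchema = StructType([
    StructField("Empno", IntegerType(), True),
    StructField("Empname", StringType(), True),
    StructField("Designation", StringType(), True),
    StructField("Manager", IntegerType(), True),
    StructField("Hire_date", StringType(), True),
    StructField("Sal", DoubleType(), True),
    StructField("Deptno", IntegerType(), True),
])

empDF3 = spark.createDataFrame(empRDD, schema=empSchema)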
Wrapping Up
We can define the column names while converting the RDD to a DataFrame, which makes the columns much easier to understand. If no names are passed, Spark falls back to its default naming convention and produces columns such as _1, _2, _3, and so on.
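For example, converting the same RDD without column names (a quick sketch to illustrate the default naming):

# Without column names, Spark assigns defaults such as _1, _2, ...
defaultDF = empRDD.toDF()
print(defaultDF.columns)  # expected: ['_1', '_2', '_3', '_4', '_5', '_6', '_7']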