Print RDD in PySpark

Requirement

In this post, we will see how to print the contents of an RDD in PySpark.

Solution

Let’s take some dummy data: 2 rows of employee data with 7 columns each.

 empData = [(7389, "SMITH", "CLERK", 9902, "2010-12-17", 8000.00, 20),
            (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)]

Load the data into an RDD named empRDD using the command below (spark is an existing SparkSession, such as the one the pyspark shell provides):

 empRDD = spark.sparkContext.parallelize(empData)

Here, empRDD is an RDD. Let’s print its contents:

 # Print the RDD content
 for row in empRDD.collect():
     print(row)

Here, we call the collect() action on the RDD, which returns all of its rows to the driver as a Python list, and then iterate over that list to print each row.
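Because collect() brings every row back to the driver, it can be expensive on a large RDD. A minimal sketch of a helper that prints only the first few rows using the take() action is shown here. The function name print_rdd and its limit parameter are hypothetical, introduced only for illustration; the helper assumes it is passed an RDD such as empRDD.

```python
def print_rdd(rdd, limit=None):
    """Print rows of an RDD on the driver and return them as a list.

    print_rdd and limit are hypothetical names used for this sketch.
    """
    # take(n) fetches only the first n rows; collect() fetches all of them.
    rows = rdd.collect() if limit is None else rdd.take(limit)
    for row in rows:
        print(row)
    return rows
```

Under these assumptions, print_rdd(empRDD) prints both employee rows, while print_rdd(empRDD, limit=1) prints only the first one.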

 
