Requirement
In this post, we will see how to print the contents of an RDD in PySpark.
Solution
Let’s take some dummy data: 2 rows of employee data with 7 columns each.
empData = [(7389, "SMITH", "CLEARK", 9902, "2010-12-17", 8000.00, 20), (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)]
Load the data into an RDD named empRDD using the command below:
empRDD = spark.sparkContext.parallelize(empData)
Here, empRDD is an RDD. Let’s read its content:
# Print the RDD content
for row in empRDD.collect():
    print(row)
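Here, we call the collect() action on the RDD, which returns all of its elements to the driver as a Python list, and then iterate over that list to print each row. Because collect() pulls the entire RDD into driver memory, it is best suited to small datasets like this one.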