RDD

Requirement: In this post, we will convert an RDD to a DataFrame in Spark with Scala.

Solution, Approach 1: Using a schema StructType

//Create RDD:
val dummyRDD = sc.parallelize(Seq(
  ("1001", "Ename1", "Designation1", "Manager1"),
  ("1003", "Ename2", "Designation2", "Manager2"),
  ("1001", "Ename3", "Designation3", "Manager3")
))

val schema = StructType(StructField("empno", StringType, true) :: Read More →

Requirement: In this post, we are going to create an RDD and then read its content in Spark.

Solution:

//Create RDD:
val dummyRDD = sc.parallelize(Seq(
  ("1001", "Ename1", "Designation1", "Manager1"),
  ("1003", "Ename2", "Designation2", "Manager2"),
  ("1001", "Ename3", "Designation3", "Manager3")
))

//Read RDD
dummyRDD.collect().foreach(println(_))

//Read specific column:
dummyRDD.collect().foreach(data => println(data._1, Read More →

Requirement: In this post, we will convert RDD to Dataframe in Pyspark.

Solution: Let's create dummy data and load it into an RDD. After that, we will convert RDD to Dataframe with a defined schema.

# Create RDD
empData = [(7389, "SMITH", "CLEARK", 9902, "2010-12-17", 8000.00, 20),
           (7499, "ALLEN", "SALESMAN", Read More →

Requirement: In this post, we will see how to print RDD content in Pyspark.

Solution: Let's take dummy data. We have 2 rows of employee data with 7 columns.

empData = [(7389, "SMITH", "CLEARK", 9902, "2010-12-17", 8000.00, 20),
           (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)]

Load the data into Read More →