Spark (Page 4)

Requirement In this post, we will convert an RDD to a DataFrame in Spark with Scala. Solution Approach 1: Using Schema StructType //Create RDD: val dummyRDD = sc.parallelize(Seq( ("1001", "Ename1", "Designation1", "Manager1"), ("1003", "Ename2", "Designation2", "Manager2"), ("1001", "Ename3", "Designation3", "Manager3") )) val schema = StructType( StructField("empno", StringType, true) :: Read More →
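The excerpt cuts off mid-schema; a minimal self-contained sketch of the StructType approach might look like the following. The field names after empno are assumptions inferred from the tuple contents, not the post's confirmed schema.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// In spark-shell, `spark` and `sc` already exist; created here so the sketch runs standalone.
val spark = SparkSession.builder().master("local[*]").appName("rdd-to-df").getOrCreate()
val sc = spark.sparkContext

val dummyRDD = sc.parallelize(Seq(
  ("1001", "Ename1", "Designation1", "Manager1"),
  ("1003", "Ename2", "Designation2", "Manager2"),
  ("1001", "Ename3", "Designation3", "Manager3")
))

// Field names beyond empno are assumed from the tuple contents.
val schema = StructType(
  StructField("empno", StringType, true) ::
  StructField("ename", StringType, true) ::
  StructField("designation", StringType, true) ::
  StructField("manager", StringType, true) :: Nil
)

// createDataFrame expects an RDD[Row], so map each tuple to a Row first.
val rowRDD = dummyRDD.map { case (a, b, c, d) => Row(a, b, c, d) }
val df = spark.createDataFrame(rowRDD, schema)
```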

Requirement In this post, we are going to create an RDD and then read its content in Spark. Solution //Create RDD: val dummyRDD = sc.parallelize(Seq( ("1001", "Ename1", "Designation1", "Manager1"), ("1003", "Ename2", "Designation2", "Manager2"), ("1001", "Ename3", "Designation3", "Manager3") )) //Read RDD dummyRDD.collect().foreach(println(_)) //Read specific Column: dummyRDD.collect().foreach(data => println(data._1, Read More →
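Putting the excerpt's pieces together, a minimal runnable sketch (assuming a local SparkSession, since spark-shell normally provides `sc`):

```scala
import org.apache.spark.sql.SparkSession

// spark-shell provides `sc`; it is built here so the sketch is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("read-rdd").getOrCreate()
val sc = spark.sparkContext

val dummyRDD = sc.parallelize(Seq(
  ("1001", "Ename1", "Designation1", "Manager1"),
  ("1003", "Ename2", "Designation2", "Manager2"),
  ("1001", "Ename3", "Designation3", "Manager3")
))

// collect() brings all rows to the driver; print each tuple.
dummyRDD.collect().foreach(println(_))

// Print only selected fields of each tuple.
dummyRDD.collect().foreach(data => println((data._1, data._2)))
```

Note that collect() pulls the whole RDD to the driver, so it is only appropriate for small datasets like this one.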

Requirement In this post, we will see how to print RDD content in PySpark. Solution Let's take dummy data. We have 2 rows of employee data with 7 columns. empData = [(7389, "SMITH", "CLEARK", 9902, "2010-12-17", 8000.00, 20), (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)] Load the data into Read More →
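The post itself walks through this in PySpark; as a sketch of the same parallelize-collect-print idea (shown in Scala here to match the other examples on this page), using the two-row dataset from the excerpt:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("print-rdd").getOrCreate()
val sc = spark.sparkContext

// Two rows of employee data with 7 columns, as in the excerpt.
val empData = Seq(
  (7389, "SMITH", "CLEARK", 9902, "2010-12-17", 8000.00, 20),
  (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)
)
val empRDD = sc.parallelize(empData)

// collect() is fine for small data; avoid it on large RDDs.
empRDD.collect().foreach(println)
```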

Requirement The CSV file format is a very common file format used in many applications. Sometimes it contains data with additional behavior, for example a comma within a value, quotes, or multiline values. To handle this additional behavior, Spark provides options you can set while processing the data. Read More →
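To make those options concrete, here is a sketch that writes a tiny CSV with a comma inside a quoted value and reads it back; the option values shown are common settings for this kind of data, not necessarily the exact ones the post uses.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-options").getOrCreate()

// Write a tiny CSV with a comma inside a quoted value, to keep the example self-contained.
val csvPath = Files.createTempFile("emp", ".csv")
Files.write(csvPath, "empno,ename\n1001,\"Smith, John\"\n".getBytes)

val df = spark.read
  .option("header", "true")     // first line holds column names
  .option("quote", "\"")        // quoted values may contain the delimiter
  .option("escape", "\"")       // how an embedded quote character is escaped
  .option("multiLine", "true")  // a quoted value may span multiple lines
  .csv(csvPath.toString)
```

With these options, the embedded comma stays inside the single ename value instead of splitting it into two columns.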

Requirement: You have a sample DataFrame which you want to write to Parquet files using Scala. Solution: Step 1: Sample DataFrame. Use the below command: spark-shell Note: I am using Spark version 2.3. To create a sample DataFrame, please refer to Create-a-spark-dataframe-from-sample-data. After following the above post, you can see that Read More →
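A minimal sketch of the write step, assuming a small inline DataFrame in place of the one from the referenced post (the column names and temp output path here are illustrative):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("to-parquet").getOrCreate()
import spark.implicits._

// Sample DataFrame; the column names are assumptions for illustration.
val df = Seq(("1001", "Ename1"), ("1002", "Ename2")).toDF("empno", "ename")

// Write as Parquet; "overwrite" replaces the target directory if it exists.
val out = Files.createTempDirectory("emp_parquet").toString
df.write.mode("overwrite").parquet(out)

// Read it back to confirm the round trip.
val back = spark.read.parquet(out)
```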

Requirement In this post, we will learn how to convert a table's schema into a DataFrame in Spark. Sample Data empno ename designation manager hire_date sal deptno location 9369 SMITH CLERK 7902 12/17/1980 800 20 BANGALORE 9499 ALLEN SALESMAN 7698 2/20/1981 1600 30 HYDERABAD 9521 WARD SALESMAN 7698 2/22/1981 Read More →
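One way to get a table's schema as a DataFrame is DESCRIBE, which returns one row per column as (col_name, data_type, comment). This is a sketch, not necessarily the post's approach, and it registers a small stand-in temp view rather than the full emp table:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("schema-df").getOrCreate()
import spark.implicits._

// A small stand-in for the emp table from the post.
Seq((9369, "SMITH", "CLERK"))
  .toDF("empno", "ename", "designation")
  .createOrReplaceTempView("emp")

// DESCRIBE returns the schema itself as a DataFrame: one row per column.
val schemaDF = spark.sql("DESCRIBE emp")
```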

Requirement In this post, we will learn how to handle NULL in a Spark DataFrame. There are multiple ways to handle NULL during data processing. We will see how we can do it in a Spark DataFrame. Solution Create a DataFrame with dummy data val df = spark.createDataFrame(Seq( (1100, "Person1", "Location1", null), (1200, Read More →
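Continuing the dummy data the excerpt starts with (the column names and second row are assumptions), two common ways to deal with NULLs are filling them with a default and dropping rows that contain them:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("null-handling").getOrCreate()

// Dummy data in the shape the excerpt starts with; column names are assumed.
val df = spark.createDataFrame(Seq(
  (1100, "Person1", "Location1", null),
  (1200, "Person2", null, "Contact2")
)).toDF("id", "name", "location", "contact")

// Replace NULLs in string columns with a default value.
val filled = df.na.fill("Unknown")

// Or drop any row that contains a NULL.
val dropped = df.na.drop()
```

Here both rows contain a NULL, so na.drop() leaves the DataFrame empty, while na.fill keeps both rows with "Unknown" substituted.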

Requirement In this post, we will learn how to select a specific column value or all the columns in a Spark DataFrame with different approaches. Sample Data empno ename designation manager hire_date sal deptno location 9369 SMITH CLERK 7902 12/17/1980 800 20 BANGALORE 9499 ALLEN SALESMAN 7698 2/20/1981 1600 30 HYDERABAD Read More →
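A sketch of a few common selection approaches, using a cut-down three-column version of the sample data (the post's exact set of approaches may differ):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("select-cols").getOrCreate()
import spark.implicits._

// A cut-down version of the sample data (three of the eight columns).
val df = Seq((9369, "SMITH", "CLERK"), (9499, "ALLEN", "SALESMAN"))
  .toDF("empno", "ename", "designation")

// Several equivalent ways to select specific columns.
val byName   = df.select("ename")
val byCol    = df.select(col("ename"), col("designation"))
val byDollar = df.select($"empno", $"ename")

// Select all columns.
val all = df.select("*")
```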

Requirement The UDF is a user-defined function. As its name indicates, a user can create a custom function and use it wherever required. We create a UDF when the existing built-in functions are not available or cannot fulfill the requirement. Sample Data empno ename designation manager hire_date sal deptno Read More →
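A minimal sketch of defining and applying a UDF; the upper-casing function and the two-column sample here are stand-ins, not the post's actual example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("udf-demo").getOrCreate()
import spark.implicits._

// Two columns of the sample data; values lower-cased to give the UDF work to do.
val df = Seq((9369, "smith"), (9499, "allen")).toDF("empno", "ename")

// Wrap an ordinary Scala function as a UDF so it can run on DataFrame columns.
val toUpper = udf((s: String) => s.toUpperCase)

val result = df.withColumn("ename_upper", toUpper($"ename"))
```

Prefer a built-in function (here, functions.upper) when one exists; UDFs are opaque to the optimizer and usually slower.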