Spark with Scala

In this post, we will explore how to read data from Apache Kafka in a Spark Streaming application. Apache Kafka is a distributed streaming platform that provides a reliable and scalable way to publish and subscribe to streams of records. Problem Statement: We want to develop a Spark Streaming application… Read More →
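
The excerpt is truncated here, but a common way to wire this up is Structured Streaming's built-in Kafka source. A minimal sketch, assuming a spark-shell session (so spark is predefined), the spark-sql-kafka connector on the classpath, and placeholder broker and topic names:

val kafkaDF = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
  .option("subscribe", "my-topic")                     // placeholder topic
  .load()

// Kafka delivers key and value as binary; cast to strings before processing
val messages = kafkaDF.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

// Write the stream to the console to verify the pipeline end to end
val query = messages.writeStream
  .format("console")
  .start()

query.awaitTermination()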

Requirement: In this post, we are going to learn how to check if a Dataframe is empty in Spark. This check is an important part of development, since it decides whether the transformation logic should run on the Dataframe at all. Solution: Let’s first understand how this can… Read More →
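
Where the post cuts off, the usual idiom is Dataset.isEmpty (Spark 2.4+), or head(1).isEmpty on older versions. A small sketch, assuming spark-shell and a hypothetical input path:

val df = spark.read.option("header", "true").csv("/path/to/input.csv") // hypothetical path

// isEmpty checks for at least one row without counting the whole Dataframe
if (df.isEmpty) {
  println("Dataframe is empty; skipping transformations")
} else {
  // transformation logic runs only when data is present
  df.show()
}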

Requirement: Let’s say we have a data file with a TSV extension. It is much like a CSV file. What is the difference between CSV and TSV? The difference is the delimiter: a CSV file stores data separated by “,”, whereas a TSV stores data separated… Read More →
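
Past the truncation, reading a TSV in Spark is just the CSV reader with the separator option set to a tab. A minimal sketch (spark-shell, hypothetical path):

val tsvDF = spark.read
  .option("header", "true")
  .option("sep", "\t")        // tab delimiter instead of the default ","
  .csv("/path/to/data.tsv")   // hypothetical path

tsvDF.show()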

Requirement: In this post, we will learn how to get the last element of a delimited list in a dataframe column in Spark. Solution: Create a dataframe with dummy data: val df = spark.createDataFrame(Seq( ("1100", "Person1", "Street1#Location1#City1", null), ("1200", "Person2", "Street2#Location2#City2", "Contact2"), ("1300", "Person3", "Street3#Location3#City3", null), ("1400", "Person4", null, "Contact4"), ("1500", "Person5", "Street5#Location5#City5", null) )).toDF("id",… Read More →
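
The snippet truncates at toDF("id",; the remaining column names below are assumed from the data. One way to finish the job is split plus element_at (Spark 2.4+), which accepts -1 for the last element:

import org.apache.spark.sql.functions.{element_at, split}
import spark.implicits._

val df = spark.createDataFrame(Seq(
  ("1100", "Person1", "Street1#Location1#City1", null),
  ("1200", "Person2", "Street2#Location2#City2", "Contact2"),
  ("1300", "Person3", "Street3#Location3#City3", null),
  ("1400", "Person4", null, "Contact4"),
  ("1500", "Person5", "Street5#Location5#City5", null)
)).toDF("id", "name", "address", "contact") // column names after "id" are assumed

// Split the address on "#" and pick the last element (the city);
// null addresses simply yield null
val withCity = df.withColumn("city", element_at(split($"address", "#"), -1))
withCity.show(false)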

Requirement: In this post, we will learn how to get or extract a value from a row. Whenever we extract a value from a column of a row, we get it back as a generic object. For example, if we have a data frame with personal details like id, name, location,… Read More →
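
Beyond the cutoff, the typical pattern is to grab a Row with first() or collect() and read typed values by position or by field name. A sketch (spark-shell, with assumed column names):

import spark.implicits._

val df = Seq(("1100", "Person1", "Location1")).toDF("id", "name", "location")

// first() returns a Row; extract typed values instead of raw objects
val row = df.first()
val name = row.getString(1)                   // by position
val location = row.getAs[String]("location")  // by field name
println(s"$name is based in $location")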

Requirement: In this post, we are going to learn how to compare the data of two data frames in Spark. Let’s see a scenario where your daily job consumes data from the source system and appends it to the target table, as it is a Delta/Incremental load. There is a possibility to… Read More →
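
Where the post breaks off, one simple comparison tool is except, which returns rows present in one frame but missing from the other. A sketch (spark-shell) with hypothetical target and incoming frames:

import spark.implicits._

val target   = Seq((1, "A"), (2, "B"), (3, "C")).toDF("id", "value")
val incoming = Seq((3, "C"), (4, "D")).toDF("id", "value")

// Rows in `incoming` that are not already in `target`, i.e. safe to append
val newRows = incoming.except(target)
newRows.show()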

Requirement: In this post, we will convert an RDD to a Dataframe in Spark with Scala. Solution: Approach 1: Using Schema Struct Type //Create RDD: val dummyRDD = sc.parallelize(Seq( ("1001", "Ename1", "Designation1", "Manager1"), ("1003", "Ename2", "Designation2", "Manager2"), ("1001", "Ename3", "Designation3", "Manager3") )) val schema = StructType( StructField("empno", StringType, true) ::… Read More →
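
The snippet truncates after the first StructField; the remaining fields and the Row mapping below are assumed from the tuple data. A complete sketch of Approach 1 (spark-shell):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val dummyRDD = sc.parallelize(Seq(
  ("1001", "Ename1", "Designation1", "Manager1"),
  ("1003", "Ename2", "Designation2", "Manager2"),
  ("1001", "Ename3", "Designation3", "Manager3")
))

// Field names past "empno" are assumed to follow the same pattern
val schema = StructType(
  StructField("empno", StringType, true) ::
  StructField("ename", StringType, true) ::
  StructField("designation", StringType, true) ::
  StructField("manager", StringType, true) :: Nil
)

// Convert each tuple to a Row, then apply the schema
val rowRDD = dummyRDD.map { case (a, b, c, d) => Row(a, b, c, d) }
val df = spark.createDataFrame(rowRDD, schema)
df.show()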

Requirement: In this post, we are going to create an RDD and then read its content in Spark. Solution: //Create RDD: val dummyRDD = sc.parallelize(Seq( ("1001", "Ename1", "Designation1", "Manager1"), ("1003", "Ename2", "Designation2", "Manager2"), ("1001", "Ename3", "Designation3", "Manager3") )) //Read RDD dummyRDD.collect().foreach(println(_)) //Read specific Column: dummyRDD.collect().foreach(data => println(data._1,… Read More →
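
The last line truncates mid-call; the continuation below is assumed. Laid out as a runnable whole (spark-shell):

val dummyRDD = sc.parallelize(Seq(
  ("1001", "Ename1", "Designation1", "Manager1"),
  ("1003", "Ename2", "Designation2", "Manager2"),
  ("1001", "Ename3", "Designation3", "Manager3")
))

// collect() brings the data to the driver; println prints each tuple
dummyRDD.collect().foreach(println(_))

// Read specific columns by tuple position (second argument is assumed)
dummyRDD.collect().foreach(data => println(data._1, data._2))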