dataframe

In this post, we will explore how to use accumulators in Apache Spark to aggregate values across distributed tasks. Accumulators provide a way to collect and update values from worker nodes to the driver node efficiently. Problem Statement We want to collect and aggregate values from distributed tasks in Spark… Read More →
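A minimal sketch of the accumulator pattern the teaser describes: tasks on the workers add to a registered `LongAccumulator`, and the driver reads the combined total after an action runs. The accumulator name and the sample data here are illustrative, not from the post.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AccumulatorDemo").master("local[*]").getOrCreate()

// Register a long accumulator on the driver (name is illustrative)
val badRecords = spark.sparkContext.longAccumulator("badRecords")

val data = spark.sparkContext.parallelize(Seq("1", "2", "oops", "4"))
val parsed = data.flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}

parsed.count()            // an action forces the tasks to actually run
println(badRecords.value) // driver-side aggregated total
```

Note that accumulator updates are only guaranteed to be applied exactly once when made inside actions; updates inside transformations can be re-applied if a task is retried.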

In this post, we will explore how to use broadcast variables in Apache Spark to efficiently share small lookup tables or variables across distributed tasks. Broadcast variables can significantly improve the performance of Spark jobs by reducing network transfer and memory consumption. Problem Statement We want to optimize Spark jobs… Read More →
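A sketch of the broadcast pattern, assuming a small in-memory lookup map (the map contents and names are illustrative): the map is shipped to each executor once via `broadcast`, instead of being serialized into every task closure.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("BroadcastDemo").master("local[*]").getOrCreate()

// Small lookup table, broadcast once to all executors
val countryNames = Map("IN" -> "India", "US" -> "United States")
val bCountries = spark.sparkContext.broadcast(countryNames)

val codes = spark.sparkContext.parallelize(Seq("IN", "US", "IN"))
val resolved = codes.map(c => bCountries.value.getOrElse(c, "Unknown"))
resolved.collect().foreach(println)
```

Tasks read the shared copy through `bCountries.value`, which is what saves the repeated network transfer the teaser mentions.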

In this post, we will explore how to pivot data in a Spark DataFrame. Pivoting is a powerful operation that allows us to restructure our data by transforming rows into columns. Problem Given a Spark DataFrame containing sales data, we want to pivot the data to have product categories as… Read More →
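A short sketch of `groupBy(...).pivot(...)` on made-up sales data (column names and values are assumptions, not the post's dataset): rows of (year, category, amount) are restructured so each category becomes its own column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("PivotDemo").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(
  (2023, "Electronics", 100), (2023, "Clothing", 50),
  (2024, "Electronics", 120), (2024, "Clothing", 70)
).toDF("year", "category", "amount")

// One output row per year, one output column per distinct category
sales.groupBy("year").pivot("category").agg(sum("amount")).show()
```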

In this post, we will explore how to aggregate data in a Spark DataFrame. Aggregation is a crucial operation in data analysis and processing, allowing us to summarize and derive insights from large datasets. Problem Given a Spark DataFrame containing sales data, we want to aggregate the data to calculate… Read More →
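The standard `groupBy` + `agg` pattern the teaser refers to can be sketched like this, on illustrative sales data (the columns and aggregates shown are assumptions):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, sum}

val spark = SparkSession.builder().appName("AggDemo").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("Electronics", 100), ("Clothing", 50), ("Electronics", 120))
  .toDF("category", "amount")

// Summarize each category: total, average, and row count
sales.groupBy("category")
  .agg(sum("amount").as("total"), avg("amount").as("avg_amount"), count("*").as("rows"))
  .show()
```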

Requirement In this post, we are going to learn how to check if a DataFrame is empty in Spark. This is an important part of development, as this condition decides whether the transformation logic will execute on the DataFrame or not. Solution Let’s first understand how this can… Read More →
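One common way to perform this check, sketched below: `Dataset.isEmpty` (available since Spark 2.4) avoids scanning the whole dataset the way a full `count()` would; on older versions, `df.head(1).isEmpty` is a widely used equivalent. The guarded logic is illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("EmptyCheck").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq.empty[(String, Int)].toDF("name", "value")

// Gate the transformation logic on the emptiness check
if (df.isEmpty) println("DataFrame is empty - skipping transformation")
else println("Running transformation logic")
```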

Requirement Let’s say we have a data file with a TSV extension. It is similar to a CSV file. What is the difference between CSV and TSV? The difference is how the data in the file is separated: the CSV file stores data separated by “,”, whereas TSV stores data separated… Read More →
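Reading a TSV in Spark can be sketched with the same CSV reader, just with the delimiter option changed; the file path below is a placeholder, not a real file from the post.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("TsvRead").master("local[*]").getOrCreate()

val df = spark.read
  .option("header", "true")
  .option("sep", "\t")       // tab delimiter instead of the default comma
  .csv("/path/to/file.tsv")  // placeholder path
df.show()
```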

Requirement In this post, we will convert an RDD to a DataFrame in Spark with Scala. Solution Approach 1: Using Schema Struct Type

//Create RDD:
val dummyRDD = sc.parallelize(Seq(
  ("1001", "Ename1", "Designation1", "Manager1"),
  ("1003", "Ename2", "Designation2", "Manager2"),
  ("1001", "Ename3", "Designation3", "Manager3")
))
val schema = StructType(
  StructField("empno", StringType, true) :: … Read More →
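The teaser cuts off mid-schema, but the pattern it starts can be sketched end to end as follows. The field names after "empno" ("ename", "designation", "manager") are assumptions inferred from the tuple layout, not taken from the post.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().appName("RddToDf").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val dummyRDD = sc.parallelize(Seq(
  ("1001", "Ename1", "Designation1", "Manager1"),
  ("1003", "Ename2", "Designation2", "Manager2")
))

// Explicit schema; field names beyond "empno" are assumed
val schema = StructType(
  StructField("empno", StringType, true) ::
  StructField("ename", StringType, true) ::
  StructField("designation", StringType, true) ::
  StructField("manager", StringType, true) :: Nil
)

// Map each tuple to a Row, then apply the schema
val rowRDD = dummyRDD.map { case (a, b, c, d) => Row(a, b, c, d) }
val df = spark.createDataFrame(rowRDD, schema)
df.show()
```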