Spark (Page 4)

Requirement In this post, we will convert an RDD to a DataFrame in Spark with Scala. Solution Approach 1: Using Schema StructType //Create RDD: val dummyRDD = sc.parallelize(Seq( ("1001", "Ename1", "Designation1", "Manager1"), ("1003", "Ename2", "Designation2", "Manager2"), ("1001", "Ename3", "Designation3", "Manager3") )) val schema = StructType( StructField("empno", StringType, true) :: Read More →
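The excerpt cuts off mid-schema; a minimal self-contained sketch of the StructType approach might look like the following. The field names after empno are assumptions inferred from the tuple contents, not the post's confirmed schema.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// In spark-shell, `spark` and `sc` already exist; created here so the sketch runs standalone.
val spark = SparkSession.builder().master("local[*]").appName("rdd-to-df").getOrCreate()
val sc = spark.sparkContext

val dummyRDD = sc.parallelize(Seq(
  ("1001", "Ename1", "Designation1", "Manager1"),
  ("1003", "Ename2", "Designation2", "Manager2"),
  ("1001", "Ename3", "Designation3", "Manager3")
))

// Field names beyond empno are assumed from the tuple contents.
val schema = StructType(
  StructField("empno", StringType, true) ::
  StructField("ename", StringType, true) ::
  StructField("designation", StringType, true) ::
  StructField("manager", StringType, true) :: Nil
)

// createDataFrame expects an RDD[Row], so map each tuple to a Row first.
val rowRDD = dummyRDD.map { case (a, b, c, d) => Row(a, b, c, d) }
val df = spark.createDataFrame(rowRDD, schema)
```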

Requirement In this post, we are going to create an RDD and then read its content in Spark. Solution //Create RDD: val dummyRDD = sc.parallelize(Seq( ("1001", "Ename1", "Designation1", "Manager1"), ("1003", "Ename2", "Designation2", "Manager2"), ("1001", "Ename3", "Designation3", "Manager3") )) //Read RDD dummyRDD.collect().foreach(println(_)) //Read specific Column: dummyRDD.collect().foreach(data => println(data._1, Read More →
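Putting the excerpt's pieces together, a minimal runnable sketch (assuming a local SparkSession, since spark-shell normally provides `sc`):

```scala
import org.apache.spark.sql.SparkSession

// spark-shell provides `sc`; it is built here so the sketch is self-contained.
val spark = SparkSession.builder().master("local[*]").appName("read-rdd").getOrCreate()
val sc = spark.sparkContext

val dummyRDD = sc.parallelize(Seq(
  ("1001", "Ename1", "Designation1", "Manager1"),
  ("1003", "Ename2", "Designation2", "Manager2"),
  ("1001", "Ename3", "Designation3", "Manager3")
))

// collect() brings all rows to the driver; print each tuple.
dummyRDD.collect().foreach(println(_))

// Print only selected fields of each tuple.
dummyRDD.collect().foreach(data => println((data._1, data._2)))
```

Note that collect() pulls the whole RDD to the driver, so it is only appropriate for small datasets like this one.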

Requirement In this post, we will see how to print RDD content in PySpark. Solution Let's take dummy data. We have 2 rows of employee data with 7 columns. empData = [(7389, "SMITH", "CLEARK", 9902, "2010-12-17", 8000.00, 20), (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)] Load the data into Read More →
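The post itself walks through this in PySpark; as a sketch of the same parallelize-collect-print idea (shown in Scala here to match the other examples on this page), using the two-row dataset from the excerpt:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("print-rdd").getOrCreate()
val sc = spark.sparkContext

// Two rows of employee data with 7 columns, as in the excerpt.
val empData = Seq(
  (7389, "SMITH", "CLEARK", 9902, "2010-12-17", 8000.00, 20),
  (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)
)
val empRDD = sc.parallelize(empData)

// collect() is fine for small data; avoid it on large RDDs.
empRDD.collect().foreach(println)
```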

Requirement The CSV file format is a very common file format used in many applications. Sometimes it contains data with additional behavior, for example a comma within a value, quotes, or multiline values. To handle this additional behavior, Spark provides options you can set while processing the data. Read More →
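To make those options concrete, here is a sketch that writes a tiny CSV with a comma inside a quoted value and reads it back; the option values shown are common settings for this kind of data, not necessarily the exact ones the post uses.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("csv-options").getOrCreate()

// Write a tiny CSV with a comma inside a quoted value, to keep the example self-contained.
val csvPath = Files.createTempFile("emp", ".csv")
Files.write(csvPath, "empno,ename\n1001,\"Smith, John\"\n".getBytes)

val df = spark.read
  .option("header", "true")     // first line holds column names
  .option("quote", "\"")        // quoted values may contain the delimiter
  .option("escape", "\"")       // how an embedded quote character is escaped
  .option("multiLine", "true")  // a quoted value may span multiple lines
  .csv(csvPath.toString)
```

With these options, the embedded comma stays inside the single ename value instead of splitting it into two columns.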

Requirement: You have a sample DataFrame which you want to write to Parquet files using Scala. Solution: Step 1: Sample DataFrame. Use the below command: spark-shell Note: I am using Spark version 2.3. To create a sample DataFrame, please refer to Create-a-spark-dataframe-from-sample-data. After following the above post, you can see that Read More →
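A minimal sketch of the write step, assuming a small inline DataFrame in place of the one from the referenced post (the column names and temp output path here are illustrative):

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("to-parquet").getOrCreate()
import spark.implicits._

// Sample DataFrame; the column names are assumptions for illustration.
val df = Seq(("1001", "Ename1"), ("1002", "Ename2")).toDF("empno", "ename")

// Write as Parquet; "overwrite" replaces the target directory if it exists.
val out = Files.createTempDirectory("emp_parquet").toString
df.write.mode("overwrite").parquet(out)

// Read it back to confirm the round trip.
val back = spark.read.parquet(out)
```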

Requirement In this post, we will learn how to convert a table's schema into a DataFrame in Spark. Sample Data empno ename designation manager hire_date sal deptno location 9369 SMITH CLERK 7902 12/17/1980 800 20 BANGALORE 9499 ALLEN SALESMAN 7698 2/20/1981 1600 30 HYDERABAD 9521 WARD SALESMAN 7698 2/22/1981 Read More →
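One way to get a table's schema as a DataFrame is DESCRIBE, which returns one row per column as (col_name, data_type, comment). This is a sketch, not necessarily the post's approach, and it registers a small stand-in temp view rather than the full emp table:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("schema-df").getOrCreate()
import spark.implicits._

// A small stand-in for the emp table from the post.
Seq((9369, "SMITH", "CLERK"))
  .toDF("empno", "ename", "designation")
  .createOrReplaceTempView("emp")

// DESCRIBE returns the schema itself as a DataFrame: one row per column.
val schemaDF = spark.sql("DESCRIBE emp")
```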

Requirement In this post, we will learn how to handle NULL in a Spark DataFrame. There are multiple ways to handle NULL during data processing. We will see how we can do it in a Spark DataFrame. Solution Create a DataFrame with dummy data val df = spark.createDataFrame(Seq( (1100, "Person1", "Location1", null), (1200, Read More →
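Continuing the dummy data the excerpt starts with (the column names and second row are assumptions), two common ways to deal with NULLs are filling them with a default and dropping rows that contain them:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("null-handling").getOrCreate()

// Dummy data in the shape the excerpt starts with; column names are assumed.
val df = spark.createDataFrame(Seq(
  (1100, "Person1", "Location1", null),
  (1200, "Person2", null, "Contact2")
)).toDF("id", "name", "location", "contact")

// Replace NULLs in string columns with a default value.
val filled = df.na.fill("Unknown")

// Or drop any row that contains a NULL.
val dropped = df.na.drop()
```

Here both rows contain a NULL, so na.drop() leaves the DataFrame empty, while na.fill keeps both rows with "Unknown" substituted.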

Requirement In this post, we will learn how to select a specific column value or all the columns in a Spark DataFrame with different approaches. Sample Data empno ename designation manager hire_date sal deptno location 9369 SMITH CLERK 7902 12/17/1980 800 20 BANGALORE 9499 ALLEN SALESMAN 7698 2/20/1981 1600 30 HYDERABAD Read More →
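A sketch of a few common selection approaches, using a cut-down three-column version of the sample data (the post's exact set of approaches may differ):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("select-cols").getOrCreate()
import spark.implicits._

// A cut-down version of the sample data (three of the eight columns).
val df = Seq((9369, "SMITH", "CLERK"), (9499, "ALLEN", "SALESMAN"))
  .toDF("empno", "ename", "designation")

// Several equivalent ways to select specific columns.
val byName   = df.select("ename")
val byCol    = df.select(col("ename"), col("designation"))
val byDollar = df.select($"empno", $"ename")

// Select all columns.
val all = df.select("*")
```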

Requirement The UDF is a user-defined function. As its name indicates, a user can create a custom function and use it wherever required. We create a UDF when the existing built-in functions are not available or cannot fulfill the requirement. Sample Data empno ename designation manager hire_date sal deptno Read More →
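A minimal sketch of defining and applying a UDF; the upper-casing function and the two-column sample here are stand-ins, not the post's actual example:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("udf-demo").getOrCreate()
import spark.implicits._

// Two columns of the sample data; values lower-cased to give the UDF work to do.
val df = Seq((9369, "smith"), (9499, "allen")).toDF("empno", "ename")

// Wrap an ordinary Scala function as a UDF so it can run on DataFrame columns.
val toUpper = udf((s: String) => s.toUpperCase)

val result = df.withColumn("ename_upper", toUpper($"ename"))
```

Prefer a built-in function (here, functions.upper) when one exists; UDFs are opaque to the optimizer and usually slower.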