dataframe (Page 2)

Requirement: In this post, we will convert an RDD to a DataFrame in PySpark.
Solution: Let’s create dummy data and load it into an RDD. After that, we will convert the RDD to a DataFrame with a defined schema.
# Create RDD
empData = [(7389, "SMITH", "CLEARK", 9902, "2010-12-17", 8000.00, 20),
           (7499, "ALLEN", "SALESMAN",
Read More →

Requirement: In this post, we will see how to print the contents of an RDD in PySpark.
Solution: Let’s take dummy data: 2 rows of employee data with 7 columns.
empData = [(7389, "SMITH", "CLEARK", 9902, "2010-12-17", 8000.00, 20),
           (7499, "ALLEN", "SALESMAN", 9698, "2011-02-20", 9000.00, 30)]
Load the data into
Read More →

Requirement: In this post, we will learn how to convert a table’s schema into a DataFrame in Spark.
Sample Data:
empno ename designation manager hire_date sal deptno location
9369 SMITH CLERK 7902 12/17/1980 800 20 BANGALORE
9499 ALLEN SALESMAN 7698 2/20/1981 1600 30 HYDERABAD
9521 WARD SALESMAN 7698 2/22/1981
Read More →

Requirement: In this post, we will learn how to handle NULL values in a Spark DataFrame. There are multiple ways to handle NULLs during data processing, and we will see how to do so with a Spark DataFrame.
Solution: Create a DataFrame with dummy data.
val df = spark.createDataFrame(Seq( (1100, "Person1", "Location1", null), (1200,
Read More →

Requirement: In this post, we will learn how to select a specific column value or all the columns of a Spark DataFrame using different approaches.
Sample Data:
empno ename designation manager hire_date sal deptno location
9369 SMITH CLERK 7902 12/17/1980 800 20 BANGALORE
9499 ALLEN SALESMAN 7698 2/20/1981 1600 30 HYDERABAD
Read More →

Requirement: Let’s say we are getting data from two different sources (an RDBMS table and a file), and we need to merge the data into a single DataFrame. Both sources have the same schema.
Sample Data
MySQL Table Data:
empno,ename,designation,manager,hire_date,sal,deptno
7369,SMITH,CLERK,7902,12/17/1980,800,20
7499,ALLEN,SALESMAN,7698,2/20/1981,1600,30
7521,WARD,SALESMAN,7698,2/22/1981,1250,30
7566,TURNER,MANAGER,7839,4/2/1981,2975,20
7654,MARTIN,SALESMAN,7698,9/28/1981,1250,30
CSV File Data:
empno,ename,designation,manager,hire_date,sal,deptno
Read More →

Requirement: Let’s take a scenario where we have already loaded data into an RDD or DataFrame, but the row data arrived as columns and the column data as rows. The requirement is to transpose the data, i.e. change rows into columns and columns into rows.
Sample Data: We will use the sample data below.
Read More →
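
The transposition described above can be illustrated without Spark. The following is a minimal plain-Python sketch, not the post's Spark solution. The `rows` values and the `transpose` helper are hypothetical and chosen only for illustration, since the post's actual sample data is not shown in this excerpt.

```python
# Hypothetical sample rows; the original post's sample data is not shown here.
# Each tuple holds one attribute followed by its value for every employee,
# i.e. the data arrived "sideways".
rows = [
    ("empno", 7369, 7499),
    ("ename", "SMITH", "ALLEN"),
    ("sal", 800, 1600),
]


def transpose(table):
    """Turn rows into columns and columns into rows."""
    # zip(*table) groups the i-th element of every row, which is the i-th column.
    return [tuple(column) for column in zip(*table)]


transposed = transpose(rows)
for row in transposed:
    print(row)
```

Transposing twice returns the original layout, which is a quick sanity check for any transpose routine. On a distributed RDD or DataFrame the same idea typically needs an explicit reshuffle of the data, so this sketch only conveys the logic.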