Updated March 30, 2023
Introduction to PySpark Union DataFrame
This article gives an outline of union on PySpark DataFrames. union is a transformation that appends the rows of one data frame to another and returns the result as a new data frame; chaining calls merges more than two data frames. Unlike a join, it does not match rows on keys. It matches columns by position, so the data frames should have the same schema structure, and Spark returns an error when the number of columns differs. The new data frame contains all the rows from the data frames used, and duplicate rows are kept while merging.
Syntax of PySpark Union DataFrame
The syntax is given below:
b1 = spark.createDataFrame(a1)
b = spark.createDataFrame(a)
d = b.union(b1)
The union function returns a new data frame containing the rows of both data frames.
Working of Union DataFrame in PySpark
The following points describe how union works on data frames in PySpark:
- The PySpark union function is a transformation: it appends the rows of one data frame to the rows of another and returns the result as a new data frame, leaving the inputs unchanged.
- Columns are matched by position rather than by name, so the data frames should share the same schema, with columns in the same order. Chaining union calls combines two or more data frames into a single data frame.
- Spark checks the column layout before merging; if the number of columns differs, the union fails with an analysis error instead of merging the data. Because union keeps duplicate rows, it never has to compare rows across partitions, so no shuffle happens during a union.
Example of PySpark Union DataFrame
The example below walks through a union step by step.
First, let’s start by creating a sample data frame in PySpark.
data1 = [{'Name':'Jhon','Sal':25000,'Add':'USA'},{'Name':'Joe','Sal':30000,'Add':'USA'},{'Name':'Tina','Sal':22000,'Add':'IND'},{'Name':'Jhon','Sal':15000,'Add':'USA'}]
The data contains Name, Salary and Address that will be used as sample data for Data frame creation.
a = sc.parallelize(data1)
sc.parallelize creates an RDD from the given data.
b = spark.createDataFrame(a)
Once the RDD exists, the createDataFrame method converts it into a data frame.
This is how the Data Frame looks.
b.show()
Now create a second data frame in the same way.
data2 = [{'Name':'Jack','Sal':35333,'Add':'USA'},{'Name':'Jin','Sal':50000,'Add':'IND'},{'Name':'Tina','Sal':22050,'Add':'IND'},{'Name':'Jhon','Sal':15000,'Add':'USA'}]
The second data set has the same Name, Sal and Add fields as the first.
a1 = sc.parallelize(data2)
sc.parallelize again creates an RDD, this time from the second data set.
b1 = spark.createDataFrame(a1)
b1.show()
Now let's apply the union operation and see what data we get back.
Code:
d = b.union(b1)
d.show()
Here we can see that the union operation appends the rows of the second data frame to the first and returns the data as a new data frame. The result can contain duplicate rows, which can be removed afterwards with the dropDuplicates function or with distinct in PySpark.
Conclusion
In this article, we saw how union works on data frames in PySpark: it appends the rows of data frames that share a schema, keeps duplicate rows, and returns a new data frame. The syntax and the example showed how the function is used in practice and how the duplicates it can produce are removed.