Introduction to PySpark GroupBy Sum
The following article provides an outline for PySpark GroupBy Sum. PySpark GroupBy is a grouping function in the PySpark data model that uses column values to group rows together: data is grouped based on some columnar condition and then aggregated to produce the final result. Combined with the aggregate function sum, groupBy groups the data on a column value and sums another column within each group. The result is the per-group sum of the chosen column, making this an important PySpark function for the summations needed in data analysis.
Syntax of PySpark GroupBy Sum
Given below is the syntax:
Df2 = b.groupBy("Name").sum("Sal")
- b: The PySpark data frame on which the operation is performed.
- groupBy(): The grouping function, called with the aggregate function sum(); the column to be summed is passed to sum() as a parameter.
- Df2: The new data frame that holds the grouped and summed result.
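As a minimal sketch, assuming a data frame b with Name and Sal columns already exists, the whole pattern looks like this:

Code:

Df2 = b.groupBy("Name").sum("Sal")  # group rows by Name, sum Sal per group
Df2.show()                          # Df2 has the columns Name and sum(Sal)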
Working of Sum with GroupBy in PySpark
- The groupBy function follows a key-value model that operates over the PySpark RDD/data frame (a short RDD sketch follows this list). Rows that share the same key are shuffled across partitions and brought together, so each group ends up on a single partition.
- The shuffle moves the data needed for grouping: elements with the same key are collected together, and the aggregate function sum then adds up the grouped column values; the result is returned.
- In short, the function sums all the column data within each group and returns one row per group.
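To see the key-value idea at the RDD level, here is a rough sketch of what the shuffle-and-sum amounts to, assuming rdd holds dictionaries such as {'Name': 'Jhon', 'Sal': 25000} (rdd is an illustrative name, not part of the examples below):

Code:

# Map each row to a (key, value) pair, then add up values that share a key.
# reduceByKey shuffles equal keys onto the same partition before reducing.
sums = rdd.map(lambda row: (row['Name'], row['Sal'])).reduceByKey(lambda x, y: x + y)
sums.collect()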
Examples of PySpark GroupBy Sum
Different examples are mentioned below:
Let’s start by creating a sample data frame in PySpark.
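The snippets below assume a running PySpark environment in which spark and sc are already defined, as they are in the PySpark shell; a minimal setup sketch otherwise:

Code:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession and take its SparkContext.
spark = SparkSession.builder.appName("GroupBySumExample").getOrCreate()
sc = spark.sparkContext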
Code:
data1 = [{'Name':'Jhon','Sal':25000,'Add':'USA'},{'Name':'Joe','Sal':30000,'Add':'USA'},{'Name':'Tina','Sal':22000,'Add':'IND'},{'Name':'Jhon','Sal':15000,'Add':'USA'}]
The data contains Name, Salary (Sal), and Address (Add) fields that will be used as sample data for data frame creation.
Code:
a = sc.parallelize(data1)
sc.parallelize is used to create an RDD from the given data.
Code:
b = spark.createDataFrame(a)
After creating the RDD, we use the createDataFrame method to convert it into a data frame.
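As a side note, the RDD step is not strictly required; spark.createDataFrame can also take the list of dictionaries directly (some Spark versions print a deprecation warning when inferring a schema from dicts):

Code:

b = spark.createDataFrame(data1)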
This is how the Data Frame looks.
Code:
b.show()
Output (column order follows the alphabetically sorted dictionary keys):

+---+----+-----+
|Add|Name|  Sal|
+---+----+-----+
|USA|Jhon|25000|
|USA| Joe|30000|
|IND|Tina|22000|
|USA|Jhon|15000|
+---+----+-----+
Let’s apply the Group By function with an aggregate function sum over it.
Code:
b.groupBy("Name")
Output:

<pyspark.sql.group.GroupedData object at 0x...>

This returns a pyspark.sql.group.GroupedData object: the rows are grouped by Name, but nothing is computed until an aggregate function is applied to it.
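Any aggregate function can now be applied to this GroupedData object. For instance, the same sum can be written with agg, which also lets us rename the result column (the total_sal alias is just an illustrative name):

Code:

from pyspark.sql import functions as F

# Equivalent to sum("Sal"), but with a friendlier column name.
b.groupBy("Name").agg(F.sum("Sal").alias("total_sal")).show()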
We will use the aggregate function sum to sum the salary column, grouped by the Name column.
Code:
b.groupBy("Name").sum("Sal").show()
This will return the sum of the salary column grouped by the Name column.
The two salary entries for Jhon are grouped together, and their sum, 40000, is returned.
Output (row order may vary):

+----+--------+
|Name|sum(Sal)|
+----+--------+
|Jhon|   40000|
| Joe|   30000|
|Tina|   22000|
+----+--------+
Grouping can also be done over other columns in PySpark, either a single column or multiple columns.
Code:
b.groupBy("Add").sum().show()
This groups the data by the Add column and, since sum() is called without arguments, returns the sum of every numeric column for each group.
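Given the sample data above, the result would look roughly like this (row order may vary):

Output:

+---+--------+
|Add|sum(Sal)|
+---+--------+
|USA|   70000|
|IND|   22000|
+---+--------+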
Grouping and summing can also be done using multiple columns.
Code:
b.groupBy("Add","Name").sum().show()
Output (row order may vary):

+---+----+--------+
|Add|Name|sum(Sal)|
+---+----+--------+
|USA|Jhon|   40000|
|USA| Joe|   30000|
|IND|Tina|   22000|
+---+----+--------+
Conclusion
From the above article, we saw the working of GroupBy Sum in PySpark. From various examples and classifications, we saw how GroupBy Sum is used in PySpark and what its use is at the programming level. The various methods used showed how it eases the pattern for data analysis and provides a cost-efficient model for the same. We also saw the internal working and the advantages of GroupBy Sum in the PySpark data frame and its usage for various programming purposes. The syntax and examples also helped us understand the function much more precisely.
Recommended Articles
This is a guide to PySpark GroupBy Sum. Here we discuss the introduction, working of sum with GroupBy in PySpark, and examples. You may also have a look at the following articles to learn more –