Introduction to PySpark GroupBy Sum
The following article provides an outline for PySpark GroupBy Sum. PySpark GroupBy is a grouping function in the PySpark data model that uses column values to group rows together: data is grouped based on some columnar condition and then aggregated to produce the final result. Combined with the aggregate function sum, groupBy groups the data on a column value and sums another column within each group. The result is the per-group sum of the chosen column, making this an important PySpark function for the summations needed in data analysis.
Syntax of PySpark GroupBy Sum
Given below is the syntax:
Df2 = b.groupBy("Name").sum("Sal")
- b: The PySpark data frame on which the operation is performed.
- groupBy(): The grouping function, called with the aggregate function sum(); the column to be summed is passed to sum() as a parameter.
- Df2: The new data frame that holds the grouped and summed result.
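As a minimal sketch, assuming a data frame b with Name and Sal columns already exists, the whole pattern looks like this:

Code:

Df2 = b.groupBy("Name").sum("Sal")  # group rows by Name, sum Sal per group
Df2.show()                          # Df2 has the columns Name and sum(Sal)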
Working of Sum with GroupBy in PySpark
- The groupBy function follows a key-value model that operates over the PySpark RDD/data frame (a short RDD sketch follows this list). Rows that share the same key are shuffled across partitions and brought together, so each group ends up on a single partition.
- The shuffle moves the data needed for grouping: elements with the same key are collected together, and the aggregate function sum then adds up the grouped column values; the result is returned.
- In short, the function sums all the column data within each group and returns one row per group.
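To see the key-value idea at the RDD level, here is a rough sketch of what the shuffle-and-sum amounts to, assuming rdd holds dictionaries such as {'Name': 'Jhon', 'Sal': 25000} (rdd is an illustrative name, not part of the examples below):

Code:

# Map each row to a (key, value) pair, then add up values that share a key.
# reduceByKey shuffles equal keys onto the same partition before reducing.
sums = rdd.map(lambda row: (row['Name'], row['Sal'])).reduceByKey(lambda x, y: x + y)
sums.collect()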
Examples of PySpark GroupBy Sum
Different examples are mentioned below:
Let’s start by creating a sample data frame in PySpark.
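The snippets below assume a running PySpark environment in which spark and sc are already defined, as they are in the PySpark shell; a minimal setup sketch otherwise:

Code:

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession and take its SparkContext.
spark = SparkSession.builder.appName("GroupBySumExample").getOrCreate()
sc = spark.sparkContext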
Code:
data1 = [{'Name':'Jhon','Sal':25000,'Add':'USA'},{'Name':'Joe','Sal':30000,'Add':'USA'},{'Name':'Tina','Sal':22000,'Add':'IND'},{'Name':'Jhon','Sal':15000,'Add':'USA'}]
The data contains Name, Salary (Sal), and Address (Add) fields that will be used as sample data for data frame creation.
Code:
a = sc.parallelize(data1)
sc.parallelize is used to create an RDD from the given data.
Code:
b = spark.createDataFrame(a)
After creating the RDD, we use the createDataFrame method to convert it into a data frame.
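As a side note, the RDD step is not strictly required; spark.createDataFrame can also take the list of dictionaries directly (some Spark versions print a deprecation warning when inferring a schema from dicts):

Code:

b = spark.createDataFrame(data1)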
This is how the Data Frame looks.
Code:
b.show()
Output (column order follows the alphabetically sorted dictionary keys):

+---+----+-----+
|Add|Name|  Sal|
+---+----+-----+
|USA|Jhon|25000|
|USA| Joe|30000|
|IND|Tina|22000|
|USA|Jhon|15000|
+---+----+-----+
Let’s apply the Group By function with an aggregate function sum over it.
Code:
b.groupBy("Name")
Output:

<pyspark.sql.group.GroupedData object at 0x...>

This returns a pyspark.sql.group.GroupedData object: the rows are grouped by Name, but nothing is computed until an aggregate function is applied to it.
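Any aggregate function can now be applied to this GroupedData object. For instance, the same sum can be written with agg, which also lets us rename the result column (the total_sal alias is just an illustrative name):

Code:

from pyspark.sql import functions as F

# Equivalent to sum("Sal"), but with a friendlier column name.
b.groupBy("Name").agg(F.sum("Sal").alias("total_sal")).show()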
We will use the aggregate function sum to sum the salary column, grouped by the Name column.
Code:
b.groupBy("Name").sum("Sal").show()
This will return the sum of the salary column grouped by the Name column.
The two salary entries for Jhon are grouped together, and their sum, 40000, is returned.
Output (row order may vary):

+----+--------+
|Name|sum(Sal)|
+----+--------+
|Jhon|   40000|
| Joe|   30000|
|Tina|   22000|
+----+--------+
Grouping can also be done over other columns in PySpark, either a single column or multiple columns.
Code:
b.groupBy("Add").sum().show()
This groups the data by the Add column and, since sum() is called without arguments, returns the sum of every numeric column for each group.
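Given the sample data above, the result would look roughly like this (row order may vary):

Output:

+---+--------+
|Add|sum(Sal)|
+---+--------+
|USA|   70000|
|IND|   22000|
+---+--------+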
Grouping and summing can also be done using multiple columns.
Code:
b.groupBy("Add","Name").sum().show()
Output (row order may vary):

+---+----+--------+
|Add|Name|sum(Sal)|
+---+----+--------+
|USA|Jhon|   40000|
|USA| Joe|   30000|
|IND|Tina|   22000|
+---+----+--------+
Conclusion
From the above article, we saw the working of GroupBy Sum in PySpark. From various examples and classifications, we saw how GroupBy Sum is used in PySpark and what its use is at the programming level. The various methods used showed how it eases the pattern for data analysis and provides a cost-efficient model for the same. We also saw the internal working and the advantages of GroupBy Sum in the PySpark data frame and its usage for various programming purposes. The syntax and examples also helped us understand the function much more precisely.
Recommended Articles
This is a guide to PySpark GroupBy Sum. Here we discuss the introduction, working of sum with GroupBy in PySpark, and examples. You may also have a look at the following articles to learn more –