Introduction to PySpark GroupBy Count
PySpark GroupBy Count is an operation in PySpark that groups rows together based on some columnar value and counts the number of rows in each group. The groupBy count combination counts the grouped data, which is grouped based on some condition, and the final count of the aggregated data is shown as the result. In simple words, groupBy count groups the rows of a Spark DataFrame by their values in a column and counts the rows that fall into each group.
Identical data are arranged in groups, and the data is shuffled accordingly based on partition and condition. Advanced aggregation of data over multiple columns is also supported by PySpark groupBy. Performing groupBy over a DataFrame returns a GroupedData object, on which aggregate functions such as count can be applied, as the short sketch below illustrates.
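To make the return type concrete, here is a minimal sketch (assuming a local SparkSession; the column name Name and the sample values are illustrative only):

from pyspark.sql import SparkSession

# Minimal sketch: build a session and a tiny DataFrame (sample values are illustrative)
spark = SparkSession.builder.appName("groupby-count-demo").getOrCreate()
df = spark.createDataFrame([("SAM",), ("JOHN",), ("SAM",)], ["Name"])

grouped = df.groupBy("Name")   # returns a GroupedData object, not a DataFrame
print(type(grouped))           # <class 'pyspark.sql.group.GroupedData'>
grouped.count().show()         # count() aggregates it back into a DataFrame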
Syntax:
The syntax for the PySpark groupBy count function is:
df.groupBy('columnName').count().show()
- df: The PySpark DataFrame
- columnName: The column name on which the groupBy operation is to be performed.
- count(): Counts the total number of rows in each group after groupBy.
a.groupby("Name").count().show()
Output:
How does GroupBy Count work in PySpark?
Let us see how the GROUPBY COUNT function works in PySpark:
The groupBy function groups data together based on the same key value; it operates on RDDs and DataFrames in a PySpark application.
Data having the same key are shuffled together and brought to a single place where they can be grouped. The shuffle happens over the entire network, and this makes the operation a bit costlier; it is also visible in the physical plan, as the sketch below shows.
Rows with the same key are clubbed together, and the value is returned based on the condition.
The count function then counts the grouped data and displays the result.
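The shuffle can be observed in the query plan. A quick sketch, assuming the same interactive session with a SparkSession named spark:

# Sketch: inspect the shuffle that groupBy introduces
df = spark.createDataFrame([("SAM",), ("JOHN",), ("SAM",)], ["Name"])
df.groupBy("Name").count().explain()
# The physical plan typically contains an Exchange (hashpartitioning) step;
# this is the network shuffle that makes the grouping comparatively costly.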
groupBy can be used to group multiple columns together by passing multiple column names. It returns a single row for each combination that is grouped together, and an aggregate function is used to compute the value from the grouped data, as the sketch after this paragraph illustrates.
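A short sketch of multi-column grouping combined with other aggregates (the columns and values mirror the examples later in this article; F.count and F.max are standard pyspark.sql.functions):

from pyspark.sql import functions as F

# Sketch: group on two columns and compute several aggregates at once
df = spark.createDataFrame(
    [("USA", "Jhon", 2), ("USA", "Joe", 3), ("IND", "Tina", 2)],
    ["Add", "Name", "ID"],
)
df.groupBy("Add", "Name").agg(
    F.count("*").alias("count"),   # number of rows per (Add, Name) combination
    F.max("ID").alias("max_id"),   # any other aggregate can be mixed in
).show()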
Examples
Let us see some examples of how the PySpark groupBy count function works:
Example #1
Let’s start by creating a simple DataFrame on which we want to use the groupBy operation.
Creation of DataFrame:
a = spark.createDataFrame(["SAM","JOHN","AND","ROBIN","ANAND","ANAND"], "string").toDF("Name")
Let’s start with a simple groupBy call that groups the names in the DataFrame.
a.groupby("Name")
This groups the elements by Name; elements with the same key are placed in the same group. Note that groupBy alone returns a GroupedData object, so an aggregation must be applied before a result can be displayed.
We can then count the number of elements in each group using the count() function.
a = spark.createDataFrame(["SAM","JOHN","AND","ROBIN","ANAND","ANAND"], "string").toDF("Name")
a.groupby("Name").count().show()
Output:
Example #2
Let’s try to understand this more precisely by creating a DataFrame with more than one column and using the count function on it.
data1 = [{'Name':'Jhon','ID':2,'Add':'USA'},{'Name':'Joe','ID':3,'Add':'USA'},{'Name':'Tina','ID':2,'Add':'IND'}]
A sample dataset is created with Name, ID, and Add as the fields.
a = sc.parallelize(data1)
An RDD is created using sc.parallelize.
b = spark.createDataFrame(a)
A DataFrame is created using spark.createDataFrame.
b.groupBy("Add").count().show()
The count function is used to find the number of records after the groupBy; this counts the number of elements in each group. An equivalent form using agg is sketched after the output below.
Output:
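The same count can also be expressed through agg with pyspark.sql.functions.count, which is useful when several aggregates are needed at once; a sketch reusing the DataFrame b from above:

from pyspark.sql import functions as F

# Equivalent to b.groupBy("Add").count(), but written via agg()
b.groupBy("Add").agg(F.count("*").alias("count")).show()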
Example #3
Let us check some more examples of groupBy count. We can use groupBy over multiple columns of a DataFrame.
data1 = [{'Name':'Jhon','ID':2,'Add':'USA'},{'Name':'Joe','ID':3,'Add':'USA'},{'Name':'Tina','ID':2,'Add':'IND'},{'Name':'Jhon','ID':2,'Add':'IND'},{'Name':'Tom','ID':2,'Add':'IND'}]
a = sc.parallelize(data1)
b = spark.createDataFrame(a)
b.groupBy("Add","Name").count().show()
This groups the elements based on multiple columns and then counts the records for each combination; the result can also be sorted, as the sketch below shows.
Output:
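Grouped counts are often easier to read when sorted. A sketch reusing b; orderBy is the standard DataFrame sort method:

# Sketch: sort the grouped counts in descending order
b.groupBy("Add", "Name").count().orderBy("count", ascending=False).show()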
Group By With Single Column:
b.groupBy("Add").count().show()
Output:
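Related to plain counting, countDistinct from pyspark.sql.functions counts only distinct values within each group; a sketch reusing b:

from pyspark.sql import functions as F

# Sketch: count distinct names per address instead of raw rows
b.groupBy("Add").agg(F.countDistinct("Name").alias("distinct_names")).show()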
Group by another column and count the elements using the count function:
b.groupBy("Name").count().show()
Output:
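Because count() returns an ordinary DataFrame, the counted result can be filtered further, similar to SQL’s HAVING clause; a sketch reusing b:

from pyspark.sql.functions import col

# Sketch: keep only names that occur more than once (HAVING-style filter)
b.groupBy("Name").count().filter(col("count") > 1).show()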
These are some of the examples of the groupBy count function in PySpark.
Conclusion
From the above article, we saw the use of the groupBy count operation in PySpark. From various examples and classifications, we tried to understand how the GROUPBY COUNT method works in PySpark and how it is used at the programming level.
We also saw the internal working and the advantages of groupBy count on a Spark DataFrame and its usage for various programming purposes. The syntax and examples also helped us understand the function more precisely.
Recommended Articles
We hope that this EDUCBA information on “PySpark GroupBy Count” was beneficial to you. You can view EDUCBA’s recommended articles for more information.