Updated June 20, 2023

Introduction to Pandas Find Duplicates

Real-world data is rarely clean. It typically contains outliers, duplicates, missing values, and other problems. An often-quoted estimate in the data science world is that data scientists and data analysts spend about 80% of their time cleaning and preparing data before it reaches a machine learning algorithm. This article covers one common problem: how to find and remove duplicate values or records in a pandas dataframe. The pandas module in Python provides built-in functions for this: dataframe.duplicated() finds duplicate values and dataframe.drop_duplicates() drops them. The following sections discuss both functions in detail.

Syntax and Parameters

The basic syntax for the dataframe.duplicated() function is as follows:

dataframe.duplicated(subset = 'column_name', keep = {'first', 'last', False})

The parameters used in the above-mentioned function are as follows:

dataframe: The dataframe in which duplicate values are to be found.
subset: The column label, or list of column labels, on which duplicates are identified. By default, all columns are compared.
keep: Which occurrence, if any, is left unmarked. 'first' (the default) marks every occurrence except the first as a duplicate, 'last' marks every occurrence except the last, and False marks all occurrences as duplicates.
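
As a rough illustration of how keep changes the result, the sketch below runs duplicated() on a small made-up frame. The frame toy and its city column are hypothetical and exist only for this example.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the keep options
toy = pd.DataFrame({'city': ['Pune', 'Oslo', 'Pune', 'Pune']})

first_mask = toy.duplicated(subset='city', keep='first')  # the default
last_mask = toy.duplicated(subset='city', keep='last')
all_mask = toy.duplicated(subset='city', keep=False)

print(first_mask.tolist())  # [False, False, True, True]
print(last_mask.tolist())   # [True, False, True, False]
print(all_mask.tolist())    # [True, False, True, True]
```

Note that False here is the Python boolean, not the string 'false'.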

Both the subset and keep arguments are optional. Having understood the dataframe.duplicated() function to find duplicate records, let us discuss dataframe.drop_duplicates() to remove duplicate values in the dataframe.

The basic syntax for the dataframe.drop_duplicates() function is similar to that of the duplicated() function. It can be written as follows:

dataframe.drop_duplicates(subset = 'column_name', keep = {'first', 'last', False}, inplace = {True, False})

inplace: Determines whether the changes are made in the original data frame (True) or a new data frame is returned and the original is left untouched (False, the default).
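
The sketch below shows one way these arguments might be combined. The frame toy and its columns city and visits are made up for illustration and are not part of the article's data.

```python
import pandas as pd

# Hypothetical toy frame: two rows share the same city
toy = pd.DataFrame({'city': ['Pune', 'Oslo', 'Pune'],
                    'visits': [1, 5, 3]})

# Keep only the last row seen for each city; toy itself is unchanged
latest = toy.drop_duplicates(subset='city', keep='last')
print(latest['visits'].tolist())  # [5, 3]
print(len(toy))                   # 3

# With inplace=True the frame is modified and the call returns None
result = toy.drop_duplicates(subset='city', inplace=True)
print(result)    # None
print(len(toy))  # 2
```

Because an inplace call returns None, assigning its result back to the same variable would discard the data frame.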

Examples of Pandas Find Duplicates

We have now discussed the syntax and arguments of the functions for dealing with duplicate records in pandas. No learning is complete without some practical examples, so let's try a few based on these functions. To do that, we must first create a dataframe with duplicate records. You may use the following data frame for the purpose.

Code:

#importing pandas
import pandas as pd
#input data
data = {'Country': ['India','India','USA','USA','UK','Germany','India','Germany', 'USA', 'China', 'Japan'],
        'Personality': ['Sachin Tendulkar','Sania Mirza','Serena Williams','Venus Williams',
                        'Morgan Freeman','Michael Schumacher','Priyanka Chopra','Michael Schumacher',
                        'Serena Williams','Jack Ma','Sakamoto Ryoma']
}
#create a dataframe from the data
df = pd.DataFrame(data, columns = ['Country','Personality'])
#print dataframe
df

Running the given code snippet creates and displays a data frame called 'df' with 11 records.

Duplicate Values of Data Frame

The data frame contains a few duplicate values. For example, the records for Michael Schumacher and Serena Williams each appear twice, and several countries are repeated.

1. Finding Duplicate Values in the Entire Dataset

To find duplicate records in pandas, we use the df.duplicated() function. It returns a Series of boolean values indicating whether each record is a duplicate.

df.duplicated()

By default, the entire record is compared and keep is 'first', so the first occurrence of a record is marked False and every subsequent occurrence is marked True.
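
The boolean Series can also be used as a mask to look at the duplicated rows themselves. A minimal sketch, rebuilding the example data so that it runs on its own (the names mask and dup_rows are ours):

```python
import pandas as pd

# Same data as the example data frame above
df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA', 'UK', 'Germany',
                'India', 'Germany', 'USA', 'China', 'Japan'],
    'Personality': ['Sachin Tendulkar', 'Sania Mirza', 'Serena Williams',
                    'Venus Williams', 'Morgan Freeman', 'Michael Schumacher',
                    'Priyanka Chopra', 'Michael Schumacher', 'Serena Williams',
                    'Jack Ma', 'Sakamoto Ryoma'],
})

mask = df.duplicated()   # True for each repeat of an earlier record
dup_rows = df[mask]      # boolean indexing keeps only the flagged rows

print(dup_rows.index.tolist())           # [7, 8]
print(dup_rows['Personality'].tolist())  # ['Michael Schumacher', 'Serena Williams']
```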

2. Finding Duplicates in a Specific Column

In the previous example, we used the duplicated() function without any arguments. Here, we use the function with the subset argument to find duplicate values in the Country column.

df.duplicated(subset = 'Country')
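
To make the effect of subset concrete, the sketch below lists which index positions are flagged when only Country is compared. It also passes a list of labels, which as far as we know subset accepts as well. The example data is rebuilt so the snippet is self-contained, and the variable names are ours.

```python
import pandas as pd

# Same data as the example data frame above
df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA', 'UK', 'Germany',
                'India', 'Germany', 'USA', 'China', 'Japan'],
    'Personality': ['Sachin Tendulkar', 'Sania Mirza', 'Serena Williams',
                    'Venus Williams', 'Morgan Freeman', 'Michael Schumacher',
                    'Priyanka Chopra', 'Michael Schumacher', 'Serena Williams',
                    'Jack Ma', 'Sakamoto Ryoma'],
})

# Only the Country column is compared: any repeated country is flagged
by_country = df.index[df.duplicated(subset='Country')].tolist()
print(by_country)  # [1, 3, 6, 7, 8]

# A list of labels compares those columns together; naming every
# column gives the same result as calling duplicated() with no arguments
by_both = df.duplicated(subset=['Country', 'Personality'])
print(by_both.equals(df.duplicated()))  # True
```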

3. Finding in a Specific Column and Marking Last Occurrence as Not Duplicate

df.duplicated(subset = 'Country', keep = 'last')
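
With keep = 'last', the final occurrence of each country is the one left unmarked, so the earlier occurrences are flagged instead. The sketch below compares this with keep = False, which flags every occurrence of a repeated country. Only the Country column is rebuilt here, since it is the only one compared.

```python
import pandas as pd

# Country column of the example data frame above
df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA', 'UK', 'Germany',
                'India', 'Germany', 'USA', 'China', 'Japan'],
})

keep_last = df.index[df.duplicated(subset='Country', keep='last')].tolist()
keep_none = df.index[df.duplicated(subset='Country', keep=False)].tolist()

print(keep_last)  # [0, 1, 2, 3, 5]
print(keep_none)  # [0, 1, 2, 3, 5, 6, 7, 8]
```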

4. Finding the Count of Duplicate Records in the Entire Dataset

To find the total number of duplicate records, we can perform a sum operation on the results obtained from the duplicated() function, as shown below.

df.duplicated().sum()
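
The sum works because each True counts as 1 and each False as 0. A short sketch on the example data, with a cross-check against drop_duplicates() (the names n_dups and n_unique are ours):

```python
import pandas as pd

# Same data as the example data frame above
df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA', 'UK', 'Germany',
                'India', 'Germany', 'USA', 'China', 'Japan'],
    'Personality': ['Sachin Tendulkar', 'Sania Mirza', 'Serena Williams',
                    'Venus Williams', 'Morgan Freeman', 'Michael Schumacher',
                    'Priyanka Chopra', 'Michael Schumacher', 'Serena Williams',
                    'Jack Ma', 'Sakamoto Ryoma'],
})

n_dups = int(df.duplicated().sum())  # number of repeated records
n_unique = len(df) - n_dups          # records left after removing repeats

print(n_dups)    # 2
print(n_unique)  # 9
print(len(df.drop_duplicates()) == n_unique)  # True
```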

5. Finding the Count of Duplicate Values in a Specific Column

df.duplicated(subset='Country').sum()
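
The count above is a single total. If a per-value breakdown is wanted, one possible approach (not part of the original walkthrough) is value_counts(), as sketched below on the Country column of the example data.

```python
import pandas as pd

# Country column of the example data frame above
df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA', 'UK', 'Germany',
                'India', 'Germany', 'USA', 'China', 'Japan'],
})

dup_count = int(df.duplicated(subset='Country').sum())
print(dup_count)  # 5

# How often each country appears, restricted to those appearing more than once
counts = df['Country'].value_counts()
repeated = counts[counts > 1]
print(repeated.to_dict())  # India: 3, USA: 3, Germany: 2 (key order may vary)

# Each repeated country contributes (count - 1) duplicates to the total
print(int((repeated - 1).sum()) == dup_count)  # True
```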

6. Removing Duplicate Records in the Dataset

df.drop_duplicates(keep = 'first')

The function returns a data frame without the records at index 7 and 8, as they duplicate earlier records. Note that the drop_duplicates() function does not make inplace changes by default.

df

The original data frame is unchanged and still contains the duplicate records. To save the changes to the original dataframe, we have to use the inplace argument, as shown in the next example.

7. Removing Duplicate Records in the Dataset Inplace

df.drop_duplicates(keep = 'first', inplace = True)
df
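
After the inplace drop, the remaining rows keep their original index labels, so the index has gaps where the duplicates were. If a continuous index is preferred, one option is reset_index(drop = True). This step is our addition, not part of the original walkthrough.

```python
import pandas as pd

# Same data as the example data frame above
df = pd.DataFrame({
    'Country': ['India', 'India', 'USA', 'USA', 'UK', 'Germany',
                'India', 'Germany', 'USA', 'China', 'Japan'],
    'Personality': ['Sachin Tendulkar', 'Sania Mirza', 'Serena Williams',
                    'Venus Williams', 'Morgan Freeman', 'Michael Schumacher',
                    'Priyanka Chopra', 'Michael Schumacher', 'Serena Williams',
                    'Jack Ma', 'Sakamoto Ryoma'],
})

df.drop_duplicates(keep='first', inplace=True)
index_after_drop = df.index.tolist()
print(index_after_drop)  # [0, 1, 2, 3, 4, 5, 6, 9, 10]

# Renumber the rows from 0 and discard the old index labels
df = df.reset_index(drop=True)
print(df.index.tolist())  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```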

Conclusion

Finding and removing duplicate values can seem daunting for large datasets. But pandas has made it easy by providing built-in functions such as dataframe.duplicated() to find duplicate values and dataframe.drop_duplicates() to remove them.
