Pandas DataFrame duplicated() Method – Explained

When working with data in Python, ensuring the integrity and quality of your dataset is crucial. One common issue is the presence of duplicate entries, which can skew analyses and lead to misleading results. The duplicated() method in Pandas is a powerful tool for identifying these duplicate rows in your DataFrame. In this blog, we’ll delve into the functionality of the duplicated() method, its various parameters, and practical examples to help you master its use.

What is the duplicated() Method?

The duplicated() method in Pandas returns a Boolean Series indicating whether each row is a duplicate of a previous row or not. It provides a quick way to identify and handle duplicate entries within your DataFrame.

Syntax
Python
DataFrame.duplicated(subset=None, keep='first')

Parameters
  • subset: This parameter allows you to specify a subset of columns to consider when identifying duplicates. By default, all columns are considered.
  • keep: Determines which duplicates (if any) to mark as True.
    • 'first': Mark duplicates as True except for the first occurrence. This is the default behavior.
    • 'last': Mark duplicates as True except for the last occurrence.
    • False: Mark all duplicates as True.
Returns

A Series of Boolean values, where True indicates that a row duplicates another row; which occurrence gets flagged depends on the keep setting.


Examples

Let’s explore some practical examples to understand how duplicated() works.

Example 1: Basic Usage
Python
import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [25, 30, 35, 25, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
}

df = pd.DataFrame(data)
print("Original DataFrame:\n", df)

# Identify duplicate rows
duplicates = df.duplicated()
print("\nDuplicated rows:\n", duplicates)

Output:

Bash
Original DataFrame:
       Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
3    Alice   25     New York
4      Bob   30  Los Angeles

Duplicated rows:
0    False
1    False
2    False
3    True
4    True
dtype: bool

In this example, the duplicated() method identifies rows 3 and 4 as duplicates of rows 0 and 1, respectively.
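
The Boolean Series is most useful as a filter: passing it back to the DataFrame with boolean indexing returns the duplicate rows themselves rather than just the flags. A quick sketch using the DataFrame above:

Python
# Show only the rows flagged as duplicates
print(df[df.duplicated()])

For this DataFrame, that prints rows 3 and 4 (the repeated Alice and Bob entries).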


Example 2: Using the ‘keep’ Parameter
Python
# Creating a DataFrame with duplicate values
df = pd.DataFrame({
    'A': [1, 2, 2, 3, 4, 4, 4],
    'B': ['a', 'b', 'b', 'c', 'd', 'd', 'd']
})

# Mark duplicates except for the last occurrence
duplicates_last = df.duplicated(keep='last')
print("\nDuplicated rows (keep='last'):\n", duplicates_last)

# Mark all duplicates
duplicates_all = df.duplicated(keep=False)
print("\nDuplicated rows (keep=False):\n", duplicates_all)

Output:

Bash
Duplicated rows (keep='last'):
0    False
1     True
2    False
3    False
4     True
5     True
6    False
dtype: bool

Duplicated rows (keep=False):
0    False
1     True
2     True
3    False
4     True
5     True
6     True
dtype: bool

Here, setting keep='last' marks every occurrence except the last as a duplicate: the last occurrence of each repeated row is treated as the one to keep, and all earlier occurrences are flagged as True.

On the other hand, setting keep=False marks all duplicate entries: both the first occurrence and every repeat are set to True, and only rows that appear exactly once in the DataFrame remain False.
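
Because keep=False flags every member of a duplicated group, it pairs well with boolean indexing when you want to audit the groups in full, originals included. A short sketch using the DataFrame from this example:

Python
# Inspect every row that belongs to a duplicated group
print(df[df.duplicated(keep=False)])

Here this prints rows 1, 2, 4, 5, and 6, that is, every row whose values occur more than once.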


Example 3: Using the ‘subset’ Parameter
Python
# Re-create the DataFrame from Example 1, which contains the 'Name' column
df = pd.DataFrame(data)

# Identify duplicates based on the 'Name' column only
duplicates_subset = df.duplicated(subset=['Name'])
print("\nDuplicated rows based on 'Name' column:\n", duplicates_subset)

Output:

Bash
Duplicated rows based on 'Name' column:
0    False
1    False
2    False
3    True
4    True
dtype: bool

In this case, duplicates are identified based solely on the ‘Name’ column, ignoring other columns.
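
subset also accepts a list of several columns, in which case a row is flagged only when all of the listed columns repeat. A small illustrative variation on the example above:

Python
# Treat rows as duplicates only when both 'Name' and 'City' repeat together
print(df.duplicated(subset=['Name', 'City']))

With this data the result is the same (rows 3 and 4 are True), because those rows repeat both columns, but on a larger dataset the two calls can differ.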


Practical Applications

1. Removing Duplicate Rows

To remove duplicate rows from your DataFrame, you can use the drop_duplicates() method, which applies the same logic as duplicated() and accepts the same subset and keep parameters.

Python
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:\n", df_no_duplicates)

Output:

Bash
DataFrame after removing duplicates:
       Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
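
The same result can be obtained by negating the duplicated() mask with ~ and using it as a boolean filter, which is handy when you already need the mask for other checks. A brief sketch:

Python
# Equivalent to df.drop_duplicates() with the default keep='first'
df_no_duplicates = df[~df.duplicated()]
print(df_no_duplicates)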

2. Counting Duplicates

To count the number of duplicate rows, you can sum the Boolean Series returned by duplicated().

Python
# Count the number of duplicate rows
num_duplicates = df.duplicated().sum()
print("\nNumber of duplicate rows:", num_duplicates)

Output:

Bash
Number of duplicate rows: 2
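
The same pattern works together with subset. For example, to count how many rows repeat a value that appeared earlier in a single column (illustrative, using the Name/Age/City DataFrame from above):

Python
# Count rows whose 'Name' already appeared in an earlier row
num_duplicate_names = df.duplicated(subset=['Name']).sum()
print("Number of duplicate names:", num_duplicate_names)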

Conclusion

The duplicated() method is a versatile and essential tool for data cleaning and preprocessing in Pandas. By understanding its parameters and how to apply them, you can efficiently identify and handle duplicate rows in your datasets, ensuring the accuracy and reliability of your data analyses.

Whether you’re preparing data for machine learning, conducting exploratory data analysis, or cleaning large datasets, mastering the duplicated() method will significantly enhance your data manipulation skills in Python.
