When working with data in Python, ensuring the integrity and quality of your dataset is crucial. One common issue is the presence of duplicate entries, which can skew analyses and lead to misleading results. The duplicated() method in Pandas is a powerful tool for identifying duplicate rows in your DataFrame. In this blog, we’ll delve into the functionality of the duplicated() method, its parameters, and practical examples to help you master its use.
What is the duplicated() Method?
The duplicated() method in Pandas returns a Boolean Series indicating whether each row is a duplicate of an earlier row. It provides a quick way to identify and handle duplicate entries within your DataFrame.
Syntax
DataFrame.duplicated(subset=None, keep='first')
Parameters
- subset: This parameter allows you to specify a subset of columns to consider when identifying duplicates. By default, all columns are considered.
- keep: Determines which duplicates (if any) to mark as True.
  - 'first': Mark duplicates as True except for the first occurrence. This is the default behavior.
  - 'last': Mark duplicates as True except for the last occurrence.
  - False: Mark all duplicates as True.
Returns
A Series of Boolean values, where True indicates that a row is a duplicate.
Examples
Let’s explore some practical examples to understand how duplicated() works.
Example 1: Basic Usage
import pandas as pd
# Sample DataFrame
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
'Age': [25, 30, 35, 25, 30],
'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
}
df = pd.DataFrame(data)
print("Original DataFrame:\n", df)
# Identify duplicate rows
duplicates = df.duplicated()
print("\nDuplicated rows:\n", duplicates)
Output:
Original DataFrame:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
3 Alice 25 New York
4 Bob 30 Los Angeles
Duplicated rows:
0 False
1 False
2 False
3 True
4 True
dtype: bool
In this example, the duplicated() method identifies rows 3 and 4 as duplicates of rows 0 and 1, respectively.
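Because duplicated() returns a Boolean Series, you can also use it directly as a mask to view the duplicate rows themselves rather than just the flags. A small sketch using the same DataFrame:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'Bob'],
    'Age': [25, 30, 35, 25, 30],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Los Angeles']
}
df = pd.DataFrame(data)

# Boolean indexing with the mask selects only the rows flagged as duplicates
duplicate_rows = df[df.duplicated()]
print(duplicate_rows)  # rows 3 and 4
```

Inverting the mask with `~` gives the opposite selection, i.e. the rows that would survive deduplication.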
Example 2: Using the ‘keep’ Parameter
# Creating a DataFrame with duplicate values
df = pd.DataFrame({
'A': [1, 2, 2, 3, 4, 4, 4],
'B': ['a', 'b', 'b', 'c', 'd', 'd', 'd']
})
# Mark duplicates except for the last occurrence
duplicates_last = df.duplicated(keep='last')
print("\nDuplicated rows (keep='last'):\n", duplicates_last)
# Mark all duplicates
duplicates_all = df.duplicated(keep=False)
print("\nDuplicated rows (keep=False):\n", duplicates_all)
Output:
Duplicated rows (keep='last'):
0 False
1 True
2 False
3 False
4 True
5 True
6 False
dtype: bool
Duplicated rows (keep=False):
0 False
1 True
2 True
3 False
4 True
5 True
6 True
dtype: bool
Here, setting keep='last' marks every occurrence except the last as a duplicate: the last occurrence of each duplicated row is treated as unique, and all earlier occurrences are marked True.
On the other hand, setting keep=False marks all duplicate entries: both the original and every repeat are set to True, so only rows that appear exactly once in the DataFrame remain False.
Example 3: Using the ‘subset’ Parameter
Note that df was reassigned in Example 2 and no longer has a 'Name' column, so we first rebuild the DataFrame from Example 1.
# Recreate the DataFrame from Example 1
df = pd.DataFrame(data)
# Identify duplicates based on the 'Name' column only
duplicates_subset = df.duplicated(subset=['Name'])
print("\nDuplicated rows based on 'Name' column:\n", duplicates_subset)
Output:
Duplicated rows based on 'Name' column:
0 False
1 False
2 False
3 True
4 True
dtype: bool
In this case, duplicates are identified based solely on the ‘Name’ column, ignoring other columns.
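The subset parameter also accepts multiple columns, in which case rows count as duplicates only when all of the listed columns match. A minimal sketch with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Alice'],
    'Age':  [25, 30, 25, 40],
    'City': ['New York', 'Los Angeles', 'Boston', 'New York']
})

# Rows match only if BOTH 'Name' and 'Age' are equal; 'City' is ignored
dupes = df.duplicated(subset=['Name', 'Age'])
print(dupes)
```

Row 2 is flagged because it repeats row 0's Name and Age, even though the City differs; row 3 is not flagged because its Age is unique for Alice.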
Practical Applications
1. Removing Duplicate Rows
To remove duplicate rows from your DataFrame, use the drop_duplicates() method, which applies the same duplicate-detection logic as duplicated().
# Remove duplicate rows
df_no_duplicates = df.drop_duplicates()
print("\nDataFrame after removing duplicates:\n", df_no_duplicates)
Output:
DataFrame after removing duplicates:
Name Age City
0 Alice 25 New York
1 Bob 30 Los Angeles
2 Charlie 35 Chicago
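drop_duplicates() accepts the same subset and keep parameters as duplicated(), so the deduplication rule can be tuned the same way. A short sketch with a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age':  [25, 30, 26],
})

# Deduplicate on 'Name' only, keeping the LAST occurrence of each name
result = df.drop_duplicates(subset=['Name'], keep='last')
print(result)
```

Here the first Alice row is dropped and the later one (Age 26) survives; the original row order of the survivors is preserved.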
2. Counting Duplicates
To count the number of duplicate rows, you can sum the Boolean Series returned by duplicated().
# Count the number of duplicate rows
num_duplicates = df.duplicated().sum()
print("\nNumber of duplicate rows:", num_duplicates)
Output:
Number of duplicate rows: 2
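The count depends on the keep setting: the default counts only the extra copies, while keep=False counts every row involved in duplication. A small sketch illustrating the difference:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 2, 3, 3, 3]})

# Default (keep='first'): first occurrences are not counted
extra_copies = df.duplicated().sum()            # 3
# keep=False: every row that has at least one duplicate is counted
all_involved = df.duplicated(keep=False).sum()  # 5

print(extra_copies, all_involved)
```

The first figure tells you how many rows drop_duplicates() would remove; the second tells you how many rows are affected by duplication at all.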
Conclusion
The duplicated() method is a versatile and essential tool for data cleaning and preprocessing in Pandas. By understanding its parameters and how to apply them, you can efficiently identify and handle duplicate rows in your datasets, ensuring the accuracy and reliability of your data analyses.
Whether you’re preparing data for machine learning, conducting exploratory data analysis, or cleaning large datasets, mastering the duplicated() method will significantly enhance your data manipulation skills in Python.