Pandas is a powerful and versatile data manipulation library for Python. One of the most common tasks when working with data is handling missing values. Pandas provides several methods to deal with missing data, and one of the most frequently used is DataFrame.dropna(). In this blog post, we’ll explore the dropna() method in detail, with examples to illustrate its use.
What is dropna()?
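The dropna() method removes missing values from a DataFrame. Pandas represents missing numeric values as NaN (Not a Number), and it also treats None and NaT (the missing-value marker for datetimes) as missing. With dropna() you can remove the rows or columns that contain missing values, and several parameters control exactly what gets dropped. By default it returns a new DataFrame and leaves the original unchanged.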
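To make "missing" concrete, here is a minimal sketch on a made-up DataFrame (the column names num, text, and when are illustrative assumptions, not part of the examples below). It counts the missing entries per column with isna() and then drops the affected rows:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: each column holds a different kind of missing marker
df = pd.DataFrame({
    "num": [1.0, np.nan, 3.0, 4.0],          # float NaN
    "text": ["a", "b", None, "d"],           # Python None
    "when": pd.to_datetime(
        ["2024-01-01", "2024-01-02", "2024-01-03", None]
    ),                                       # None becomes NaT
})

# isna() flags all three markers as missing
missing_per_column = df.isna().sum()
print(missing_per_column)

# dropna() therefore removes every row that holds any of them
cleaned = df.dropna()
print(cleaned)
```

Only the row at index 0 has no missing entry in any column, so it is the only row left in cleaned.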
Syntax
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters
- axis: {0 or ‘index’, 1 or ‘columns’}, default 0
  - Determines whether rows or columns that contain missing values are removed.
  - 0 or ‘index’: Drop rows with missing values.
  - 1 or ‘columns’: Drop columns with missing values.
- how: {‘any’, ‘all’}, default ‘any’
  - Determines how many missing values a row or column must contain before it is removed.
  - ‘any’: If any NA values are present, drop that row or column.
  - ‘all’: If all values are NA, drop that row or column.
- thresh: int, optional
  - Minimum number of non-NA values required to keep a row or column. If set, rows or columns with fewer than thresh non-NA values are dropped. Recent pandas versions do not allow thresh and how to be passed together.
- subset: array-like, optional
  - Labels along the other axis to consider. For example, when dropping rows, this is the list of columns to check for missing values.
- inplace: bool, default False
  - If True, modify the DataFrame in place and return None.
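The parameters can be combined. The following sketch, on a small made-up DataFrame, shows three combinations; the variable names are illustrative only:

```python
import pandas as pd

# Hypothetical sample data: column B is entirely missing
df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [None, None, None, None],
    "C": [9, 10, None, None],
})

# axis + how: drop only the columns in which every value is missing
no_empty_cols = df.dropna(axis=1, how="all")
print(list(no_empty_cols.columns))

# subset + how: drop rows with a missing value in column A or column C
complete_a_c = df.dropna(subset=["A", "C"], how="any")
print(complete_a_c.index.tolist())

# axis + thresh: keep only the columns with at least 3 non-NA values
dense_cols = df.dropna(axis="columns", thresh=3)
print(list(dense_cols.columns))
```

Column A has three non-NA values, C has two, and B has none, so thresh=3 keeps only A.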
Examples
Let’s go through some examples to understand how dropna() works.
Example 1: Dropping Rows with Any Missing Values
import pandas as pd
# Create a sample DataFrame
data = {
'A': [1, 2, None, 4],
'B': [5, None, None, 8],
'C': [9, 10, 11, None]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Drop rows with any missing values
df_dropped = df.dropna()
print("\nDataFrame after dropping rows with any missing values:")
print(df_dropped)
Output:
Original DataFrame:
A B C
0 1.0 5.0 9.0
1 2.0 NaN 10.0
2 NaN NaN 11.0
3 4.0 8.0 NaN
DataFrame after dropping rows with any missing values:
A B C
0 1.0 5.0 9.0
Explanation:
In this example, we create a DataFrame with some missing values (NaN). The dropna() method is called with default parameters, which means it drops any row that contains at least one missing value. As a result, only the first row is retained, as it has no missing values.
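One detail worth knowing: dropna() keeps the original index labels of the rows that survive. If you want a fresh 0-based index, one option is to chain reset_index(drop=True). The sketch below uses a variant of the sample data in which column C is complete, so that two rows remain:

```python
import pandas as pd

# Variant of the sample data (assumption: column C has no missing values)
df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [5, None, None, 8],
    "C": [9, 10, 11, 4],
})

cleaned = df.dropna()
print(cleaned.index.tolist())      # original labels are kept

# Renumber the remaining rows from 0; drop=True discards the old labels
renumbered = cleaned.reset_index(drop=True)
print(renumbered.index.tolist())
```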
Example 2: Dropping Columns with Any Missing Values
# Create a sample DataFrame
data = {
'A': [1, 2, None, 4],
'B': [5, None, None, 8],
'C': [9, 10, 11, 4]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Drop columns with any missing values
df_dropped = df.dropna(axis=1)
print("\nDataFrame after dropping columns with any missing values:")
print(df_dropped)
Output:
Original DataFrame:
A B C
0 1.0 5.0 9
1 2.0 NaN 10
2 NaN NaN 11
3 4.0 8.0 4
DataFrame after dropping columns with any missing values:
C
0 9
1 10
2 11
3 4
Explanation:
Here, the dropna() method is called with axis=1, which means it will drop any column containing at least one missing value. Consequently, columns ‘A’ and ‘B’ are removed because they both have NaN values, leaving only column ‘C’.
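Dropping whole columns can discard a lot of data, so it can help to check first which columns would be affected. A minimal sketch, reusing the data from this example (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [5, None, None, 8],
    "C": [9, 10, 11, 4],
})

# isna().any() is True for each column that has at least one missing value
cols_with_na = df.columns[df.isna().any()].tolist()
print("Columns that dropna(axis=1) would remove:", cols_with_na)

remaining = df.dropna(axis=1)
print("Columns that remain:", list(remaining.columns))
```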
Example 3: Dropping Rows with All Missing Values
# Create a sample DataFrame with all missing values in some rows
data = {
'A': [1, 2, None, 4],
'B': [None, None, None, None],
'C': [9, 10, None, None]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Drop rows where all values are missing
df_dropped = df.dropna(how='all')
print("\nDataFrame after dropping rows where all values are missing:")
print(df_dropped)
Output:
Original DataFrame:
A B C
0 1.0 NaN 9.0
1 2.0 NaN 10.0
2 NaN NaN NaN
3 4.0 NaN NaN
DataFrame after dropping rows where all values are missing:
A B C
0 1.0 NaN 9.0
1 2.0 NaN 10.0
3 4.0 NaN NaN
Explanation:
In this example, the row at index 2 consists entirely of missing values. By using dropna(how='all'), we instruct Pandas to drop only the rows where all the values are missing. Every row that has at least one non-missing value is retained, including row 3, which has a value only in column ‘A’.
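The difference from the default how='any' is easy to see on the same data. Because column ‘B’ is missing in every row, the default setting would remove every row. A short sketch of the comparison:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [None, None, None, None],
    "C": [9, 10, None, None],
})

# Default how='any': every row has a missing value in B, so nothing survives
rows_any = df.dropna()
print(len(rows_any))

# how='all': only the fully empty row (index 2) is removed
rows_all = df.dropna(how="all")
print(rows_all.index.tolist())
```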
Example 4: Using ‘thresh’ Parameter
# Create a sample DataFrame with all missing values in some rows
data = {
'A': [1, 2, None, 4],
'B': [None, None, None, None],
'C': [9, 10, None, None]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Drop rows with fewer than 2 non-NA values
df_dropped = df.dropna(thresh=2)
print("\nDataFrame after dropping rows with fewer than 2 non-NA values:")
print(df_dropped)
Output:
Original DataFrame:
A B C
0 1.0 None 9.0
1 2.0 None 10.0
2 NaN None NaN
3 4.0 None NaN
DataFrame after dropping rows with fewer than 2 non-NA values:
A B C
0 1.0 None 9.0
1 2.0 None 10.0
Explanation:
The thresh parameter sets the minimum number of non-missing values a row must have to be retained. In this example, thresh=2 means that any row with fewer than two non-missing values is dropped. Rows 0 and 1 each have two non-missing values and are kept. The row at index 2 (no non-missing values) and the row at index 3 (only one non-missing value) are filtered out.
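To see why those rows are dropped, you can count the non-missing values per row yourself. This sketch uses notna().sum(axis=1) on the same data; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [None, None, None, None],
    "C": [9, 10, None, None],
})

# Number of non-missing values in each row
non_na_counts = df.notna().sum(axis=1)
print(non_na_counts.tolist())

# Rows whose count is below thresh are dropped
kept = df.dropna(thresh=2)
print(kept.index.tolist())
```

The counts are 2, 2, 0, and 1, so only the first two rows reach the threshold of 2.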
Example 5: Using ‘subset’ Parameter
# Create a sample DataFrame
data = {
'A': [1, 2, None, 4],
'B': [5, None, None, 8],
'C': [9, 10, 11, 4]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Drop rows based on missing values in column 'B'
df_dropped = df.dropna(subset=['B'])
print("\nDataFrame after dropping rows based on missing values in column 'B':")
print(df_dropped)
Output:
Original DataFrame:
A B C
0 1.0 5.0 9
1 2.0 NaN 10
2 NaN NaN 11
3 4.0 8.0 4
DataFrame after dropping rows based on missing values in column 'B':
A B C
0 1.0 5.0 9
3 4.0 8.0 4
Explanation:
The subset parameter allows you to specify a particular column (or columns) to consider when dropping rows. In this example, we drop rows based on missing values in column ‘B’. Only the rows where column ‘B’ has non-missing values are retained, resulting in the removal of rows 1 and 2.
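subset also accepts several columns, and it can be combined with how. The sketch below, on the same data, contrasts dropping rows where either ‘A’ or ‘B’ is missing with dropping rows only where both are missing:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [5, None, None, 8],
    "C": [9, 10, 11, 4],
})

# Drop a row if A or B is missing (how='any' is the default)
either_missing = df.dropna(subset=["A", "B"])
print(either_missing.index.tolist())

# Drop a row only if A and B are both missing
both_missing = df.dropna(subset=["A", "B"], how="all")
print(both_missing.index.tolist())
```

Row 1 is missing only ‘B’, so it is dropped in the first call but kept in the second.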
Example 6: Dropping Missing Values Inplace
# Create a sample DataFrame
data = {
'A': [1, 2, None, 4],
'B': [5, None, None, 8],
'C': [9, 10, 11, 4]
}
df = pd.DataFrame(data)
print("Original DataFrame:")
print(df)
# Drop rows with any missing values inplace
df.dropna(inplace=True)
print("\nDataFrame after dropping rows with any missing values inplace:")
print(df)
Output:
Original DataFrame:
A B C
0 1.0 5.0 9
1 2.0 NaN 10
2 NaN NaN 11
3 4.0 8.0 4
DataFrame after dropping rows with any missing values inplace:
A B C
0 1.0 5.0 9
3 4.0 8.0 4
Explanation:
By setting inplace=True, the dropna() method modifies the original DataFrame directly and returns None instead of a new DataFrame. This is useful when you want to avoid creating a new variable and apply the changes straight to the original DataFrame. As before, rows with any missing values are dropped, leaving rows 0 and 3.
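Because the call returns None when inplace=True, assigning its result back to the variable is a common mistake. A minimal sketch of the behavior (result is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [5, None, None, 8],
    "C": [9, 10, 11, 4],
})

# With inplace=True the return value is None; df itself is modified
result = df.dropna(inplace=True)
print(result)
print(df.index.tolist())

# Pitfall: writing `df = df.dropna(inplace=True)` would replace df with None
```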
Conclusion
The dropna() method in Pandas provides a flexible and powerful way to handle missing values in your data. By understanding and utilizing its various parameters, you can clean your data according to your specific requirements. Whether you need to drop rows or columns, consider specific subsets, or apply a threshold, dropna() makes it easy to manage missing data efficiently.