Pandas DataFrame.dropna() Method – Explained with examples

Pandas is a powerful and versatile data manipulation library for Python. One of the most common tasks when working with data is handling missing values. Pandas provides several methods to deal with missing data, and one of the most frequently used is DataFrame.dropna(). In this blog post, we’ll explore the dropna() method in detail, with examples to illustrate its use.

What is dropna()?

The dropna() method removes missing values from a DataFrame. Pandas treats NaN (Not a Number), None, and NaT (the missing marker for datetime data) as missing. dropna() lets you remove the rows or columns that contain such values, with several options to control the behavior.
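To see what pandas counts as missing, here is a minimal sketch. The column names (num, text, when) and the values are made up for illustration, and it assumes NumPy is installed alongside pandas. It checks the frame with isna() before calling dropna().

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 1 is missing in every column, row 2 holds an empty string
df = pd.DataFrame({
    "num": [1.0, np.nan, 3.0],
    "text": ["a", None, ""],
    "when": pd.to_datetime(["2024-01-01", None, "2024-01-03"]),
})

# isna() flags NaN, None and NaT; an empty string is a real value, not a missing one
na_mask = df.isna()
print(na_mask)

# dropna() removes only the row that isna() flagged
cleaned = df.dropna()
print(cleaned)
```

Row 2 survives because an empty string is not treated as missing. If such placeholders should count as missing in your data, they need to be converted to NaN first.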

Syntax
Python
DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Parameters
  • axis: {0 or 'index', 1 or 'columns'}, default 0
      Determines whether rows or columns that contain missing values are removed.
      • 0 or 'index': Drop rows with missing values.
      • 1 or 'columns': Drop columns with missing values.
  • how: {'any', 'all'}, default 'any'
      Determines how many missing values a row or column must contain before it is removed.
      • 'any': Drop the row or column if any of its values are NA.
      • 'all': Drop the row or column only if all of its values are NA.
  • thresh: int, optional
      The minimum number of non-NA values a row or column must have to be kept. If set, rows or columns with fewer than thresh non-NA values are dropped. In recent pandas versions, thresh cannot be combined with how.
  • subset: array-like, optional
      Labels along the other axis to consider. For example, if you are dropping rows, this is the list of columns to check for missing values.
  • inplace: bool, default False
      If True, perform the operation in place and return None.
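To make the parameter list a little more concrete, here is a small sketch on made-up data. It shows that the string aliases for axis behave like the integers, and that thresh can be applied along columns as well as rows.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a different number of non-NA values per column
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],        # 3 non-NA values
    "B": [np.nan, np.nan, np.nan, 8.0],  # 1 non-NA value
    "C": [9.0, 10.0, 11.0, 12.0],        # 4 non-NA values
})

# axis='index' is an alias for axis=0 (drop rows)
same_rows = df.dropna(axis="index").equals(df.dropna(axis=0))

# axis='columns' is an alias for axis=1; keep columns with at least 3 non-NA values
cols_kept = df.dropna(axis="columns", thresh=3)

print(same_rows)
print(list(cols_kept.columns))
```

Column B is dropped because it has only one non-NA value, which is below the threshold of 3.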

Examples

Let’s go through some examples to understand how dropna() works.

Example 1: Dropping Rows with Any Missing Values
Python
import pandas as pd

# Create a sample DataFrame
data = {
    'A': [1, 2, None, 4],
    'B': [5, None, None, 8],
    'C': [9, 10, 11, None]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Drop rows with any missing values
df_dropped = df.dropna()

print("\nDataFrame after dropping rows with any missing values:")
print(df_dropped)

Output:

Markdown
Original DataFrame:

     A    B     C
0  1.0  5.0   9.0
1  2.0  NaN  10.0
2  NaN  NaN  11.0
3  4.0  8.0   NaN

DataFrame after dropping rows with any missing values:

     A    B    C
0  1.0  5.0  9.0

Explanation:

In this example, we create a DataFrame with some missing values (NaN). The dropna() method is called with default parameters, which means it drops any row that contains at least one missing value. As a result, only the first row is retained, as it has no missing values.
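Before dropping anything, it can help to check how much data would be lost. The sketch below reuses the data from Example 1; the variable names missing_per_column and rows_removed are just illustrative. It also shows that the original DataFrame is left untouched by a plain dropna() call.

```python
import pandas as pd

# Same data as Example 1
df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [5, None, None, 8],
    "C": [9, 10, 11, None],
})

# Count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# dropna() returns a new DataFrame; df itself keeps all four rows
df_dropped = df.dropna()
rows_removed = len(df) - len(df_dropped)
print(f"Rows removed: {rows_removed} of {len(df)}")
```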

Example 2: Dropping Columns with Any Missing Values
Python
import pandas as pd

# Create a sample DataFrame
data = {
    'A': [1, 2, None, 4],
    'B': [5, None, None, 8],
    'C': [9, 10, 11, 4]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Drop columns with any missing values
df_dropped = df.dropna(axis=1)

print("\nDataFrame after dropping columns with any missing values:")
print(df_dropped)

Output:

Markdown
Original DataFrame:

     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0   4

DataFrame after dropping columns with any missing values:
    C
0   9
1  10
2  11
3   4

Explanation:

Here, the dropna() method is called with axis=1, which means it will drop any column containing at least one missing value. Consequently, columns ‘A’ and ‘B’ are removed because they both have NaN values, leaving only column ‘C’.
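Dropping every column that has a single gap can be too aggressive. As a hedged sketch on made-up data, the snippet below compares how='any' with how='all' when working along columns.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column B is completely empty, A and C have a few gaps
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [np.nan] * 4,
    "C": [9.0, 10.0, np.nan, np.nan],
})

# how='all' removes only columns in which every value is missing
only_empty_cols_dropped = df.dropna(axis=1, how="all")
print(list(only_empty_cols_dropped.columns))

# how='any' (the default) removes every column that has at least one gap
any_missing_cols_dropped = df.dropna(axis=1, how="any")
print(any_missing_cols_dropped.shape)
```

With how='any', every column here contains at least one missing value, so the result has four rows and no columns.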

Example 3: Dropping Rows with All Missing Values
Python
import pandas as pd

# Create a sample DataFrame in which one row (index 2) has only missing values
data = {
    'A': [1, 2, None, 4],
    'B': [None, None, None, None],
    'C': [9, 10, None, None]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Drop rows where all values are missing
df_dropped = df.dropna(how='all')

print("\nDataFrame after dropping rows where all values are missing:")
print(df_dropped)

Output:

Markdown
Original DataFrame:

     A     B     C
0  1.0  None   9.0
1  2.0  None  10.0
2  NaN  None   NaN
3  4.0  None   NaN

DataFrame after dropping rows where all values are missing:

     A     B     C
0  1.0  None   9.0
1  2.0  None  10.0
3  4.0  None   NaN

Explanation:

In this example, row 2 consists entirely of missing values. By using dropna(how='all'), we instruct Pandas to drop only rows where every value is missing, so row 2 is removed. Rows 0, 1, and 3 are retained because each has at least one non-missing value.
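One way to think about how='all' is as a boolean mask built from isna(). The sketch below, on made-up data, builds that mask by hand and checks that it gives the same result. This is an illustration of the behavior, not a description of how pandas implements it internally.

```python
import numpy as np
import pandas as pd

# Hypothetical data: only row 2 is missing in every column
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [np.nan, np.nan, np.nan, np.nan],
    "C": [9.0, 10.0, np.nan, np.nan],
})

# True for rows in which every value is missing
mask_all_missing = df.isna().all(axis=1)

# Keep the rows that are not entirely missing
via_mask = df[~mask_all_missing]
via_dropna = df.dropna(how="all")

print(via_mask.equals(via_dropna))
```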

Example 4: Using ‘thresh’ Parameter
Python
import pandas as pd

# Create a sample DataFrame whose rows have different numbers of missing values
data = {
    'A': [1, 2, None, 4],
    'B': [None, None, None, None],
    'C': [9, 10, None, None]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Drop rows with fewer than 2 non-NA values
df_dropped = df.dropna(thresh=2)

print("\nDataFrame after dropping rows with fewer than 2 non-NA values:")
print(df_dropped)

Output:

Markdown
Original DataFrame:

     A     B     C
0  1.0  None   9.0
1  2.0  None  10.0
2  NaN  None   NaN
3  4.0  None   NaN

DataFrame after dropping rows with fewer than 2 non-NA values:

     A     B     C
0  1.0  None   9.0
1  2.0  None  10.0

Explanation:

The thresh parameter sets the minimum number of non-missing values a row must have to be retained. In this example, thresh=2 means that any row with fewer than two non-missing values is dropped. Rows 0 and 1 each have two non-missing values and are kept. Row 2 (which has no non-missing values) and row 3 (which has only one non-missing value) are filtered out.
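Because thresh takes an integer count, a rule such as "keep rows that are at least 60% filled" has to be converted first. Here is a minimal sketch of one way to do that. The 0.6 cutoff and the name min_fraction are assumptions chosen for illustration.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0],
    "B": [np.nan, np.nan, np.nan, np.nan],
    "C": [9.0, 10.0, np.nan, np.nan],
})

# Hypothetical rule: a row must have at least 60% of its columns filled
min_fraction = 0.6
thresh = math.ceil(min_fraction * df.shape[1])  # 3 columns -> thresh of 2

kept = df.dropna(thresh=thresh)

# The same filter written out by counting non-NA values per row
same = df[df.notna().sum(axis=1) >= thresh]

print(thresh)
print(kept.equals(same))
```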

Example 5: Using ‘subset’ Parameter
Python
import pandas as pd

# Create a sample DataFrame
data = {
    'A': [1, 2, None, 4],
    'B': [5, None, None, 8],
    'C': [9, 10, 11, 4]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Drop rows based on missing values in column 'B'
df_dropped = df.dropna(subset=['B'])

print("\nDataFrame after dropping rows based on missing values in column 'B':")
print(df_dropped)

Output:

Markdown
Original DataFrame:
     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0   4

DataFrame after dropping rows based on missing values in column 'B':
     A    B  C
0  1.0  5.0  9
3  4.0  8.0  4

Explanation:

The subset parameter allows you to specify a particular column (or columns) to consider when dropping rows. In this example, we drop rows based on missing values in column ‘B’. Only the rows where column ‘B’ has non-missing values are retained, resulting in the removal of rows 1 and 2.
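subset also accepts several columns and can be combined with how. The sketch below reuses the data from this example to compare the two settings when checking columns 'A' and 'B' together.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4],
    "B": [5, None, None, 8],
    "C": [9, 10, 11, 4],
})

# Default how='any': drop a row if 'A' or 'B' is missing
either_missing = df.dropna(subset=["A", "B"])
print(list(either_missing.index))

# how='all': drop a row only if both 'A' and 'B' are missing
both_missing = df.dropna(subset=["A", "B"], how="all")
print(list(both_missing.index))
```

Row 1 is missing only 'B', so it is dropped in the first call but kept in the second. Row 2 is missing both and is dropped either way.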

Example 6: Dropping Missing Values Inplace
Python
import pandas as pd

# Create a sample DataFrame
data = {
    'A': [1, 2, None, 4],
    'B': [5, None, None, 8],
    'C': [9, 10, 11, 4]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

# Drop rows with any missing values inplace
df.dropna(inplace=True)

print("\nDataFrame after dropping rows with any missing values inplace:")
print(df)

Output:

Markdown
Original DataFrame:

     A    B   C
0  1.0  5.0   9
1  2.0  NaN  10
2  NaN  NaN  11
3  4.0  8.0   4

DataFrame after dropping rows with any missing values inplace:

     A    B  C
0  1.0  5.0  9
3  4.0  8.0  4

Explanation:

By setting inplace=True, the dropna() method modifies the original DataFrame directly without returning a new DataFrame. This is useful when you want to avoid creating a new variable and directly apply the changes to the original DataFrame. As in Example 5, rows 1 and 2 contain missing values and are dropped, leaving rows 0 and 3.
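An alternative to inplace=True is to assign the result of dropna() back to the same name. The sketch below, using the same data, shows that both styles end up with the same rows. It also uses reset_index(drop=True) to renumber the rows afterwards, which is an optional extra step and not part of dropna() itself.

```python
import pandas as pd

data = {
    "A": [1, 2, None, 4],
    "B": [5, None, None, 8],
    "C": [9, 10, 11, 4],
}

# Style 1: assign the returned DataFrame back to the same name
by_assignment = pd.DataFrame(data)
by_assignment = by_assignment.dropna()

# Style 2: modify the DataFrame in place
in_place = pd.DataFrame(data)
in_place.dropna(inplace=True)

print(by_assignment.equals(in_place))

# Either way the surviving rows keep their old labels (0 and 3);
# reset_index(drop=True) renumbers them from 0
renumbered = by_assignment.reset_index(drop=True)
print(list(renumbered.index))
```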

Conclusion

The dropna() method in Pandas provides a flexible and powerful way to handle missing values in your data. By understanding and utilizing its various parameters, you can clean your data according to your specific requirements. Whether you need to drop rows or columns, consider specific subsets, or apply a threshold, dropna() makes it easy to manage missing data efficiently.
