Pandas DataFrame describe() Method – Explained with examples

Pandas is a powerful library in Python for data manipulation and analysis, widely used in data science and machine learning. One of the most useful functions in Pandas is the describe() method, which provides a quick overview of the central tendencies, dispersion, and shape of a dataset’s distribution. This blog post will guide you through the describe() method, its features, and how to use it effectively.

What is the describe() Method?

The describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. It provides a convenient way to get a snapshot of the data, which can be very useful for initial data exploration and analysis.

Basic Usage

Let’s start with a simple example. First, you need to have Pandas installed. You can install it using pip if you haven’t already:

Bash

pip install pandas

Now, let’s import Pandas and create a DataFrame:

Python

import pandas as pd

# Creating a simple DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [10, 20, 30, 40, 50]
}

df = pd.DataFrame(data)

Using the describe() method on this DataFrame is straightforward:

Python

df.describe()

The output will look something like this:

Bash

              A         B          C
count  5.000000  5.000000   5.000000
mean   3.000000  7.000000  30.000000
std    1.581139  1.581139  15.811388
min    1.000000  5.000000  10.000000
25%    2.000000  6.000000  20.000000
50%    3.000000  7.000000  30.000000
75%    4.000000  8.000000  40.000000
max    5.000000  9.000000  50.000000

Understanding the Output

count: The number of non-null entries in each column.
mean: The average of each column.
std: The sample standard deviation of each column (normalized by N-1), which measures how spread out the values are around the mean.
min: The minimum value in each column.
25%: The 25th percentile (first quartile), which is the value below which 25% of the data falls.
50%: The 50th percentile (median), which is the value below which 50% of the data falls.
75%: The 75th percentile (third quartile), which is the value below which 75% of the data falls.
max: The maximum value in each column.
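unique: The number of distinct values (reported for non-numeric columns only).
top: The most frequent value (reported for non-numeric columns only).
freq: The number of times the top value occurs (reported for non-numeric columns only).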
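
To make these definitions concrete, the following sketch recomputes the statistics for column A one at a time and prints them next to what describe() reports. It assumes the three-column DataFrame from the Basic Usage section; the names summary and manual are illustrative only.

```python
import pandas as pd

# Same data as the Basic Usage example.
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [10, 20, 30, 40, 50],
})

summary = df.describe()

# Recompute each statistic for column A with the matching Series method.
col = df['A']
manual = pd.Series({
    'count': col.count(),        # non-null entries
    'mean': col.mean(),
    'std': col.std(),            # sample standard deviation (ddof=1)
    'min': col.min(),
    '25%': col.quantile(0.25),
    '50%': col.quantile(0.50),   # the median
    '75%': col.quantile(0.75),
    'max': col.max(),
})

print(manual)
print(summary['A'])
```

Both printed Series should list the same eight values, which shows that describe() is a convenience wrapper around these individual calculations.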

Customizing the describe() Method

The describe() method can be customized to include different types of data or to show more specific statistics.
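
As one way of showing more specific statistics, describe() accepts a percentiles argument that replaces the default 25%, 50%, and 75% rows. The sketch below is a minimal illustration on a single-column DataFrame; the chosen percentiles and the name custom are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# Ask for the 10th, 50th, and 90th percentiles instead of the quartiles.
# Values are given as fractions between 0 and 1.
custom = df.describe(percentiles=[0.1, 0.5, 0.9])

print(custom)
```

The resulting index contains 10%, 50%, and 90% rows in place of the quartiles, while count, mean, std, min, and max are unchanged.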

Including Non-Numeric Columns

By default, describe() only includes numeric columns. If your DataFrame includes non-numeric data and you want to include it in the summary, you can use the include parameter:

Python

data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [10, 20, 30, 40, 50],
    'D': ['a', 'b', 'c', 'd', 'e']
}

df = pd.DataFrame(data)
df.describe(include='all')

This will include the non-numeric column in the output, showing different statistics such as the number of unique values, the most frequent value, and its frequency.

Output:

Markdown

               A         B          C    D
count   5.000000  5.000000   5.000000    5
unique       NaN       NaN        NaN    5
top          NaN       NaN        NaN    a
freq         NaN       NaN        NaN    1
mean    3.000000  7.000000  30.000000  NaN
std     1.581139  1.581139  15.811388  NaN
min     1.000000  5.000000  10.000000  NaN
25%     2.000000  6.000000  20.000000  NaN
50%     3.000000  7.000000  30.000000  NaN
75%     4.000000  8.000000  40.000000  NaN
max     5.000000  9.000000  50.000000  NaN

Descriptive Statistics for Specific Columns

You can also restrict the summary to columns of particular data types. The include parameter of describe() accepts several values that control which columns are summarized. Here’s a detailed explanation of each:

1. include=None (default)

When include=None, describe() summarizes only the numeric columns of a mixed-type DataFrame.

Python

df.describe() # or df.describe(include=None)

Output:

Markdown

              A         B          C
count  5.000000  5.000000   5.000000
mean   3.000000  7.000000  30.000000
std    1.581139  1.581139  15.811388
min    1.000000  5.000000  10.000000
25%    2.000000  6.000000  20.000000
50%    3.000000  7.000000  30.000000
75%    4.000000  8.000000  40.000000
max    5.000000  9.000000  50.000000

2. include=['O'] or include='object'

When include=['O'] or include='object', describe provides a summary of all object (string) columns.

Python

df.describe(include=['O'])

Output:

Markdown

         D
count    5
unique   5
top      a
freq     1

3. include=['number'] or include='number'

When include=['number'] or include='number', describe provides a summary of all numeric columns.

Python

df.describe(include=['number'])

Output:

Markdown

              A         B          C
count  5.000000  5.000000   5.000000
mean   3.000000  7.000000  30.000000
std    1.581139  1.581139  15.811388
min    1.000000  5.000000  10.000000
25%    2.000000  6.000000  20.000000
50%    3.000000  7.000000  30.000000
75%    4.000000  8.000000  40.000000
max    5.000000  9.000000  50.000000

4. include='all'

When include='all' (passed as a plain string, not inside a list), describe() provides a summary of all columns, regardless of their data type.

Python

df.describe(include='all')

Output:

Markdown

               A         B          C    D
count   5.000000  5.000000   5.000000    5
unique       NaN       NaN        NaN    5
top          NaN       NaN        NaN    a
freq         NaN       NaN        NaN    1
mean    3.000000  7.000000  30.000000  NaN
std     1.581139  1.581139  15.811388  NaN
min     1.000000  5.000000  10.000000  NaN
25%     2.000000  6.000000  20.000000  NaN
50%     3.000000  7.000000  30.000000  NaN
75%     4.000000  8.000000  40.000000  NaN
max     5.000000  9.000000  50.000000  NaN

For numeric columns (A, B, C): Provides the same statistics as include=None.
For object columns (D): Provides the same statistics as include=['O'].

5. include=['category']

When include=['category'], describe provides a summary of all categorical columns.

Python

data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': pd.Categorical(['small', 'large', 'large', 'small', 'medium'])
}

df = pd.DataFrame(data)
df.describe(include=['category'])

Output:

Markdown

              C
count         5
unique        3
top       large
freq          2

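Note that 'small' and 'large' each appear twice in this column, so they tie for most frequent; when that happens, the value reported as top is chosen arbitrarily among the tied values.

By using the include parameter in the describe() method, you can tailor the summary statistics to focus on specific data types, providing a clear and concise overview of your dataset’s characteristics.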
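
The include parameter filters by data type rather than by column name. If you want to summarize specific named columns, or everything except a given type, a sketch along the following lines may help. It reuses the categorical DataFrame from the previous example; the names by_label and non_numeric are illustrative, and exclude is the counterpart parameter to include.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': pd.Categorical(['small', 'large', 'large', 'small', 'medium']),
})

# Select columns by label first, then describe only those.
by_label = df[['A', 'B']].describe()

# Summarize every column except the numeric ones.
non_numeric = df.describe(exclude=['number'])

print(by_label)
print(non_numeric)
```

Selecting columns before calling describe() keeps the filtering explicit, while exclude is convenient when it is easier to name the types you do not want.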

Handling Missing Data

If your DataFrame contains missing data, describe() will still work, because it ignores NaN values when computing each statistic. If you prefer to handle missing data explicitly before calling describe(), you can use methods like dropna() or fillna():

Python

# Dropping rows with any NaN values
df.dropna().describe()

Here, dropna() removes every row that contains at least one NaN value, and describe() then generates descriptive statistics for the numeric columns of the remaining rows.

Python

# Filling NaN values with a specific value
df.fillna(0).describe()

Here, fillna(0) replaces all NaN values with 0, and describe() then generates descriptive statistics for the numeric columns with those zeros counted as real values.

dropna(): Useful when you want to remove any rows that contain missing values, resulting in a potentially smaller DataFrame but with no NaNs.
fillna(0): Replaces NaNs with a specific value (0 in this case), retaining the original size of the DataFrame but changing the data distribution by filling in the missing values.
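
The difference between the three approaches is easiest to see on data that actually contains gaps. The sketch below uses a small hypothetical DataFrame with one NaN in each column (not the data from the earlier examples) and compares the count and mean rows; the names as_is, dropped, and filled are illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [5, np.nan, 7, 8, 9],
})

as_is = df.describe()              # NaNs skipped column by column
dropped = df.dropna().describe()   # rows with any NaN removed first
filled = df.fillna(0).describe()   # NaNs replaced by 0 first

rows = ['count', 'mean']
print(as_is.loc[rows])
print(dropped.loc[rows])
print(filled.loc[rows])
```

For column A, the count is 4 when describe() is called directly, 3 after dropna() (two rows are discarded because each has a NaN in one of the columns), and 5 after fillna(0), where the added zero also pulls the mean down from 3.0 to 2.4.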

Conclusion

The describe() method in Pandas is a powerful tool for getting a quick statistical summary of your dataset. It’s an essential part of the data analysis workflow, helping you understand the distribution, central tendencies, and variability of your data. By customizing the method with parameters like include, you can tailor the output to better suit your needs, whether you’re dealing with numeric or non-numeric data.

By leveraging the describe() method, you can quickly gain insights into your data, which can inform further analysis and decision-making. Whether you’re a beginner or an experienced data scientist, understanding and using describe() effectively is a valuable skill in your data manipulation toolkit.
