Pandas is a powerful Python library for data manipulation and analysis, widely used in data science and machine learning. One of its most useful methods is describe(), which provides a quick overview of the central tendency, dispersion, and shape of a dataset’s distribution. This blog post walks you through the describe() method, its features, and how to use it effectively.
What is the describe() Method?
The describe() method generates descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values. It provides a convenient snapshot of the data, which is useful for initial data exploration and analysis.
Basic Usage
Let’s start with a simple example. First, you need to have Pandas installed. You can install it using pip if you haven’t already:
pip install pandas
Now, let’s import Pandas and create a DataFrame:
import pandas as pd
# Creating a simple DataFrame
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [10, 20, 30, 40, 50]
}
df = pd.DataFrame(data)
Using the describe() method on this DataFrame is straightforward:
df.describe()
The output will look something like this:
              A         B          C
count  5.000000  5.000000   5.000000
mean   3.000000  7.000000  30.000000
std    1.581139  1.581139  15.811388
min    1.000000  5.000000  10.000000
25%    2.000000  6.000000  20.000000
50%    3.000000  7.000000  30.000000
75%    4.000000  8.000000  40.000000
max    5.000000  9.000000  50.000000
Understanding the Output
- count: The number of non-null entries in each column.
- mean: The average of each column.
- std: The sample standard deviation of each column (normalized by N-1), which measures the amount of variation or dispersion of the data.
- min: The minimum value in each column.
- 25%: The 25th percentile (first quartile), which is the value below which 25% of the data falls.
- 50%: The 50th percentile (median), which is the value below which 50% of the data falls.
- 75%: The 75th percentile (third quartile), which is the value below which 75% of the data falls.
- max: The maximum value in each column.
- unique: The number of distinct values (reported for non-numeric columns only).
- top: The most frequent value (non-numeric columns only).
- freq: The frequency of the most frequent value (non-numeric columns only).
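Each numeric row described above can be reproduced with the matching Series method, which is a handy way to confirm what a statistic means. The following is a minimal sketch that rebuilds the same example DataFrame; the variable names summary and manual are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [10, 20, 30, 40, 50],
})

summary = df.describe()

# Recompute each statistic for column C by hand.
col = df['C']
manual = pd.Series({
    'count': col.count(),        # number of non-null entries
    'mean': col.mean(),
    'std': col.std(),            # sample standard deviation (ddof=1)
    'min': col.min(),
    '25%': col.quantile(0.25),
    '50%': col.median(),
    '75%': col.quantile(0.75),
    'max': col.max(),
})

print(summary['C'])
print(manual)
```

Both printed Series should list the same values for column C.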
Customizing the describe() Method
The describe() method can be customized to include different types of data or to show more specific statistics.
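As one example of showing more specific statistics, describe() accepts a percentiles parameter that replaces the default quartiles. This parameter is not covered elsewhere in this post, so treat the snippet below as a small sketch rather than a full treatment; the single-column DataFrame is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# Request the 10th and 90th percentiles instead of the default 25%/75%.
# Depending on the pandas version, a 50% (median) row may still be included.
summary = df.describe(percentiles=[0.1, 0.9])

print(summary)
```

Percentiles are given as fractions between 0 and 1, and the corresponding rows are labeled 10% and 90% in the output.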
Including Non-Numeric Columns
By default, describe() only includes numeric columns. If your DataFrame includes non-numeric data and you want to include it in the summary, you can use the include parameter:
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [10, 20, 30, 40, 50],
    'D': ['a', 'b', 'c', 'd', 'e']
}
df = pd.DataFrame(data)
df.describe(include='all')
This will include the non-numeric column in the output, showing different statistics such as the number of unique values, the most frequent value, and its frequency.
Output:
               A         B          C    D
count   5.000000  5.000000   5.000000    5
unique       NaN       NaN        NaN    5
top          NaN       NaN        NaN    a
freq         NaN       NaN        NaN    1
mean    3.000000  7.000000  30.000000  NaN
std     1.581139  1.581139  15.811388  NaN
min     1.000000  5.000000  10.000000  NaN
25%     2.000000  6.000000  20.000000  NaN
50%     3.000000  7.000000  30.000000  NaN
75%     4.000000  8.000000  40.000000  NaN
max     5.000000  9.000000  50.000000  NaN
Descriptive Statistics for Specific Columns
You can also restrict the summary to particular kinds of columns. The include parameter of describe() selects columns by data type. Here’s a detailed explanation of the different include values:
1. include=None (default)
When include=None, describe() provides a summary of all numeric columns in the DataFrame.
df.describe() # or df.describe(include=None)
Output:
              A         B          C
count  5.000000  5.000000   5.000000
mean   3.000000  7.000000  30.000000
std    1.581139  1.581139  15.811388
min    1.000000  5.000000  10.000000
25%    2.000000  6.000000  20.000000
50%    3.000000  7.000000  30.000000
75%    4.000000  8.000000  40.000000
max    5.000000  9.000000  50.000000
2. include=['O'] or include='object'
When include=['O'] or include='object', describe() provides a summary of all object (string) columns.
df.describe(include=['O'])
Output:
        D
count   5
unique  5
top     a
freq    1
3. include=['number'] or include='number'
When include=['number'] or include='number', describe() provides a summary of all numeric columns.
df.describe(include=['number'])
Output:
              A         B          C
count  5.000000  5.000000   5.000000
mean   3.000000  7.000000  30.000000
std    1.581139  1.581139  15.811388
min    1.000000  5.000000  10.000000
25%    2.000000  6.000000  20.000000
50%    3.000000  7.000000  30.000000
75%    4.000000  8.000000  40.000000
max    5.000000  9.000000  50.000000
4. include='all'
When include='all', describe() provides a summary of all columns, regardless of their data type. Pass 'all' as a plain string rather than inside a list, as in the call below.
df.describe(include='all')
Output:
               A         B          C    D
count   5.000000  5.000000   5.000000    5
unique       NaN       NaN        NaN    5
top          NaN       NaN        NaN    a
freq         NaN       NaN        NaN    1
mean    3.000000  7.000000  30.000000  NaN
std     1.581139  1.581139  15.811388  NaN
min     1.000000  5.000000  10.000000  NaN
25%     2.000000  6.000000  20.000000  NaN
50%     3.000000  7.000000  30.000000  NaN
75%     4.000000  8.000000  40.000000  NaN
max     5.000000  9.000000  50.000000  NaN
- For numeric columns (A, B, C): Provides the same statistics as include=None.
- For object columns (D): Provides the same statistics as include=['O'].
5. include=['category']
When include=['category'], describe() provides a summary of all categorical columns.
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': pd.Categorical(['small', 'large', 'large', 'small', 'medium'])
}
df = pd.DataFrame(data)
df.describe(include=['category'])
Output:
            C
count       5
unique      3
top     large
freq        2
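The unique, top, and freq rows can be cross-checked with value_counts() and nunique(). The sketch below uses a standalone Series named sizes (a name chosen here for illustration) holding the same categorical values. Note that 'small' and 'large' both occur twice; the pandas documentation describes the choice of top among tied values as arbitrary, so the reported top may be either one.

```python
import pandas as pd

sizes = pd.Series(pd.Categorical(['small', 'large', 'large', 'small', 'medium']))

# value_counts() tallies how often each category occurs.
counts = sizes.value_counts()
print(counts)

# nunique() gives the number of distinct categories (the 'unique' row).
print(sizes.nunique())
```

The largest count in value_counts() corresponds to the freq row, and the category carrying it is a candidate for top.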
By using the include parameter in the describe() method, you can tailor the summary statistics to focus on specific columns or data types, providing a clear and concise overview of your dataset’s characteristics.
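Since include works on data types rather than column names, focusing on specific named columns is usually done by selecting them before calling describe(). The sketch below shows that pattern, plus the exclude parameter, which is the counterpart of include. The DataFrame is rebuilt here so the snippet runs on its own, and the names by_name and non_numeric are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [10, 20, 30, 40, 50],
    'D': ['a', 'b', 'c', 'd', 'e'],
})

# Select columns by name first, then summarize only those.
by_name = df[['A', 'C']].describe()

# Summarize everything except the numeric columns.
non_numeric = df.describe(exclude=['number'])

print(by_name)
print(non_numeric)
```

Here by_name contains only columns A and C, while non_numeric contains only column D with its count, unique, top, and freq rows.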
Handling Missing Data
If your DataFrame contains missing data, describe() will still work: it excludes NaN values when computing each column’s statistics. If you want to handle missing data explicitly before calling describe(), you can use methods like dropna() or fillna():
# Dropping rows with any NaN values
df.dropna().describe()
Here, dropna() removes all rows that contain any NaN values, and describe() generates descriptive statistics for the numeric columns using only the remaining rows.
# Filling NaN values with a specific value
df.fillna(0).describe()
Here, fillna(0) replaces all NaN values with 0, and describe() generates descriptive statistics for the numeric columns after the fill.
- dropna(): Useful when you want to remove any rows that contain missing values, resulting in a potentially smaller DataFrame but with no NaNs.
- fillna(0): Replaces NaNs with a specific value (0 in this case), retaining the original size of the DataFrame but changing the data distribution by filling in the missing values.
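To make the difference between these approaches concrete, here is a small sketch comparing the three summaries on a DataFrame with missing values. The data and the variable names as_is, dropped, and filled are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0, 5.0],
    'B': [5.0, np.nan, 7.0, 8.0, 9.0],
})

as_is = df.describe()               # NaN values skipped column by column
dropped = df.dropna().describe()    # only rows with no NaN in any column
filled = df.fillna(0).describe()    # NaN replaced by 0 before summarizing

# Compare the count and mean rows of each summary.
for name, summary in [('as is', as_is), ('dropna', dropped), ('fillna(0)', filled)]:
    print(name)
    print(summary.loc[['count', 'mean']])
```

In this example the count per column is 4 without preprocessing, 3 after dropna(), and 5 after fillna(0), and the mean of column A shifts from 3.0 to about 3.33 and 2.4 respectively, which shows how each choice changes the reported distribution.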
Conclusion
The describe() method in Pandas is a powerful tool for getting a quick statistical summary of your dataset. It’s an essential part of the data analysis workflow, helping you understand the distribution, central tendencies, and variability of your data. By customizing the method with parameters like include, you can tailor the output to better suit your needs, whether you’re dealing with numeric or non-numeric data.
By leveraging the describe() method, you can quickly gain insights into your data, which can inform further analysis and decision-making. Whether you’re a beginner or an experienced data scientist, understanding and using describe() effectively is a valuable skill in your data manipulation toolkit.