The var() method in Pandas calculates the variance of the values in a DataFrame. Variance is a statistical measure of how far the values in a data set spread out from their mean, so var() gives a quick summary of the variability in each column or row.
Syntax
DataFrame.var(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Parameters
- axis: {0 or 'index', 1 or 'columns'}, default 0
- Axis along which the variance is computed. The default, 0, returns one variance per column; 1 returns one variance per row.
- skipna: bool, default True
- Exclude NA/null values. If True, it excludes the missing values during calculation.
- level: int or level name, default None
- If the axis is a MultiIndex (hierarchical), compute the variance along a particular level, collapsing into a DataFrame. This parameter was deprecated in pandas 1.3 and removed in pandas 2.0.
- ddof: int, default 1
- Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
- numeric_only: bool, default None
- Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. In pandas 2.0 and later the default is False, so non-numeric columns are no longer dropped silently.
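As a rough illustration of numeric_only, the sketch below builds a small mixed-type DataFrame and restricts the calculation to its numeric columns. The frame and its column names (height, weight, label) are made up for this example.

```python
import pandas as pd

# Hypothetical mixed-type frame: two numeric columns and one string column.
mixed = pd.DataFrame({
    "height": [1.0, 2.0, 3.0],
    "weight": [10, 20, 30],
    "label": ["a", "b", "c"],
})

# Ask explicitly for numeric columns only, so "label" is left out of the result.
numeric_var = mixed.var(numeric_only=True)
print(numeric_var)
```

Passing numeric_only=True explicitly keeps the behavior the same regardless of which default the installed pandas version uses.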
Returns
- Series or DataFrame: If level is specified, returns a DataFrame; otherwise, returns a Series.
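Because the level argument is not accepted by pandas 2.0 and later, a per-level variance is usually obtained with groupby instead. The following is a minimal sketch under that assumption; the MultiIndex names (group, item) and the values are invented for illustration.

```python
import pandas as pd

# Hypothetical two-level index: an outer "group" level and an inner "item" level.
index = pd.MultiIndex.from_tuples(
    [("x", 1), ("x", 2), ("y", 1), ("y", 2)],
    names=["group", "item"],
)
multi = pd.DataFrame({"A": [1.0, 3.0, 10.0, 14.0]}, index=index)

# Variance of column A within each value of the "group" level (ddof=1 by default).
per_group = multi.groupby(level="group").var()
print(per_group)
```

The result is a DataFrame with one row per group, which matches the return type described above for a level-based calculation.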
Examples
Let’s dive into some examples to understand how to use the var() method.
Example 1: Basic Usage
Consider the following DataFrame:
import pandas as pd
data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [9, 10, 11, 12, 13]
}
df = pd.DataFrame(data)
print(df)
Output:
   A  B   C
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
4  5  9  13
To calculate the variance of each column:
variance = df.var()
print(variance)
Output:
A 2.5
B 2.5
C 2.5
dtype: float64
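In this example, the variance of each column (A, B, and C) is 2.5. The columns have different means but the same spacing between values, so their sample variance (the sum of squared deviations from the mean divided by N - 1) is identical. Note that variance is expressed in squared units of the data.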
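To see where the 2.5 comes from, the calculation for column A can be reproduced by hand. This is a minimal sketch of the sample variance formula with the default ddof=1; the variable names are chosen for this illustration only.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])  # same data as column A

mean = values.mean()                  # 3.0
squared_dev = (values - mean) ** 2    # 4, 1, 0, 1, 4
# Sample variance: sum of squared deviations divided by N - 1.
manual_variance = squared_dev.sum() / (len(values) - 1)

print(manual_variance)   # 2.5
print(values.var())      # 2.5, matches the manual result
```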
Example 2: Calculating Variance Along Rows
You can also calculate the variance along rows by setting the axis parameter to 1.
row_variance = df.var(axis=1)
print(row_variance)
Output:
0 16.0
1 16.0
2 16.0
3 16.0
4 16.0
dtype: float64
In this example, the variance of each row is 16.0. Every row holds three values spaced 4 apart (for example 1, 5, 9), so the squared deviations from the row mean are 16, 0, and 16, and their sum divided by N - 1 = 2 is 16.0.
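One way to check the row-wise result is to transpose the DataFrame and take the default column-wise variance. The sketch below rebuilds the same data and compares the two approaches.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [5, 6, 7, 8, 9],
    "C": [9, 10, 11, 12, 13],
})

# Variance across the columns of each row.
by_rows = df.var(axis=1)

# Same calculation expressed as a column-wise variance of the transpose.
by_transpose = df.T.var()

print(by_rows.tolist())
print(by_transpose.tolist())
```

Both calls should give 16.0 for every row; axis="columns" is accepted as an alias for axis=1.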
Example 3: Handling Missing Values
The skipna parameter can be used to exclude NA/null values. Consider the following DataFrame with missing values:
data_with_nan = {
    'A': [1, 2, None, 4, 5],
    'B': [5, None, 7, 8, 9],
    'C': [9, 10, 11, None, 13]
}
df_nan = pd.DataFrame(data_with_nan)
print(df_nan)
Output:
     A    B     C
0  1.0  5.0   9.0
1  2.0  NaN  10.0
2  NaN  7.0  11.0
3  4.0  8.0   NaN
4  5.0  9.0  13.0
By default, skipna=True, so NA/null values are excluded:
variance_nan = df_nan.var()
print(variance_nan)
Output:
A 3.333333
B 2.916667
C 2.916667
dtype: float64
If you set skipna=False, the method returns NaN for every column that contains a missing value:
variance_nan_include = df_nan.var(skipna=False)
print(variance_nan_include)
Output:
A NaN
B NaN
C NaN
dtype: float64
In the above examples, when missing values are excluded (skipna=True), the variance is approximately 3.33 for column A and approximately 2.92 for columns B and C, each computed from the four remaining values. If missing values are included (skipna=False), the variance cannot be calculated for any column that contains missing data, so all three columns return NaN.
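With skipna=True, each column is reduced using only its own non-missing values, so N differs from the row count. The sketch below makes that explicit by counting the valid values per column and recomputing column A after dropping its NaN; the variable names are illustrative.

```python
import pandas as pd

df_nan = pd.DataFrame({
    "A": [1, 2, None, 4, 5],
    "B": [5, None, 7, 8, 9],
    "C": [9, 10, 11, None, 13],
})

# Number of non-missing values per column: this is the N used when skipna=True.
valid_counts = df_nan.count()

# Dropping the NaN from column A by hand gives the same variance as skipna=True.
var_a_dropna = df_nan["A"].dropna().var()
var_a_skipna = df_nan.var()["A"]

print(valid_counts.tolist())
print(var_a_dropna, var_a_skipna)
```

Here every column has four valid values, so each variance is divided by 4 - 1 = 3 rather than 5 - 1 = 4.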
Example 4: Using the ddof Parameter
The ddof parameter sets the delta degrees of freedom used in the divisor N - ddof. By default, ddof=1, which gives the sample variance. Setting ddof=0 gives the population variance:
variance_ddof0 = df.var(ddof=0)
print(variance_ddof0)
Output:
A 2.0
B 2.0
C 2.0
dtype: float64
In this example, using ddof=0 changes the divisor to N instead of N - 1, resulting in a variance of 2.0 for each column. This is the population variance, which is never larger than the sample variance computed with ddof=1.
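The choice of ddof matters when comparing results across libraries, because NumPy's np.var defaults to ddof=0 while pandas defaults to ddof=1. The following sketch compares the two on the values of column A.

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4, 5]
series = pd.Series(values)

sample_var = series.var()             # pandas default, ddof=1 -> 2.5
population_var = series.var(ddof=0)   # divisor N -> 2.0

numpy_default = np.var(values)        # NumPy default, ddof=0 -> 2.0
numpy_sample = np.var(values, ddof=1) # matches the pandas default -> 2.5

print(sample_var, population_var, numpy_default, numpy_sample)
```

Passing ddof explicitly in both libraries avoids a silent mismatch when results are compared.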
Conclusion
The var() method in Pandas is a powerful tool for statistical analysis, allowing you to compute the variance of your data along a specified axis while handling missing values and providing flexibility with degrees of freedom. Understanding and utilizing this method effectively can help you gain insights into the variability of your datasets.