Pandas DataFrame var() Method – Explained with Examples

The var() method in Pandas calculates the variance of the values in a DataFrame along a chosen axis. Variance measures how far the values in a data set are spread out from their mean: it is the sum of the squared deviations from the mean, divided by N - 1 by default, so it is expressed in squared units of the data.
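As a minimal sketch of that definition, the sample variance can be worked out by hand and compared with what var() reports. The Series and variable names here are illustrative only.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])

# Sample variance: sum of squared deviations from the mean, divided by N - 1.
mean = values.mean()
squared_deviations = (values - mean) ** 2
manual_variance = squared_deviations.sum() / (len(values) - 1)

print(manual_variance)  # 2.5
print(values.var())     # matches, because var() divides by N - 1 by default
```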

Syntax
Python
DataFrame.var(axis=0, skipna=True, ddof=1, numeric_only=False, **kwargs)
Parameters
  • axis: {0 or 'index', 1 or 'columns'}, default 0. Axis along which the variance is computed. With 0, values are reduced down the rows, giving one variance per column; with 1, the result is one variance per row.
  • skipna: bool, default True. Exclude NA/null values. If True, missing values are ignored during the calculation; if False, any missing value makes the result NaN.
  • level: int or level name, default None. If the axis is a MultiIndex (hierarchical), compute the variance along a particular level, collapsing into a DataFrame. This parameter was deprecated in pandas 1.3 and removed in pandas 2.0, so it does not appear in the signature above; on current versions, use groupby instead.
  • ddof: int, default 1. Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N is the number of elements.
  • numeric_only: bool, default False. Include only float, int, and boolean columns. Before pandas 2.0 the default was None, which tried every column and then fell back to numeric data only.
Returns
  • Series: one variance per column (axis=0) or per row (axis=1). In pandas versions before 2.0, specifying level returns a DataFrame instead.
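To illustrate numeric_only, here is a small sketch with a made-up DataFrame that mixes a text column with numeric ones. The column names and values are hypothetical. Passing numeric_only=True explicitly keeps the call independent of the pandas version's default.

```python
import pandas as pd

# Hypothetical mixed-type frame: 'name' is text, the other columns are numeric.
mixed = pd.DataFrame({
    'name': ['a', 'b', 'c'],
    'score': [10, 20, 30],
    'weight': [1.5, 2.5, 3.5],
})

# numeric_only=True restricts the calculation to the numeric columns,
# so 'name' is left out of the result.
numeric_variance = mixed.var(numeric_only=True)
print(numeric_variance)
```

The result is a Series indexed by 'score' and 'weight' only.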

Examples

Let’s dive into some examples to understand how to use the var() method.

Example 1: Basic Usage

Consider the following DataFrame:

Python
import pandas as pd

data = {
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [9, 10, 11, 12, 13]
}

df = pd.DataFrame(data)
print(df)

Output:

   A  B   C
0  1  5   9
1  2  6  10
2  3  7  11
3  4  8  12
4  5  9  13

To calculate the variance of each column:

Python
variance = df.var()
print(variance)

Output:

A    2.5
B    2.5
C    2.5
dtype: float64

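In this example, the variance of each column (A, B, and C) is 2.5. Each column holds five consecutive integers, so all three have the same spread around their mean even though the means differ. The value is in squared units: the squared deviations from the mean sum to 10, and dividing by N - 1 = 4 gives 2.5.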
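Because variance is in squared units, it is often read alongside the standard deviation, which is its square root. The following sketch rebuilds the same DataFrame and checks that relationship with std(). Using NumPy for the comparison is a convenience choice here, not something var() requires.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [9, 10, 11, 12, 13],
})

# std() uses the same default ddof=1, so squaring it recovers var().
std = df.std()
print(std)
print(np.allclose(std ** 2, df.var()))  # True
```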

Example 2: Calculating Variance Along Rows

You can also calculate the variance along rows by setting the axis parameter to 1.

Python
row_variance = df.var(axis=1)
print(row_variance)

Output:

0    16.0
1    16.0
2    16.0
3    16.0
4    16.0
dtype: float64

In this example, the variance of each row is 16.0. Every row holds three values spaced 4 apart (for example 1, 5, 9 in the first row), so the squared deviations from the row mean sum to 32, and dividing by N - 1 = 2 gives 16.0.
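The sketch below rebuilds the same DataFrame, shows that axis='columns' is an alias for axis=1, and checks the first row's variance by hand. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [9, 10, 11, 12, 13],
})

# axis=1 and axis='columns' name the same axis.
by_number = df.var(axis=1)
by_name = df.var(axis='columns')
print(by_number.equals(by_name))  # True

# Manual check for the first row (values 1, 5, 9).
first_row = df.iloc[0]
manual = ((first_row - first_row.mean()) ** 2).sum() / (len(first_row) - 1)
print(manual)  # 16.0
```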

Example 3: Handling Missing Values

The skipna parameter can be used to exclude NA/null values. Consider the following DataFrame with missing values:

Python
data_with_nan = {
    'A': [1, 2, None, 4, 5],
    'B': [5, None, 7, 8, 9],
    'C': [9, 10, 11, None, 13]
}

df_nan = pd.DataFrame(data_with_nan)
print(df_nan)

Output:

     A    B     C
0  1.0  5.0   9.0
1  2.0  NaN  10.0
2  NaN  7.0  11.0
3  4.0  8.0   NaN
4  5.0  9.0  13.0

By default, skipna=True, so NA/null values are excluded:

Python
variance_nan = df_nan.var()
print(variance_nan)

Output:

A    3.333333
B    2.916667
C    2.916667
dtype: float64

If you set skipna=False, the method returns NaN for every column that contains a missing value:

Python
variance_nan_include = df_nan.var(skipna=False)
print(variance_nan_include)

Output:

A   NaN
B   NaN
C   NaN
dtype: float64

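In the above examples, when missing values are excluded (skipna=True), the variance for column A is approximately 3.33, and for columns B and C it is approximately 2.92. Each column is computed from its four remaining values, so the divisor is 4 - 1 = 3. If missing values are included (skipna=False), the variance cannot be calculated for any column that contains missing data, and because every column here has one missing value, all three results are NaN.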
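One way to see what skipna=True does is to count the non-missing values per column and compare against dropping the missing values explicitly. This is a rough sketch of the equivalence for a single column, not a description of how pandas implements it internally.

```python
import pandas as pd

df_nan = pd.DataFrame({
    'A': [1, 2, None, 4, 5],
    'B': [5, None, 7, 8, 9],
    'C': [9, 10, 11, None, 13],
})

# Number of non-missing values that each column's variance is based on.
valid_counts = df_nan.count()
print(valid_counts)

# Dropping the missing value first gives the same result as skipna=True.
variance_after_dropna = df_nan['A'].dropna().var()
print(variance_after_dropna)
print(df_nan['A'].var())
```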

Example 4: Using ddof Parameter

The ddof parameter sets the delta degrees of freedom: the divisor used in the calculation is N - ddof. The default, ddof=1, gives the sample variance; ddof=0 gives the population variance.

Python
variance_ddof0 = df.var(ddof=0)
print(variance_ddof0)

Output:

A    2.0
B    2.0
C    2.0
dtype: float64

In this example, using ddof=0 changes the divisor from N - 1 to N. The squared deviations in each column still sum to 10, but they are now divided by 5 instead of 4, giving a population variance of 2.0 rather than the sample variance of 2.5.
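The choice of ddof matters when comparing results across libraries: NumPy's np.var defaults to ddof=0, while the Pandas var() method defaults to ddof=1. A short sketch of the difference, rebuilding the same DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [5, 6, 7, 8, 9],
    'C': [9, 10, 11, 12, 13],
})

sample = df.var()            # ddof=1, divisor N - 1
population = df.var(ddof=0)  # divisor N

# np.var on a plain array uses ddof=0 unless told otherwise.
numpy_default = np.var(df['A'].to_numpy())

print(sample['A'])      # 2.5
print(population['A'])  # 2.0
print(numpy_default)    # 2.0
```

Passing ddof explicitly in both libraries avoids this mismatch.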

Conclusion

The var() method in Pandas is a powerful tool for statistical analysis, allowing you to compute the variance of your data along a specified axis while handling missing values and providing flexibility with degrees of freedom. Understanding and utilizing this method effectively can help you gain insights into the variability of your datasets.
