Pandas dataframe groupby() Method – Explained

Pandas is a powerful data manipulation library for Python, widely used in data analysis and data science. One of its most useful features is the groupby() method, which allows you to split your data into groups, apply operations to these groups, and then combine the results. This method is integral for data aggregation and summarization tasks. In this blog, we’ll explore the groupby() method in detail.

1. Introduction to groupby()

The groupby() method is used to split data into groups based on some criteria. It’s similar to the SQL GROUP BY statement. Once the data is split, you can perform various operations on each group independently.


2. Basic Usage

Let’s start with a simple example. Consider a DataFrame with sales data:

Python
import pandas as pd

# Sample data
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 220, 130, 210]
}

df = pd.DataFrame(data)

We can use groupby() to group this data by the ‘Category‘ column and then calculate the sum of sales for each category:

Python
grouped = df.groupby('Category').sum()
print(grouped)

Output:

Markdown
          Sales
Category       
A            380
B            630

3. Aggregation Functions

The groupby() method is often used with aggregation functions like sum(), mean(), count(), etc. Here are a few examples:

  • Sum: df.groupby(‘Category’).sum()
  • Mean: df.groupby(‘Category’).mean()
  • Count: df.groupby(‘Category’).count()
Python
mean_sales = df.groupby('Category').mean()
count_sales = df.groupby('Category').count()
print(mean_sales)
print(count_sales)

Output:

Markdown
               Sales
Category            
A         126.666667
B         210.000000

          Sales
Category       
A             3
B             3

4. Grouping by Multiple Columns

You can also group by multiple columns. Let’s extend our example:

Python
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'Sales': [100, 200, 150, 220, 130, 210]
}

df = pd.DataFrame(data)
grouped = df.groupby(['Category', 'Region']).sum()
print(grouped)

Output:

Markdown
                    Sales
Category Region       
A       North      230
        South      150
B       North      200
        South      430

5. Applying Multiple Aggregations

Sometimes, you might want to apply multiple aggregation functions simultaneously. You can achieve this using the agg() method:

Python
aggregated = df.groupby('Category').agg({'Sales': ['sum', 'mean', 'count']})
print(aggregated)

Output:

Markdown
         Sales              
           sum   mean  count
Category                    
A          380  126.67     3
B          630  210.00     3

6. Transformations with groupby()

In addition to aggregations, you can also perform transformations on your grouped data. For example, you can normalize your sales data by subtracting the mean sales of each group:

Python
df['Normalized_Sales'] = df.groupby('Category')['Sales'].transform(lambda x: x - x.mean())
print(df)

Output:

Markdown
  Category Region  Sales  Normalized_Sales
0        A  North    100            -26.67
1        B  North    200            -10.00
2        A  South    150             23.33
3        B  South    220             10.00
4        A  North    130              3.33
5        B  South    210              0.00

7. Filtering Groups

You can filter groups based on certain criteria. For instance, if you only want groups with a total sales greater than 300:

Python
filtered = df.groupby('Category').filter(lambda x: x['Sales'].sum() > 300)
print(filtered)

Output:

Markdown
  Category Region  Sales  Normalized_Sales
0        A  North    100            -26.67
2        A  South    150             23.33
4        A  North    130              3.33
1        B  North    200            -10.00
3        B  South    220             10.00
5        B  South    210              0.00

8. Practical Examples

To conclude, let’s look at a practical example where we use multiple features of groupby():

Python
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'North', 'South'],
    'Sales': [100, 200, 150, 220, 130, 210],
    'Discount': [10, 20, 15, 25, 10, 20]
}

df = pd.DataFrame(data)
result = df.groupby(['Category', 'Region']).agg(
    Total_Sales=('Sales', 'sum'),
    Average_Discount=('Discount', 'mean'),
    Sales_Count=('Sales', 'count')
).reset_index()

print(result)

Output:

Markdown
  Category Region  Total_Sales  Average_Discount  Sales_Count
0        A  North          230              10.0            2
1        A  South          150              15.0            1
2        B  North          200              20.0            1
3        B  South          430              22.5            2

In this example, we grouped the data by ‘Category‘ and ‘Region‘, calculated the total sales, average discount, and count of sales for each group.


Conclusion

The groupby() method in Pandas is a powerful tool for data analysis, enabling you to perform complex operations with ease. By mastering this method, you can efficiently aggregate, transform, and filter your data, making it an indispensable part of your data science toolkit.

Experiment with different datasets and aggregation functions to gain a deeper understanding of how groupby() can be utilized in various scenarios. Happy data analyzing!

Explore Also:

Leave a Comment