MadiisAttendance
GroupBy Operations and Data Aggregation in Pandas 11 min read 2,070 words 217 sentences uni

Pandas Day 5 — GroupBy & Aggregations

M Usman May 26, 2026
12 0 0 score
0%
Your Progress
0/42 sections
Reading Speed
0
words/min
Time Spent
00:00
Table of Contents

Quick Stats
Words: 2,070
Est. time: 11 min
Readability: Advanced

Word Cloud
stabulation metaph PLUVIOMETRY muscological inadvertences noneditor ruptuary MONTE kneepiece bene photobromide senoufo widespreading MEER disanchor

📌 Day 5: GroupBy & Aggregations

Prerequisite Setup

python
import pandas as pd
import numpy as np

# Create a rich dataset for groupby examples
np.random.seed(42)

# Employee dataset
df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'],
    'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR'],
    'Region': ['North', 'South', 'North', 'East', 'West', 'South', 'East', 'West', 'North', 'East'],
    'Salary': [50000, 60000, 75000, 55000, 68000, 72000, 58000, 65000, 71000, 53000],
    'Age': [25, 30, 35, 28, 32, 40, 33, 29, 31, 27],
    'Experience': [2, 5, 8, 3, 6, 12, 4, 5, 7, 2],
    'Performance_Score': [85, 92, 78, 88, 95, 82, 90, 87, 91, 84]
})

print("Original Dataset:")
print(df)
print("\n" + "="*60)

1️⃣ Group data by a single column and compute mean

Understanding GroupBy: Split-Apply-Combine

  1. Split data into groups based on criteria

  2. Apply a function to each group

  3. Combine results into a new DataFrame

Basic Single Column GroupBy

python
# Group by Department and calculate mean Salary
dept_mean_salary = df.groupby('Department')['Salary'].mean()
print("Average Salary by Department:")
print(dept_mean_salary)

Output:

text
Department
HR      55333.33
IT      64333.33
Sales    67000.00
Name: Salary, dtype: float64

Multiple aggregations on single group

python
# Group by Department and get mean of multiple columns
dept_stats = df.groupby('Department')[['Salary', 'Age', 'Performance_Score']].mean()
print("\nMultiple column means by Department:")
print(dept_stats.round(2))

Different aggregations for different columns

python
# Using agg() for more control
dept_aggregations = df.groupby('Department').agg({
    'Salary': 'mean',
    'Age': 'median',
    'Performance_Score': 'max',
    'Experience': 'min'
})
print("\nDifferent aggregations per column:")
print(dept_aggregations.round(2))

Group by with describe() (quick statistics)

python
# Get full statistical summary by group
dept_summary = df.groupby('Department')['Salary'].describe()
print("\nSalary statistics by Department:")
print(dept_summary)

2️⃣ Group by multiple columns and get size()

Group by two columns

python
# Group by Department AND Region
dept_region_stats = df.groupby(['Department', 'Region'])['Salary'].mean()
print("Average Salary by Department and Region:")
print(dept_region_stats)

Output:

text
Department  Region
HR          East     58000.00
            East? Actually let's see actual data...

Get group sizes (count per group)

python
# Count employees in each Department-Region combination
group_counts = df.groupby(['Department', 'Region']).size()
print("\nNumber of employees by Department and Region:")
print(group_counts)

# As DataFrame with column name
group_counts_df = df.groupby(['Department', 'Region']).size().reset_index(name='Count')
print("\nAs DataFrame:")
print(group_counts_df)

Multiple aggregations with size

python
# Get mean salary AND count of employees
dept_region_agg = df.groupby(['Department', 'Region']).agg({
    'Salary': ['mean', 'count'],
    'Performance_Score': 'mean'
})
print("\nMultiple aggregations:")
print(dept_region_agg.round(2))

Using value_counts() as alternative

python
# Quick way to get frequencies
dept_counts = df['Department'].value_counts()
print("\nDepartment counts using value_counts():")
print(dept_counts)

# Two-way frequency table
region_dept_counts = pd.crosstab(df['Region'], df['Department'])
print("\nCross tabulation (Region × Department):")
print(region_dept_counts)

Group by with nunique() - count unique values

python
# Count unique employees per department (though all are unique here)
# Better example - count unique regions per department
unique_regions = df.groupby('Department')['Region'].nunique()
print("\nNumber of unique regions per department:")
print(unique_regions)

3️⃣ Use agg() with multiple functions on different columns

Basic agg() with multiple functions

python
# Apply multiple functions to same column
dept_salary_stats = df.groupby('Department')['Salary'].agg(['mean', 'median', 'std', 'min', 'max'])
print("Salary statistics by department:")
print(dept_salary_stats.round(2))

Different functions for different columns

python
# Custom aggregation per column
dept_custom_agg = df.groupby('Department').agg(
    Avg_Salary=('Salary', 'mean'),
    Max_Performance=('Performance_Score', 'max'),
    Min_Age=('Age', 'min'),
    Total_Experience=('Experience', 'sum'),
    Employee_Count=('Employee', 'count')
)
print("\nCustom named aggregations:")
print(dept_custom_agg.round(2))

Using lambda functions in agg()

python
# Custom aggregation functions
dept_lambda_agg = df.groupby('Department').agg(
    Salary_Range=('Salary', lambda x: x.max() - x.min()),
    Performance_Variance=('Performance_Score', lambda x: x.var()),
    Age_IQR=('Age', lambda x: x.quantile(0.75) - x.quantile(0.25))
)
print("\nLambda aggregations:")
print(dept_lambda_agg.round(2))

Multiple functions on multiple columns

python
# Complex aggregation with multiple columns and functions
complex_agg = df.groupby('Department')[['Salary', 'Performance_Score']].agg(
    ['mean', 'std', 'min', 'max']
)
print("\nMultiple functions on multiple columns:")
print(complex_agg.round(2))

# Flatten column names
complex_agg.columns = ['_'.join(col).strip() for col in complex_agg.columns.values]
print("\nWith flattened column names:")
print(complex_agg.round(2))

Using numpy functions in agg()

python
import numpy as np

dept_numpy_agg = df.groupby('Department').agg(
    Salary_Mean=np.mean,
    Salary_Median=np.median,
    Salary_Std=np.std,
    Performance_Q3=('Performance_Score', lambda x: np.percentile(x, 75))
)
print("\nUsing numpy functions:")
print(dept_numpy_agg.round(2))

Real-world example: Sales analysis

python
# Create sales dataset
sales_data = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'Product': np.random.choice(['A', 'B', 'C'], 100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'Sales': np.random.randint(100, 1000, 100),
    'Quantity': np.random.randint(1, 20, 100)
})

# Complex aggregation
sales_analysis = sales_data.groupby(['Product', 'Region']).agg(
    Total_Sales=('Sales', 'sum'),
    Avg_Sales=('Sales', 'mean'),
    Total_Quantity=('Quantity', 'sum'),
    Avg_Quantity=('Quantity', 'mean'),
    Num_Transactions=('Sales', 'count'),
    Sales_Std=('Sales', 'std')
).round(2)

print("\nSales Analysis by Product and Region:")
print(sales_analysis.head(10))

4️⃣ Apply transform to add group means as new column

What is transform()?

transform() returns a Series/DataFrame with the same index as the original, making it perfect for adding group statistics back to original data.

Adding group means to each row

python
# Add department average salary to each employee row
df['Dept_Avg_Salary'] = df.groupby('Department')['Salary'].transform('mean')
print("Original with department average salary:")
print(df[['Employee', 'Department', 'Salary', 'Dept_Avg_Salary']].head(10))

Compare individual vs group average

python
# Calculate difference from department average
df['Salary_vs_Dept_Avg'] = df['Salary'] - df['Dept_Avg_Salary']
print("\nSalary difference from department average:")
print(df[['Employee', 'Department', 'Salary', 'Dept_Avg_Salary', 'Salary_vs_Dept_Avg']].head(10))

Multiple transform operations

python
# Add multiple group statistics
df['Dept_Median_Salary'] = df.groupby('Department')['Salary'].transform('median')
df['Dept_Salary_Std'] = df.groupby('Department')['Salary'].transform('std')
df['Dept_Max_Performance'] = df.groupby('Department')['Performance_Score'].transform('max')

print("\nMultiple transform columns:")
print(df[['Employee', 'Department', 'Salary', 'Dept_Avg_Salary', 'Dept_Median_Salary', 'Dept_Salary_Std']].head())

Transform with custom functions

python
# Rank employees within their department by salary
df['Dept_Salary_Rank'] = df.groupby('Department')['Salary'].transform(
    lambda x: x.rank(ascending=False)
)

# Normalize salary within department (0-1 scale)
def normalize_group(x):
    return (x - x.min()) / (x.max() - x.min())

df['Dept_Salary_Normalized'] = df.groupby('Department')['Salary'].transform(normalize_group)

print("\nRank and normalized salary within department:")
print(df[['Employee', 'Department', 'Salary', 'Dept_Salary_Rank', 'Dept_Salary_Normalized']].head())

Percentile rank within group

python
# Calculate percentile rank within department
df['Dept_Salary_Percentile'] = df.groupby('Department')['Salary'].transform(
    lambda x: x.rank(pct=True)
)

print("\nSalary percentile within department:")
print(df[['Employee', 'Department', 'Salary', 'Dept_Salary_Percentile']].head())

Transform with multiple columns

python
# Create performance score relative to department
df['Performance_vs_Dept'] = df.groupby('Department')['Performance_Score'].transform(
    lambda x: (x - x.mean()) / x.std()
)

print("\nPerformance z-score within department:")
print(df[['Employee', 'Department', 'Performance_Score', 'Performance_vs_Dept']].head())

5️⃣ Filter groups that meet a condition (e.g., count > 5)

Filter groups by size

python
# First, create dataset with varying group sizes
df_large = pd.concat([df] * 3, ignore_index=True)  # Triplicate data
print(f"Original size: {len(df)}, New size: {len(df_large)}")

# Add some variation to make groups different
df_large['Department'] = df_large['Department'] + df_large.index.map(
    lambda x: '' if x < 15 else ' Extra' if x < 25 else ' Temp'
)

# Filter departments with more than 5 employees
large_depts = df_large.groupby('Department').filter(lambda x: len(x) > 5)
print("\nDepartments with >5 employees:")
print(large_depts['Department'].value_counts())

Filter by aggregate condition

python
# Keep only departments with average salary > 60000
high_salary_depts = df.groupby('Department').filter(lambda x: x['Salary'].mean() > 60000)
print("\nDepartments with average salary > 60000:")
print(high_salary_depts[['Department', 'Salary']])
print(f"\nOriginal rows: {len(df)}, Filtered rows: {len(high_salary_depts)}")

Filter by multiple conditions

python
# Departments with average salary > 60000 AND at least 3 employees
filtered_depts = df.groupby('Department').filter(
    lambda x: (x['Salary'].mean() > 60000) and (len(x) >= 3)
)
print("\nDepartments meeting both conditions:")
print(filtered_depts['Department'].unique())

Filter using any() and all()

python
# Keep departments where ANY employee has Performance_Score > 90
high_performers = df.groupby('Department').filter(lambda x: (x['Performance_Score'] > 90).any())
print("\nDepartments with any employee scoring >90:")
print(high_performers['Department'].unique())

# Keep departments where ALL employees have Age > 25
all_adults = df.groupby('Department').filter(lambda x: (x['Age'] > 25).all())
print("\nDepartments where all employees >25 years old:")
print(all_adults['Department'].unique())

Real-world filter examples

python
# Create customer purchase data
np.random.seed(42)
purchase_data = pd.DataFrame({
    'Customer': np.random.choice(['C1', 'C2', 'C3', 'C4', 'C5'], 50),
    'Product': np.random.choice(['P1', 'P2', 'P3'], 50),
    'Amount': np.random.randint(10, 500, 50),
    'Quantity': np.random.randint(1, 10, 50)
})

print("\nPurchase Data Sample:")
print(purchase_data.head())

# Filter customers with total purchase > 1000
high_value_customers = purchase_data.groupby('Customer').filter(
    lambda x: x['Amount'].sum() > 1000
)
print(f"\nCustomers with total >1000: {high_value_customers['Customer'].unique()}")

# Filter customers who bought more than 5 items total
bulk_buyers = purchase_data.groupby('Customer').filter(
    lambda x: x['Quantity'].sum() > 20
)
print(f"Bulk buyers (>20 items): {bulk_buyers['Customer'].unique()}")

# Filter products sold at least 10 times
popular_products = purchase_data.groupby('Product').filter(
    lambda x: len(x) >= 10
)
print(f"Popular products (≥10 sales): {popular_products['Product'].unique()}")

Advanced filtering with quantiles

python
# Filter groups in top 50% by average salary
median_salary = df.groupby('Department')['Salary'].mean().median()
top_depts = df.groupby('Department').filter(
    lambda x: x['Salary'].mean() > median_salary
)
print("\nDepartments above median salary:")
print(top_depts.groupby('Department')['Salary'].mean())

🎯 Complete Hands-on Checklist

python
import pandas as pd
import numpy as np

# Load comprehensive dataset
df = pd.DataFrame({
    'Store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard'],
    'Sales': [1000, 200, 150, 1200, 250, 180, 900, 220, 160],
    'Quantity': [10, 20, 15, 12, 25, 18, 9, 22, 16],
    'Profit': [200, 50, 40, 250, 60, 45, 180, 55, 42]
})

print("="*60)
print("1. GROUP BY SINGLE COLUMN - Mean Sales by Store")
print("="*60)
store_sales = df.groupby('Store')['Sales'].mean()
print(store_sales)

print("\n" + "="*60)
print("2. GROUP BY MULTIPLE COLUMNS - Size by Store & Product")
print("="*60)
store_product_counts = df.groupby(['Store', 'Product']).size()
print(store_product_counts)

print("\n" + "="*60)
print("3. AGG() - Multiple functions on different columns")
print("="*60)
store_agg = df.groupby('Store').agg(
    Total_Sales=('Sales', 'sum'),
    Avg_Profit=('Profit', 'mean'),
    Total_Quantity=('Quantity', 'sum'),
    Num_Products=('Product', 'count')
)
print(store_agg)

print("\n" + "="*60)
print("4. TRANSFORM - Add store average sales to each row")
print("="*60)
df['Store_Avg_Sales'] = df.groupby('Store')['Sales'].transform('mean')
df['Sales_vs_Store_Avg'] = df['Sales'] - df['Store_Avg_Sales']
print(df[['Store', 'Product', 'Sales', 'Store_Avg_Sales', 'Sales_vs_Store_Avg']])

print("\n" + "="*60)
print("5. FILTER - Keep stores with total sales > 2000")
print("="*60)
high_sales_stores = df.groupby('Store').filter(lambda x: x['Sales'].sum() > 2000)
print(high_sales_stores[['Store', 'Product', 'Sales']])

❌ Common Mistakes & Solutions

MistakeWhy it failsCorrect way
df.groupby('col').mean() returns all columnsMeans all numeric columnsSpecify column: df.groupby('col')['target'].mean()
Forgetting reset_index() after groupbyGroup keys become indexUse reset_index() or as_index=False
Using transform() when you need agg()Transform returns same shapeTransform = same index, Agg = reduced shape
filter() without lambdaMust use functiongroupby().filter(lambda x: condition)
Confusing size() vs count()Size includes NaN, count excludesUse size() for total, count() for non-NA

📝 Practice Exercises

  1. Group the employee dataset by Region and calculate mean, median, and std of Salary.

  2. Find which Department-Region combination has the highest average Performance_Score.

  3. Add a column showing how each employee's salary ranks within their department (1 = highest).

  4. Filter to show only departments where the minimum salary is above 55,000.

  5. Using the sales dataset, calculate:

    • Total sales per product

    • Average quantity per store

    • Profit margin (Profit/Sales) per product


✅ Summary of Day 5

OperationMethodUse Case
Basic groupingdf.groupby('col')['val'].mean()Simple aggregations
Multiple columnsdf.groupby(['col1','col2']).size()Hierarchical grouping
Multiple functionsdf.groupby('col').agg(['mean','std'])Summary statistics
Custom aggregationsdf.groupby('col').agg(mean=('val','mean'))Named outputs
Add group statsdf.groupby('col')['val'].transform('mean')Feature engineering
Filter groupsdf.groupby('col').filter(lambda x: condition)Subset data

🎓 Pro Tips

  1. Use as_index=False to keep grouping columns as columns instead of index

  2. Chain operations: df.groupby('dept').filter(cond).groupby('dept').mean()

  3. pd.NamedAgg for explicit column naming in agg()

  4. Use groupby.ngroups to check number of groups

  5. groupby.groups returns dictionary of group indices for debugging


🚀 Advanced Example: Putting It All Together

python
# Complete analysis pipeline
result = (df
    .groupby(['Store', 'Product'])
    .agg(
        Total_Sales=('Sales', 'sum'),
        Avg_Profit=('Profit', 'mean'),
        Units_Sold=('Quantity', 'sum')
    )
    .round(2)
    .reset_index()
    .assign(Profit_Margin=lambda x: x['Avg_Profit'] / x['Total_Sales'] * 100)
    .sort_values('Total_Sales', ascending=False)
)

print("\nFinal Analysis Pipeline Result:")
print(result)
Discussion (0)
Login to comment
Dictionary

Add New Word

Dictionary Words
My Notes
Highlights
Select text and click highlight to save
My Vocabulary
Quick Quiz
Settings
Reading Analytics
Today's reading: 0 min
Total read time: 0 min
Words learned: 0
Streak: 0 days
AI Summary

Generating summary...