Pandas Day 5: GroupBy, agg(), transform() & Group Filtering

Table of Contents

📌 Day 5: GroupBy & Aggregations H2

Prerequisite Setup H3

1️⃣ Group data by a single column and comput… H2

Understanding GroupBy: Split-Apply-Combine H3

Basic Single Column GroupBy H3

Multiple aggregations on single group H3

Different aggregations for different columns H3

Group by with describe() (quick statistics) H3

2️⃣ Group by multiple columns and get size() H2

Group by two columns H3

Get group sizes (count per group) H3

Multiple aggregations with size H3

Using value_counts() as alternative H3

Group by with nunique() - count unique values H3

3️⃣ Use agg() with multiple functions on dif… H2

Basic agg() with multiple functions H3

Different functions for different columns H3

Using lambda functions in agg() H3

Multiple functions on multiple columns H3

Using numpy functions in agg() H3

Real-world example: Sales analysis H3

4️⃣ Apply transform to add group means as ne… H2

What is transform()? H3

Adding group means to each row H3

Compare individual vs group average H3

Multiple transform operations H3

Transform with custom functions H3

Percentile rank within group H3

Transform with multiple columns H3

5️⃣ Filter groups that meet a condition (e.g… H2

Filter groups by size H3

Filter by aggregate condition H3

Filter by multiple conditions H3

Filter using any() and all() H3

Real-world filter examples H3

Advanced filtering with quantiles H3

🎯 Complete Hands-on Checklist H2

❌ Common Mistakes & Solutions H2

📝 Practice Exercises H2

✅ Summary of Day 5 H2

🎓 Pro Tips H2

🚀 Advanced Example: Putting It All Together H2

Quick Stats

Words: 2,070

Est. time: 11 min

Readability: Advanced

Word Cloud

stabulation metaph PLUVIOMETRY muscological inadvertences noneditor ruptuary MONTE kneepiece bene photobromide senoufo widespreading MEER disanchor

📌 Day 5: GroupBy & Aggregations

Prerequisite Setup

import pandas as pd
import numpy as np

# Create a rich dataset for groupby examples
np.random.seed(42)

# Employee dataset
df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'],
    'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR'],
    'Region': ['North', 'South', 'North', 'East', 'West', 'South', 'East', 'West', 'North', 'East'],
    'Salary': [50000, 60000, 75000, 55000, 68000, 72000, 58000, 65000, 71000, 53000],
    'Age': [25, 30, 35, 28, 32, 40, 33, 29, 31, 27],
    'Experience': [2, 5, 8, 3, 6, 12, 4, 5, 7, 2],
    'Performance_Score': [85, 92, 78, 88, 95, 82, 90, 87, 91, 84]
})

print("Original Dataset:")
print(df)
print("\n" + "="*60)

1️⃣ Group data by a single column and compute mean

Understanding GroupBy: Split-Apply-Combine

Split data into groups based on criteria
Apply a function to each group
Combine results into a new DataFrame

Basic Single Column GroupBy

# Group by Department and calculate mean Salary
dept_mean_salary = df.groupby('Department')['Salary'].mean()
print("Average Salary by Department:")
print(dept_mean_salary)

Output:

Department
HR      55333.33
IT      64333.33
Sales    67000.00
Name: Salary, dtype: float64

Multiple aggregations on single group

# Group by Department and get mean of multiple columns
dept_stats = df.groupby('Department')[['Salary', 'Age', 'Performance_Score']].mean()
print("\nMultiple column means by Department:")
print(dept_stats.round(2))

Different aggregations for different columns

# Using agg() for more control
dept_aggregations = df.groupby('Department').agg({
    'Salary': 'mean',
    'Age': 'median',
    'Performance_Score': 'max',
    'Experience': 'min'
})
print("\nDifferent aggregations per column:")
print(dept_aggregations.round(2))

Group by with describe() (quick statistics)

# Get full statistical summary by group
dept_summary = df.groupby('Department')['Salary'].describe()
print("\nSalary statistics by Department:")
print(dept_summary)

2️⃣ Group by multiple columns and get size()

Group by two columns

# Group by Department AND Region
dept_region_stats = df.groupby(['Department', 'Region'])['Salary'].mean()
print("Average Salary by Department and Region:")
print(dept_region_stats)

Output:

Department  Region
HR          East     58000.00
            East? Actually let's see actual data...

Get group sizes (count per group)

# Count employees in each Department-Region combination
group_counts = df.groupby(['Department', 'Region']).size()
print("\nNumber of employees by Department and Region:")
print(group_counts)

# As DataFrame with column name
group_counts_df = df.groupby(['Department', 'Region']).size().reset_index(name='Count')
print("\nAs DataFrame:")
print(group_counts_df)

Multiple aggregations with size

# Get mean salary AND count of employees
dept_region_agg = df.groupby(['Department', 'Region']).agg({
    'Salary': ['mean', 'count'],
    'Performance_Score': 'mean'
})
print("\nMultiple aggregations:")
print(dept_region_agg.round(2))

Using value_counts() as alternative

# Quick way to get frequencies
dept_counts = df['Department'].value_counts()
print("\nDepartment counts using value_counts():")
print(dept_counts)

# Two-way frequency table
region_dept_counts = pd.crosstab(df['Region'], df['Department'])
print("\nCross tabulation (Region × Department):")
print(region_dept_counts)

Group by with nunique() - count unique values

# Count unique employees per department (though all are unique here)
# Better example - count unique regions per department
unique_regions = df.groupby('Department')['Region'].nunique()
print("\nNumber of unique regions per department:")
print(unique_regions)

3️⃣ Use agg() with multiple functions on different columns

Basic agg() with multiple functions

# Apply multiple functions to same column
dept_salary_stats = df.groupby('Department')['Salary'].agg(['mean', 'median', 'std', 'min', 'max'])
print("Salary statistics by department:")
print(dept_salary_stats.round(2))

Different functions for different columns

# Custom aggregation per column
dept_custom_agg = df.groupby('Department').agg(
    Avg_Salary=('Salary', 'mean'),
    Max_Performance=('Performance_Score', 'max'),
    Min_Age=('Age', 'min'),
    Total_Experience=('Experience', 'sum'),
    Employee_Count=('Employee', 'count')
)
print("\nCustom named aggregations:")
print(dept_custom_agg.round(2))

Using lambda functions in agg()

# Custom aggregation functions
dept_lambda_agg = df.groupby('Department').agg(
    Salary_Range=('Salary', lambda x: x.max() - x.min()),
    Performance_Variance=('Performance_Score', lambda x: x.var()),
    Age_IQR=('Age', lambda x: x.quantile(0.75) - x.quantile(0.25))
)
print("\nLambda aggregations:")
print(dept_lambda_agg.round(2))

Multiple functions on multiple columns

# Complex aggregation with multiple columns and functions
complex_agg = df.groupby('Department')[['Salary', 'Performance_Score']].agg(
    ['mean', 'std', 'min', 'max']
)
print("\nMultiple functions on multiple columns:")
print(complex_agg.round(2))

# Flatten column names
complex_agg.columns = ['_'.join(col).strip() for col in complex_agg.columns.values]
print("\nWith flattened column names:")
print(complex_agg.round(2))

Using numpy functions in agg()

import numpy as np

dept_numpy_agg = df.groupby('Department').agg(
    Salary_Mean=np.mean,
    Salary_Median=np.median,
    Salary_Std=np.std,
    Performance_Q3=('Performance_Score', lambda x: np.percentile(x, 75))
)
print("\nUsing numpy functions:")
print(dept_numpy_agg.round(2))

Real-world example: Sales analysis

# Create sales dataset
sales_data = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=100, freq='D'),
    'Product': np.random.choice(['A', 'B', 'C'], 100),
    'Region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'Sales': np.random.randint(100, 1000, 100),
    'Quantity': np.random.randint(1, 20, 100)
})

# Complex aggregation
sales_analysis = sales_data.groupby(['Product', 'Region']).agg(
    Total_Sales=('Sales', 'sum'),
    Avg_Sales=('Sales', 'mean'),
    Total_Quantity=('Quantity', 'sum'),
    Avg_Quantity=('Quantity', 'mean'),
    Num_Transactions=('Sales', 'count'),
    Sales_Std=('Sales', 'std')
).round(2)

print("\nSales Analysis by Product and Region:")
print(sales_analysis.head(10))

4️⃣ Apply transform to add group means as new column

What is transform()?

transform() returns a Series/DataFrame with the same index as the original, making it perfect for adding group statistics back to original data.

Adding group means to each row

# Add department average salary to each employee row
df['Dept_Avg_Salary'] = df.groupby('Department')['Salary'].transform('mean')
print("Original with department average salary:")
print(df[['Employee', 'Department', 'Salary', 'Dept_Avg_Salary']].head(10))

Compare individual vs group average

# Calculate difference from department average
df['Salary_vs_Dept_Avg'] = df['Salary'] - df['Dept_Avg_Salary']
print("\nSalary difference from department average:")
print(df[['Employee', 'Department', 'Salary', 'Dept_Avg_Salary', 'Salary_vs_Dept_Avg']].head(10))

Multiple transform operations

# Add multiple group statistics
df['Dept_Median_Salary'] = df.groupby('Department')['Salary'].transform('median')
df['Dept_Salary_Std'] = df.groupby('Department')['Salary'].transform('std')
df['Dept_Max_Performance'] = df.groupby('Department')['Performance_Score'].transform('max')

print("\nMultiple transform columns:")
print(df[['Employee', 'Department', 'Salary', 'Dept_Avg_Salary', 'Dept_Median_Salary', 'Dept_Salary_Std']].head())

Transform with custom functions

# Rank employees within their department by salary
df['Dept_Salary_Rank'] = df.groupby('Department')['Salary'].transform(
    lambda x: x.rank(ascending=False)
)

# Normalize salary within department (0-1 scale)
def normalize_group(x):
    return (x - x.min()) / (x.max() - x.min())

df['Dept_Salary_Normalized'] = df.groupby('Department')['Salary'].transform(normalize_group)

print("\nRank and normalized salary within department:")
print(df[['Employee', 'Department', 'Salary', 'Dept_Salary_Rank', 'Dept_Salary_Normalized']].head())

Percentile rank within group

# Calculate percentile rank within department
df['Dept_Salary_Percentile'] = df.groupby('Department')['Salary'].transform(
    lambda x: x.rank(pct=True)
)

print("\nSalary percentile within department:")
print(df[['Employee', 'Department', 'Salary', 'Dept_Salary_Percentile']].head())

Transform with multiple columns

# Create performance score relative to department
df['Performance_vs_Dept'] = df.groupby('Department')['Performance_Score'].transform(
    lambda x: (x - x.mean()) / x.std()
)

print("\nPerformance z-score within department:")
print(df[['Employee', 'Department', 'Performance_Score', 'Performance_vs_Dept']].head())

5️⃣ Filter groups that meet a condition (e.g., count > 5)

Filter groups by size

# First, create dataset with varying group sizes
df_large = pd.concat([df] * 3, ignore_index=True)  # Triplicate data
print(f"Original size: {len(df)}, New size: {len(df_large)}")

# Add some variation to make groups different
df_large['Department'] = df_large['Department'] + df_large.index.map(
    lambda x: '' if x < 15 else ' Extra' if x < 25 else ' Temp'
)

# Filter departments with more than 5 employees
large_depts = df_large.groupby('Department').filter(lambda x: len(x) > 5)
print("\nDepartments with >5 employees:")
print(large_depts['Department'].value_counts())

Filter by aggregate condition

# Keep only departments with average salary > 60000
high_salary_depts = df.groupby('Department').filter(lambda x: x['Salary'].mean() > 60000)
print("\nDepartments with average salary > 60000:")
print(high_salary_depts[['Department', 'Salary']])
print(f"\nOriginal rows: {len(df)}, Filtered rows: {len(high_salary_depts)}")

Filter by multiple conditions

# Departments with average salary > 60000 AND at least 3 employees
filtered_depts = df.groupby('Department').filter(
    lambda x: (x['Salary'].mean() > 60000) and (len(x) >= 3)
)
print("\nDepartments meeting both conditions:")
print(filtered_depts['Department'].unique())

Filter using any() and all()

# Keep departments where ANY employee has Performance_Score > 90
high_performers = df.groupby('Department').filter(lambda x: (x['Performance_Score'] > 90).any())
print("\nDepartments with any employee scoring >90:")
print(high_performers['Department'].unique())

# Keep departments where ALL employees have Age > 25
all_adults = df.groupby('Department').filter(lambda x: (x['Age'] > 25).all())
print("\nDepartments where all employees >25 years old:")
print(all_adults['Department'].unique())

Real-world filter examples

# Create customer purchase data
np.random.seed(42)
purchase_data = pd.DataFrame({
    'Customer': np.random.choice(['C1', 'C2', 'C3', 'C4', 'C5'], 50),
    'Product': np.random.choice(['P1', 'P2', 'P3'], 50),
    'Amount': np.random.randint(10, 500, 50),
    'Quantity': np.random.randint(1, 10, 50)
})

print("\nPurchase Data Sample:")
print(purchase_data.head())

# Filter customers with total purchase > 1000
high_value_customers = purchase_data.groupby('Customer').filter(
    lambda x: x['Amount'].sum() > 1000
)
print(f"\nCustomers with total >1000: {high_value_customers['Customer'].unique()}")

# Filter customers who bought more than 5 items total
bulk_buyers = purchase_data.groupby('Customer').filter(
    lambda x: x['Quantity'].sum() > 20
)
print(f"Bulk buyers (>20 items): {bulk_buyers['Customer'].unique()}")

# Filter products sold at least 10 times
popular_products = purchase_data.groupby('Product').filter(
    lambda x: len(x) >= 10
)
print(f"Popular products (≥10 sales): {popular_products['Product'].unique()}")

Advanced filtering with quantiles

# Filter groups in top 50% by average salary
median_salary = df.groupby('Department')['Salary'].mean().median()
top_depts = df.groupby('Department').filter(
    lambda x: x['Salary'].mean() > median_salary
)
print("\nDepartments above median salary:")
print(top_depts.groupby('Department')['Salary'].mean())

🎯 Complete Hands-on Checklist

import pandas as pd
import numpy as np

# Load comprehensive dataset
df = pd.DataFrame({
    'Store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard'],
    'Sales': [1000, 200, 150, 1200, 250, 180, 900, 220, 160],
    'Quantity': [10, 20, 15, 12, 25, 18, 9, 22, 16],
    'Profit': [200, 50, 40, 250, 60, 45, 180, 55, 42]
})

print("="*60)
print("1. GROUP BY SINGLE COLUMN - Mean Sales by Store")
print("="*60)
store_sales = df.groupby('Store')['Sales'].mean()
print(store_sales)

print("\n" + "="*60)
print("2. GROUP BY MULTIPLE COLUMNS - Size by Store & Product")
print("="*60)
store_product_counts = df.groupby(['Store', 'Product']).size()
print(store_product_counts)

print("\n" + "="*60)
print("3. AGG() - Multiple functions on different columns")
print("="*60)
store_agg = df.groupby('Store').agg(
    Total_Sales=('Sales', 'sum'),
    Avg_Profit=('Profit', 'mean'),
    Total_Quantity=('Quantity', 'sum'),
    Num_Products=('Product', 'count')
)
print(store_agg)

print("\n" + "="*60)
print("4. TRANSFORM - Add store average sales to each row")
print("="*60)
df['Store_Avg_Sales'] = df.groupby('Store')['Sales'].transform('mean')
df['Sales_vs_Store_Avg'] = df['Sales'] - df['Store_Avg_Sales']
print(df[['Store', 'Product', 'Sales', 'Store_Avg_Sales', 'Sales_vs_Store_Avg']])

print("\n" + "="*60)
print("5. FILTER - Keep stores with total sales > 2000")
print("="*60)
high_sales_stores = df.groupby('Store').filter(lambda x: x['Sales'].sum() > 2000)
print(high_sales_stores[['Store', 'Product', 'Sales']])

❌ Common Mistakes & Solutions

Mistake	Why it fails	Correct way
`df.groupby('col').mean()` returns all columns	Means all numeric columns	Specify column: `df.groupby('col')['target'].mean()`
Forgetting `reset_index()` after groupby	Group keys become index	Use `reset_index()` or `as_index=False`
Using `transform()` when you need `agg()`	Transform returns same shape	Transform = same index, Agg = reduced shape
`filter()` without lambda	Must use function	`groupby().filter(lambda x: condition)`
Confusing `size()` vs `count()`	Size includes NaN, count excludes	Use `size()` for total, `count()` for non-NA

📝 Practice Exercises

Group the employee dataset by Region and calculate mean, median, and std of Salary.
Find which Department-Region combination has the highest average Performance_Score.
Add a column showing how each employee's salary ranks within their department (1 = highest).
Filter to show only departments where the minimum salary is above 55,000.
Using the sales dataset, calculate:
- Total sales per product
- Average quantity per store
- Profit margin (Profit/Sales) per product

✅ Summary of Day 5

Operation	Method	Use Case
Basic grouping	`df.groupby('col')['val'].mean()`	Simple aggregations
Multiple columns	`df.groupby(['col1','col2']).size()`	Hierarchical grouping
Multiple functions	`df.groupby('col').agg(['mean','std'])`	Summary statistics
Custom aggregations	`df.groupby('col').agg(mean=('val','mean'))`	Named outputs
Add group stats	`df.groupby('col')['val'].transform('mean')`	Feature engineering
Filter groups	`df.groupby('col').filter(lambda x: condition)`	Subset data

🎓 Pro Tips

Use as_index=False to keep grouping columns as columns instead of index
Chain operations: df.groupby('dept').filter(cond).groupby('dept').mean()
pd.NamedAgg for explicit column naming in agg()
Use groupby.ngroups to check number of groups
groupby.groups returns dictionary of group indices for debugging

🚀 Advanced Example: Putting It All Together

# Complete analysis pipeline
result = (df
    .groupby(['Store', 'Product'])
    .agg(
        Total_Sales=('Sales', 'sum'),
        Avg_Profit=('Profit', 'mean'),
        Units_Sold=('Quantity', 'sum')
    )
    .round(2)
    .reset_index()
    .assign(Profit_Margin=lambda x: x['Avg_Profit'] / x['Total_Sales'] * 100)
    .sort_values('Total_Sales', ascending=False)
)

print("\nFinal Analysis Pipeline Result:")
print(result)

Pandas Day 5 — GroupBy & Aggregations