📌 Day 5: GroupBy & Aggregations
Prerequisite Setup
import pandas as pd import numpy as np # Create a rich dataset for groupby examples np.random.seed(42) # Employee dataset df = pd.DataFrame({ 'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'], 'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR'], 'Region': ['North', 'South', 'North', 'East', 'West', 'South', 'East', 'West', 'North', 'East'], 'Salary': [50000, 60000, 75000, 55000, 68000, 72000, 58000, 65000, 71000, 53000], 'Age': [25, 30, 35, 28, 32, 40, 33, 29, 31, 27], 'Experience': [2, 5, 8, 3, 6, 12, 4, 5, 7, 2], 'Performance_Score': [85, 92, 78, 88, 95, 82, 90, 87, 91, 84] }) print("Original Dataset:") print(df) print("\n" + "="*60)
1️⃣ Group data by a single column and compute mean
Understanding GroupBy: Split-Apply-Combine
Split data into groups based on criteria
Apply a function to each group
Combine results into a new DataFrame
Basic Single Column GroupBy
# Group by Department and calculate mean Salary dept_mean_salary = df.groupby('Department')['Salary'].mean() print("Average Salary by Department:") print(dept_mean_salary)
Output:
Department HR 55333.33 IT 64333.33 Sales 67000.00 Name: Salary, dtype: float64
Multiple aggregations on single group
# Group by Department and get mean of multiple columns dept_stats = df.groupby('Department')[['Salary', 'Age', 'Performance_Score']].mean() print("\nMultiple column means by Department:") print(dept_stats.round(2))
Different aggregations for different columns
# Using agg() for more control dept_aggregations = df.groupby('Department').agg({ 'Salary': 'mean', 'Age': 'median', 'Performance_Score': 'max', 'Experience': 'min' }) print("\nDifferent aggregations per column:") print(dept_aggregations.round(2))
Group by with describe() (quick statistics)
# Get full statistical summary by group dept_summary = df.groupby('Department')['Salary'].describe() print("\nSalary statistics by Department:") print(dept_summary)
2️⃣ Group by multiple columns and get size()
Group by two columns
# Group by Department AND Region dept_region_stats = df.groupby(['Department', 'Region'])['Salary'].mean() print("Average Salary by Department and Region:") print(dept_region_stats)
Output:
Department Region
HR East 58000.00
East? Actually let's see actual data...Get group sizes (count per group)
# Count employees in each Department-Region combination group_counts = df.groupby(['Department', 'Region']).size() print("\nNumber of employees by Department and Region:") print(group_counts) # As DataFrame with column name group_counts_df = df.groupby(['Department', 'Region']).size().reset_index(name='Count') print("\nAs DataFrame:") print(group_counts_df)
Multiple aggregations with size
# Get mean salary AND count of employees dept_region_agg = df.groupby(['Department', 'Region']).agg({ 'Salary': ['mean', 'count'], 'Performance_Score': 'mean' }) print("\nMultiple aggregations:") print(dept_region_agg.round(2))
Using value_counts() as alternative
# Quick way to get frequencies dept_counts = df['Department'].value_counts() print("\nDepartment counts using value_counts():") print(dept_counts) # Two-way frequency table region_dept_counts = pd.crosstab(df['Region'], df['Department']) print("\nCross tabulation (Region × Department):") print(region_dept_counts)
Group by with nunique() - count unique values
# Count unique employees per department (though all are unique here) # Better example - count unique regions per department unique_regions = df.groupby('Department')['Region'].nunique() print("\nNumber of unique regions per department:") print(unique_regions)
3️⃣ Use agg() with multiple functions on different columns
Basic agg() with multiple functions
# Apply multiple functions to same column dept_salary_stats = df.groupby('Department')['Salary'].agg(['mean', 'median', 'std', 'min', 'max']) print("Salary statistics by department:") print(dept_salary_stats.round(2))
Different functions for different columns
# Custom aggregation per column dept_custom_agg = df.groupby('Department').agg( Avg_Salary=('Salary', 'mean'), Max_Performance=('Performance_Score', 'max'), Min_Age=('Age', 'min'), Total_Experience=('Experience', 'sum'), Employee_Count=('Employee', 'count') ) print("\nCustom named aggregations:") print(dept_custom_agg.round(2))
Using lambda functions in agg()
# Custom aggregation functions dept_lambda_agg = df.groupby('Department').agg( Salary_Range=('Salary', lambda x: x.max() - x.min()), Performance_Variance=('Performance_Score', lambda x: x.var()), Age_IQR=('Age', lambda x: x.quantile(0.75) - x.quantile(0.25)) ) print("\nLambda aggregations:") print(dept_lambda_agg.round(2))
Multiple functions on multiple columns
# Complex aggregation with multiple columns and functions complex_agg = df.groupby('Department')[['Salary', 'Performance_Score']].agg( ['mean', 'std', 'min', 'max'] ) print("\nMultiple functions on multiple columns:") print(complex_agg.round(2)) # Flatten column names complex_agg.columns = ['_'.join(col).strip() for col in complex_agg.columns.values] print("\nWith flattened column names:") print(complex_agg.round(2))
Using numpy functions in agg()
import numpy as np dept_numpy_agg = df.groupby('Department').agg( Salary_Mean=np.mean, Salary_Median=np.median, Salary_Std=np.std, Performance_Q3=('Performance_Score', lambda x: np.percentile(x, 75)) ) print("\nUsing numpy functions:") print(dept_numpy_agg.round(2))
Real-world example: Sales analysis
# Create sales dataset sales_data = pd.DataFrame({ 'Date': pd.date_range('2024-01-01', periods=100, freq='D'), 'Product': np.random.choice(['A', 'B', 'C'], 100), 'Region': np.random.choice(['North', 'South', 'East', 'West'], 100), 'Sales': np.random.randint(100, 1000, 100), 'Quantity': np.random.randint(1, 20, 100) }) # Complex aggregation sales_analysis = sales_data.groupby(['Product', 'Region']).agg( Total_Sales=('Sales', 'sum'), Avg_Sales=('Sales', 'mean'), Total_Quantity=('Quantity', 'sum'), Avg_Quantity=('Quantity', 'mean'), Num_Transactions=('Sales', 'count'), Sales_Std=('Sales', 'std') ).round(2) print("\nSales Analysis by Product and Region:") print(sales_analysis.head(10))
4️⃣ Apply transform to add group means as new column
What is transform()?
transform() returns a Series/DataFrame with the same index as the original, making it perfect for adding group statistics back to original data.
Adding group means to each row
# Add department average salary to each employee row df['Dept_Avg_Salary'] = df.groupby('Department')['Salary'].transform('mean') print("Original with department average salary:") print(df[['Employee', 'Department', 'Salary', 'Dept_Avg_Salary']].head(10))
Compare individual vs group average
# Calculate difference from department average df['Salary_vs_Dept_Avg'] = df['Salary'] - df['Dept_Avg_Salary'] print("\nSalary difference from department average:") print(df[['Employee', 'Department', 'Salary', 'Dept_Avg_Salary', 'Salary_vs_Dept_Avg']].head(10))
Multiple transform operations
# Add multiple group statistics df['Dept_Median_Salary'] = df.groupby('Department')['Salary'].transform('median') df['Dept_Salary_Std'] = df.groupby('Department')['Salary'].transform('std') df['Dept_Max_Performance'] = df.groupby('Department')['Performance_Score'].transform('max') print("\nMultiple transform columns:") print(df[['Employee', 'Department', 'Salary', 'Dept_Avg_Salary', 'Dept_Median_Salary', 'Dept_Salary_Std']].head())
Transform with custom functions
# Rank employees within their department by salary df['Dept_Salary_Rank'] = df.groupby('Department')['Salary'].transform( lambda x: x.rank(ascending=False) ) # Normalize salary within department (0-1 scale) def normalize_group(x): return (x - x.min()) / (x.max() - x.min()) df['Dept_Salary_Normalized'] = df.groupby('Department')['Salary'].transform(normalize_group) print("\nRank and normalized salary within department:") print(df[['Employee', 'Department', 'Salary', 'Dept_Salary_Rank', 'Dept_Salary_Normalized']].head())
Percentile rank within group
# Calculate percentile rank within department df['Dept_Salary_Percentile'] = df.groupby('Department')['Salary'].transform( lambda x: x.rank(pct=True) ) print("\nSalary percentile within department:") print(df[['Employee', 'Department', 'Salary', 'Dept_Salary_Percentile']].head())
Transform with multiple columns
# Create performance score relative to department df['Performance_vs_Dept'] = df.groupby('Department')['Performance_Score'].transform( lambda x: (x - x.mean()) / x.std() ) print("\nPerformance z-score within department:") print(df[['Employee', 'Department', 'Performance_Score', 'Performance_vs_Dept']].head())
5️⃣ Filter groups that meet a condition (e.g., count > 5)
Filter groups by size
# First, create dataset with varying group sizes df_large = pd.concat([df] * 3, ignore_index=True) # Triplicate data print(f"Original size: {len(df)}, New size: {len(df_large)}") # Add some variation to make groups different df_large['Department'] = df_large['Department'] + df_large.index.map( lambda x: '' if x < 15 else ' Extra' if x < 25 else ' Temp' ) # Filter departments with more than 5 employees large_depts = df_large.groupby('Department').filter(lambda x: len(x) > 5) print("\nDepartments with >5 employees:") print(large_depts['Department'].value_counts())
Filter by aggregate condition
# Keep only departments with average salary > 60000 high_salary_depts = df.groupby('Department').filter(lambda x: x['Salary'].mean() > 60000) print("\nDepartments with average salary > 60000:") print(high_salary_depts[['Department', 'Salary']]) print(f"\nOriginal rows: {len(df)}, Filtered rows: {len(high_salary_depts)}")
Filter by multiple conditions
# Departments with average salary > 60000 AND at least 3 employees filtered_depts = df.groupby('Department').filter( lambda x: (x['Salary'].mean() > 60000) and (len(x) >= 3) ) print("\nDepartments meeting both conditions:") print(filtered_depts['Department'].unique())
Filter using any() and all()
# Keep departments where ANY employee has Performance_Score > 90 high_performers = df.groupby('Department').filter(lambda x: (x['Performance_Score'] > 90).any()) print("\nDepartments with any employee scoring >90:") print(high_performers['Department'].unique()) # Keep departments where ALL employees have Age > 25 all_adults = df.groupby('Department').filter(lambda x: (x['Age'] > 25).all()) print("\nDepartments where all employees >25 years old:") print(all_adults['Department'].unique())
Real-world filter examples
# Create customer purchase data np.random.seed(42) purchase_data = pd.DataFrame({ 'Customer': np.random.choice(['C1', 'C2', 'C3', 'C4', 'C5'], 50), 'Product': np.random.choice(['P1', 'P2', 'P3'], 50), 'Amount': np.random.randint(10, 500, 50), 'Quantity': np.random.randint(1, 10, 50) }) print("\nPurchase Data Sample:") print(purchase_data.head()) # Filter customers with total purchase > 1000 high_value_customers = purchase_data.groupby('Customer').filter( lambda x: x['Amount'].sum() > 1000 ) print(f"\nCustomers with total >1000: {high_value_customers['Customer'].unique()}") # Filter customers who bought more than 5 items total bulk_buyers = purchase_data.groupby('Customer').filter( lambda x: x['Quantity'].sum() > 20 ) print(f"Bulk buyers (>20 items): {bulk_buyers['Customer'].unique()}") # Filter products sold at least 10 times popular_products = purchase_data.groupby('Product').filter( lambda x: len(x) >= 10 ) print(f"Popular products (≥10 sales): {popular_products['Product'].unique()}")
Advanced filtering with quantiles
# Filter groups in top 50% by average salary median_salary = df.groupby('Department')['Salary'].mean().median() top_depts = df.groupby('Department').filter( lambda x: x['Salary'].mean() > median_salary ) print("\nDepartments above median salary:") print(top_depts.groupby('Department')['Salary'].mean())
🎯 Complete Hands-on Checklist
import pandas as pd import numpy as np # Load comprehensive dataset df = pd.DataFrame({ 'Store': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'], 'Product': ['Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard', 'Laptop', 'Mouse', 'Keyboard'], 'Sales': [1000, 200, 150, 1200, 250, 180, 900, 220, 160], 'Quantity': [10, 20, 15, 12, 25, 18, 9, 22, 16], 'Profit': [200, 50, 40, 250, 60, 45, 180, 55, 42] }) print("="*60) print("1. GROUP BY SINGLE COLUMN - Mean Sales by Store") print("="*60) store_sales = df.groupby('Store')['Sales'].mean() print(store_sales) print("\n" + "="*60) print("2. GROUP BY MULTIPLE COLUMNS - Size by Store & Product") print("="*60) store_product_counts = df.groupby(['Store', 'Product']).size() print(store_product_counts) print("\n" + "="*60) print("3. AGG() - Multiple functions on different columns") print("="*60) store_agg = df.groupby('Store').agg( Total_Sales=('Sales', 'sum'), Avg_Profit=('Profit', 'mean'), Total_Quantity=('Quantity', 'sum'), Num_Products=('Product', 'count') ) print(store_agg) print("\n" + "="*60) print("4. TRANSFORM - Add store average sales to each row") print("="*60) df['Store_Avg_Sales'] = df.groupby('Store')['Sales'].transform('mean') df['Sales_vs_Store_Avg'] = df['Sales'] - df['Store_Avg_Sales'] print(df[['Store', 'Product', 'Sales', 'Store_Avg_Sales', 'Sales_vs_Store_Avg']]) print("\n" + "="*60) print("5. FILTER - Keep stores with total sales > 2000") print("="*60) high_sales_stores = df.groupby('Store').filter(lambda x: x['Sales'].sum() > 2000) print(high_sales_stores[['Store', 'Product', 'Sales']])
❌ Common Mistakes & Solutions
| Mistake | Why it fails | Correct way |
|---|---|---|
df.groupby('col').mean() returns all columns | Means all numeric columns | Specify column: df.groupby('col')['target'].mean() |
Forgetting reset_index() after groupby | Group keys become index | Use reset_index() or as_index=False |
Using transform() when you need agg() | Transform returns same shape | Transform = same index, Agg = reduced shape |
filter() without lambda | Must use function | groupby().filter(lambda x: condition) |
Confusing size() vs count() | Size includes NaN, count excludes | Use size() for total, count() for non-NA |
📝 Practice Exercises
Group the employee dataset by Region and calculate mean, median, and std of Salary.
Find which Department-Region combination has the highest average Performance_Score.
Add a column showing how each employee's salary ranks within their department (1 = highest).
Filter to show only departments where the minimum salary is above 55,000.
Using the sales dataset, calculate:
Total sales per product
Average quantity per store
Profit margin (Profit/Sales) per product
✅ Summary of Day 5
| Operation | Method | Use Case |
|---|---|---|
| Basic grouping | df.groupby('col')['val'].mean() | Simple aggregations |
| Multiple columns | df.groupby(['col1','col2']).size() | Hierarchical grouping |
| Multiple functions | df.groupby('col').agg(['mean','std']) | Summary statistics |
| Custom aggregations | df.groupby('col').agg(mean=('val','mean')) | Named outputs |
| Add group stats | df.groupby('col')['val'].transform('mean') | Feature engineering |
| Filter groups | df.groupby('col').filter(lambda x: condition) | Subset data |
🎓 Pro Tips
Use
as_index=Falseto keep grouping columns as columns instead of indexChain operations:
df.groupby('dept').filter(cond).groupby('dept').mean()pd.NamedAggfor explicit column naming in agg()Use
groupby.ngroupsto check number of groupsgroupby.groupsreturns dictionary of group indices for debugging
🚀 Advanced Example: Putting It All Together
# Complete analysis pipeline result = (df .groupby(['Store', 'Product']) .agg( Total_Sales=('Sales', 'sum'), Avg_Profit=('Profit', 'mean'), Units_Sold=('Quantity', 'sum') ) .round(2) .reset_index() .assign(Profit_Margin=lambda x: x['Avg_Profit'] / x['Total_Sales'] * 100) .sort_values('Total_Sales', ascending=False) ) print("\nFinal Analysis Pipeline Result:") print(result)
Discussion (0)