Pandas Day 4: apply(), map(), melt() & Pivot Tables

Table of Contents

📌 Day 4: Data Transformation - Apply, Map &… H2

Prerequisite Setup H3

1️⃣ Use apply() to normalize a column to 0-1… H2

What is apply()? H3

Understanding Normalization (Min-Max Scaling) H3

Method 1: Normalize single column with apply… H3

Method 2: More concise with lambda function H3

Method 3: Normalize multiple columns at once H3

Method 4: Row-wise apply (axis=1) H3

Method 5: Vectorized operation (FASTER than … H3

Performance comparison H3

2️⃣ Map categorical values to numeric codes H2

What is map()? H3

Method 1: Simple mapping with dictionary H3

Method 2: Bidirectional mapping H3

Method 3: Map with function for complex logic H3

Method 4: Using replace() for multiple mappi… H3

Method 5: Create categorical codes automatic… H3

Method 6: Map multiple columns at once H3

3️⃣ Rename multiple columns using dictionary H2

Method 1: Basic column renaming H3

Method 2: Rename using function (str methods) H3

Method 3: Rename with string replacement H3

Method 4: Rename with pattern matching H3

Method 5: Selective renaming with condition H3

4️⃣ Convert wide format to long using melt H2

What is melt()? H3

Create a wide-format dataset H3

Method 1: Basic melt (all columns except ide… H3

Method 2: Melt specific columns H3

Method 3: Real-world example - Student scores H3

Method 4: Multiple ID variables H3

Method 5: Handling missing values in melt H3

5️⃣ Create pivot table from long format data H2

What is pivot_table()? H3

Method 1: Basic pivot table H3

Method 2: Student scores pivot table H3

Method 3: Real-world sales analysis H3

Method 4: Hierarchical pivot tables H3

Method 5: Pivot with margins (totals) H3

Method 6: Fill missing values in pivot H3

🎯 Complete Hands-on Checklist H2

❌ Common Mistakes & Solutions H2

📝 Practice Exercises H2

✅ Summary of Day 4 H2

🎓 Pro Tips H2

Quick Stats

Words: 2,326

Est. time: 12 min

Readability: Advanced

Word Cloud

snip scathelessly orthometry DANDYLING easterling politarch CRINATED selt graviers disamenity COGNIZABLY archly HYLARCHICAL taurodont realmless

📌 Day 4: Data Transformation - Apply, Map & Reshape

Prerequisite Setup

import pandas as pd
import numpy as np

# Create sample datasets for all examples
np.random.seed(42)

# Main dataset for apply() and map()
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT'],
    'Salary': [50000, 60000, 75000, 55000, 68000],
    'Age': [25, 30, 35, 28, 32],
    'Experience': [2, 5, 8, 3, 6],
    'Performance': [85, 92, 78, 88, 95]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*60)

1️⃣ Use `apply()` to normalize a column to 0-1 range

What is `apply()`?

apply() applies a function along an axis of the DataFrame (rows or columns). It's extremely flexible for custom transformations.

Understanding Normalization (Min-Max Scaling)

Formula: (x - min) / (max - min) → converts any range to [0, 1]

Method 1: Normalize single column with `apply()`

# Normalize Salary to 0-1 range
def normalize_series(x):
    """Normalize a single value to 0-1 range"""
    min_val = df['Salary'].min()
    max_val = df['Salary'].max()
    return (x - min_val) / (max_val - min_val)

# Apply to Salary column
df['Salary_Normalized'] = df['Salary'].apply(normalize_series)

print("Normalized Salary (0-1 range):")
print(df[['Name', 'Salary', 'Salary_Normalized']])

Output:

    Name  Salary  Salary_Normalized
0   Alice   50000                0.0
1     Bob   60000                0.4
2  Charlie   75000                1.0
3   Diana   55000                0.2
4     Eve   68000                0.72

Method 2: More concise with lambda function

# Using lambda (one-liner)
min_sal = df['Salary'].min()
max_sal = df['Salary'].max()

df['Salary_Norm_Lambda'] = df['Salary'].apply(lambda x: (x - min_sal) / (max_sal - min_sal))
print("\nWith lambda function:")
print(df[['Name', 'Salary', 'Salary_Norm_Lambda']])

Method 3: Normalize multiple columns at once

# Normalize all numeric columns
def normalize_column(column):
    return (column - column.min()) / (column.max() - column.min())

# Apply to multiple columns
numeric_cols = ['Salary', 'Age', 'Experience', 'Performance']
df_normalized = df.copy()
df_normalized[numeric_cols] = df_normalized[numeric_cols].apply(normalize_column)

print("\nAll columns normalized (0-1 range):")
print(df_normalized[['Name'] + numeric_cols])

Method 4: Row-wise apply (axis=1)

# Create a composite score using row-wise operation
def calculate_composite_score(row):
    """Calculate weighted score from multiple columns"""
    salary_weight = 0.3
    experience_weight = 0.4
    performance_weight = 0.3
    
    # Normalize within row context (using global min/max)
    salary_norm = (row['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())
    exp_norm = (row['Experience'] - df['Experience'].min()) / (df['Experience'].max() - df['Experience'].min())
    perf_norm = (row['Performance'] - df['Performance'].min()) / (df['Performance'].max() - df['Performance'].min())
    
    return (salary_weight * salary_norm + 
            experience_weight * exp_norm + 
            performance_weight * perf_norm)

df['Composite_Score'] = df.apply(calculate_composite_score, axis=1)
print("\nComposite scores (row-wise apply):")
print(df[['Name', 'Salary', 'Experience', 'Performance', 'Composite_Score']])

Method 5: Vectorized operation (FASTER than apply)

# Vectorized normalization (no apply needed - much faster for large data)
min_sal = df['Salary'].min()
max_sal = df['Salary'].max()
df['Salary_Vectorized'] = (df['Salary'] - min_sal) / (max_sal - min_sal)

print("\nVectorized normalization (faster):")
print(df[['Name', 'Salary', 'Salary_Vectorized']])

Performance comparison

# Create large dataset for performance test
large_df = pd.DataFrame({'values': np.random.randn(100000)})

# Using apply
%timeit large_df['values'].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

# Using vectorized
%timeit (large_df['values'] - large_df['values'].min()) / (large_df['values'].max() - large_df['values'].min())

# Vectorized is 10-100x faster!

2️⃣ Map categorical values to numeric codes

What is `map()`?

map() replaces values in a Series using a dictionary or function.

Method 1: Simple mapping with dictionary

# Map Department to numeric codes
dept_mapping = {
    'Sales': 1,
    'IT': 2,
    'HR': 3
}

df['Dept_Code'] = df['Department'].map(dept_mapping)

print("Department to Numeric Mapping:")
print(df[['Name', 'Department', 'Dept_Code']])

Method 2: Bidirectional mapping

# Create reverse mapping for later use
reverse_mapping = {v: k for k, v in dept_mapping.items()}

# Map performance ratings
performance_mapping = {
    85: 'Good',
    92: 'Excellent',
    78: 'Average',
    88: 'Good',
    95: 'Excellent'
}

df['Performance_Rating'] = df['Performance'].map(performance_mapping)

print("\nPerformance numeric to text:")
print(df[['Name', 'Performance', 'Performance_Rating']])

Method 3: Map with function for complex logic

# Map Age to age groups using function
def age_group(age):
    if age < 28:
        return 'Young'
    elif age < 33:
        return 'Mid'
    else:
        return 'Senior'

df['Age_Group'] = df['Age'].map(age_group)

print("\nAge groups:")
print(df[['Name', 'Age', 'Age_Group']])

Method 4: Using `replace()` for multiple mappings

# Replace multiple values at once
df['Salary_Bracket'] = df['Salary'].replace({
    50000: 'Entry',
    55000: 'Junior',
    60000: 'Mid',
    68000: 'Senior',
    75000: 'Expert'
})

print("\nSalary brackets:")
print(df[['Name', 'Salary', 'Salary_Bracket']])

Method 5: Create categorical codes automatically

# Convert to categorical and get codes
df['Department_Category'] = pd.Categorical(df['Department'])
df['Dept_Auto_Code'] = df['Department_Category'].cat.codes

print("\nAutomatic categorical coding:")
print(df[['Name', 'Department', 'Dept_Auto_Code']])

Method 6: Map multiple columns at once

# Create a more complex mapping example
df_extended = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Quality': ['High', 'Medium', 'Low', 'High'],
    'Region': ['North', 'South', 'East', 'West']
})

# Multiple mappings
quality_score = {'High': 3, 'Medium': 2, 'Low': 1}
region_code = {'North': 1, 'South': 2, 'East': 3, 'West': 4}

df_extended['Quality_Score'] = df_extended['Quality'].map(quality_score)
df_extended['Region_Code'] = df_extended['Region'].map(region_code)

print("\nMultiple column mapping:")
print(df_extended)

3️⃣ Rename multiple columns using dictionary

Method 1: Basic column renaming

# Create a fresh copy
df_renamed = df.copy()

# Rename using dictionary
column_mapping = {
    'Name': 'Employee_Name',
    'Dept_Code': 'Department_Code',
    'Performance_Rating': 'Rating'
}

df_renamed = df_renamed.rename(columns=column_mapping)
print("Renamed columns:")
print(df_renamed.columns.tolist())
print("\nFirst few rows:")
print(df_renamed.head())

Method 2: Rename using function (str methods)

# Convert all column names to uppercase
df_upper = df.rename(columns=str.upper)
print("\nUppercase columns:")
print(df_upper.columns.tolist())

# Convert to lowercase
df_lower = df.rename(columns=str.lower)
print("\nLowercase columns:")
print(df_lower.columns.tolist())

Method 3: Rename with string replacement

# Replace spaces with underscores
df_spaces = pd.DataFrame({
    'First Name': ['Alice', 'Bob'],
    'Last Name': ['Smith', 'Jones'],
    'Annual Salary': [50000, 60000]
})

print("\nOriginal (with spaces):")
print(df_spaces.columns.tolist())

# Remove spaces
df_clean = df_spaces.rename(columns=lambda x: x.replace(' ', '_'))
print("Cleaned:")
print(df_clean.columns.tolist())

Method 4: Rename with pattern matching

# Add prefix to all columns
df_prefixed = df.rename(columns=lambda x: f'emp_{x}')
print("\nWith prefix:")
print(df_prefixed.columns.tolist())

# Remove prefix
df_unprefixed = df_prefixed.rename(columns=lambda x: x.replace('emp_', ''))
print("Prefix removed:")
print(df_unprefixed.columns.tolist())

Method 5: Selective renaming with condition

# Rename only columns with certain condition
def rename_if_long(col):
    if len(col) > 8:
        return col[:8]  # Truncate long names
    return col

df_truncated = df.rename(columns=rename_if_long)
print("\nTruncated long names:")
print(df_truncated.columns.tolist())

4️⃣ Convert wide format to long using melt

What is `melt()`?

melt() unpivots data from wide format (many columns) to long format (fewer columns, more rows). Essential for data analysis and visualization.

Create a wide-format dataset

# Create sales data in wide format
sales_wide = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard'],
    'Q1_Sales': [1000, 500, 300],
    'Q2_Sales': [1200, 550, 320],
    'Q3_Sales': [1100, 600, 350],
    'Q4_Sales': [1300, 580, 340]
})

print("Wide Format Data (one row per product):")
print(sales_wide)

Method 1: Basic melt (all columns except identifier)

# Melt from wide to long
sales_long = pd.melt(
    sales_wide,
    id_vars=['Product'],  # Columns to keep as identifiers
    var_name='Quarter',    # Name for the variable column
    value_name='Sales'     # Name for the value column
)

print("\nLong Format Data (one row per product-quarter):")
print(sales_long)

Method 2: Melt specific columns

# Melt only specific columns
sales_partial = pd.melt(
    sales_wide,
    id_vars=['Product'],
    value_vars=['Q1_Sales', 'Q2_Sales'],  # Only melt these
    var_name='Quarter',
    value_name='Sales'
)

print("\nMelt only Q1 and Q2:")
print(sales_partial)

Method 3: Real-world example - Student scores

# Student scores in wide format
scores_wide = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie'],
    'Math_Score': [85, 92, 78],
    'Science_Score': [88, 89, 82],
    'English_Score': [90, 85, 88]
})

print("\nStudent scores (wide):")
print(scores_wide)

# Melt to long format for easier analysis
scores_long = pd.melt(
    scores_wide,
    id_vars=['Student'],
    var_name='Subject',
    value_name='Score'
)

# Clean up subject names
scores_long['Subject'] = scores_long['Subject'].str.replace('_Score', '')

print("\nStudent scores (long format - better for plotting):")
print(scores_long)

# Now easy to group by subject
print("\nAverage score by subject:")
print(scores_long.groupby('Subject')['Score'].mean())

Method 4: Multiple ID variables

# Multi-dimensional data
multi_wide = pd.DataFrame({
    'Year': [2020, 2021, 2022],
    'City': ['NYC', 'LA', 'Chicago'],
    'Product_A': [100, 110, 120],
    'Product_B': [200, 210, 220]
})

print("\nMulti-dimensional wide data:")
print(multi_wide)

# Melt with multiple ID vars
multi_long = pd.melt(
    multi_wide,
    id_vars=['Year', 'City'],
    var_name='Product',
    value_name='Sales'
)

print("\nMelted with multiple IDs:")
print(multi_long)

Method 5: Handling missing values in melt

# Data with missing values
wide_with_na = pd.DataFrame({
    'Product': ['A', 'B'],
    'Store1': [100, np.nan],
    'Store2': [150, 200],
    'Store3': [np.nan, 250]
})

print("\nWide with missing values:")
print(wide_with_na)

# Melt and then handle missing
long_with_na = pd.melt(
    wide_with_na,
    id_vars=['Product'],
    var_name='Store',
    value_name='Sales'
)

print("\nLong format with NAs:")
print(long_with_na)

# Remove NA rows
long_clean = long_with_na.dropna()
print("\nAfter removing NAs:")
print(long_clean)

5️⃣ Create pivot table from long format data

What is `pivot_table()`?

pivot_table() creates a spreadsheet-style pivot table from long-format data. It's the inverse of melt().

Method 1: Basic pivot table

# Use the long format data from previous example
print("Original long format:")
print(sales_long.head())

# Pivot back to wide format
sales_pivoted = sales_long.pivot_table(
    index='Product',      # Rows
    columns='Quarter',    # Columns
    values='Sales',       # Values to aggregate
    aggfunc='sum'         # Aggregation function
)

print("\nPivoted back to wide format:")
print(sales_pivoted)

Method 2: Student scores pivot table

# Using the scores_long data
print("Original scores_long:")
print(scores_long)

# Create pivot table
score_pivot = scores_long.pivot_table(
    index='Student',
    columns='Subject',
    values='Score'
)

print("\nPivot table - Student scores by subject:")
print(score_pivot)

# Add row and column totals
score_pivot['Average'] = score_pivot.mean(axis=1)
print("\nWith student averages:")
print(score_pivot)

Method 3: Real-world sales analysis

# Create realistic sales data
sales_data = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=12, freq='M'),
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West', 'North', 'South', 'East', 'West'],
    'Sales': [100, 150, 120, 130, 110, 140, 130, 160, 105, 125, 115, 145]
})

print("\nSales data (long format):")
print(sales_data.head(10))

# Pivot by Region and Product
sales_by_region_product = sales_data.pivot_table(
    index='Region',
    columns='Product',
    values='Sales',
    aggfunc='mean'  # Average sales
)

print("\nAverage sales by Region and Product:")
print(sales_by_region_product)

# Pivot with multiple aggregations
sales_multi_agg = sales_data.pivot_table(
    index='Region',
    columns='Product',
    values='Sales',
    aggfunc=['mean', 'sum', 'count']
)

print("\nMultiple aggregations:")
print(sales_multi_agg)

Method 4: Hierarchical pivot tables

# Create more complex data
complex_data = pd.DataFrame({
    'Year': [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
    'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
    'Revenue': [100, 110, 120, 130, 140, 150, 160, 170]
})

print("\nComplex hierarchical data:")
print(complex_data)

# Multi-level pivot
complex_pivot = complex_data.pivot_table(
    index=['Year', 'Quarter'],
    columns=['Product', 'Region'],
    values='Revenue',
    aggfunc='sum'
)

print("\nHierarchical pivot table:")
print(complex_pivot)

Method 5: Pivot with margins (totals)

# Add total rows and columns
pivot_with_margins = sales_data.pivot_table(
    index='Region',
    columns='Product',
    values='Sales',
    aggfunc='sum',
    margins=True,           # Add totals
    margins_name='Total'    # Name for total row/col
)

print("\nPivot table with totals:")
print(pivot_with_margins)

Method 6: Fill missing values in pivot

# Data with missing combinations
incomplete_data = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'C', 'C'],
    'Store': ['X', 'Y', 'X', 'X', 'Z'],
    'Sales': [100, 150, 200, 120, 180]
})

print("\nIncomplete data:")
print(incomplete_data)

# Pivot with fill value
pivot_filled = incomplete_data.pivot_table(
    index='Product',
    columns='Store',
    values='Sales',
    aggfunc='sum',
    fill_value=0  # Fill missing with 0
)

print("\nPivot with missing filled:")
print(pivot_filled)

🎯 Complete Hands-on Checklist

import pandas as pd
import numpy as np

# Create comprehensive dataset
df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT'],
    'Base_Salary': [50000, 60000, 75000, 55000, 68000],
    'Bonus': [5000, 6000, 7500, 5500, 6800],
    'Experience': [2, 5, 8, 3, 6],
    'Performance': ['A', 'B', 'C', 'B', 'A']
})

print("="*60)
print("1. APPLY - Normalize Base_Salary to 0-1")
print("="*60)
min_sal = df['Base_Salary'].min()
max_sal = df['Base_Salary'].max()
df['Salary_Norm'] = df['Base_Salary'].apply(lambda x: (x - min_sal) / (max_sal - min_sal))
print(df[['Employee', 'Base_Salary', 'Salary_Norm']])

print("\n" + "="*60)
print("2. MAP - Convert Performance to numeric codes")
print("="*60)
perf_mapping = {'A': 5, 'B': 3, 'C': 1}
df['Performance_Code'] = df['Performance'].map(perf_mapping)
print(df[['Employee', 'Performance', 'Performance_Code']])

print("\n" + "="*60)
print("3. RENAME - Multiple columns")
print("="*60)
df_renamed = df.rename(columns={
    'Employee': 'Emp_Name',
    'Base_Salary': 'Salary',
    'Performance_Code': 'Perf_Score'
})
print("New column names:", df_renamed.columns.tolist())

print("\n" + "="*60)
print("4. MELT - Convert wide to long")
print("="*60)
salary_wide = df[['Employee', 'Base_Salary', 'Bonus']]
salary_long = pd.melt(
    salary_wide,
    id_vars=['Employee'],
    var_name='Compensation_Type',
    value_name='Amount'
)
print(salary_long)

print("\n" + "="*60)
print("5. PIVOT_TABLE - Create summary")
print("="*60)
pivot_summary = df.pivot_table(
    index='Department',
    columns='Performance',
    values='Base_Salary',
    aggfunc='mean'
)
print(pivot_summary)

❌ Common Mistakes & Solutions

Mistake	Why it fails	Correct way
Using `apply()` when vectorized works	Very slow for large data	Use vectorized operations when possible
Forgetting `axis=1` for row operations	Applies to columns by default	`df.apply(func, axis=1)` for rows
`map()` returns NaN for unmapped values	Missing keys return NaN	Use `fillna()` after mapping or ensure all keys mapped
`melt()` without `id_vars`	Melts everything	Specify which columns to keep as identifiers
Pivot without `aggfunc`	Error with duplicate entries	Always specify `aggfunc` (usually 'sum' or 'mean')

📝 Practice Exercises

Create a function to normalize Age to 0-1 range using apply()
Map Education levels ['High School', 'Bachelor', 'Master', 'PhD'] to [1,2,3,4]
Rename columns to snake_case from CamelCase

Convert this wide data to long:

data = {'Student': ['A','B'], 'Math':[90,85], 'Science':[88,92]}

Create a pivot table showing average sales by Region and Product category

✅ Summary of Day 4

Operation	Method	Key Parameters
Apply function	`df.apply(func)`	`axis=0/1`
Map values	`series.map(dict)`	Dictionary mapping
Rename columns	`df.rename(columns=dict)`	`columns`, `inplace`
Wide to long	`pd.melt()`	`id_vars`, `var_name`, `value_name`
Long to wide	`df.pivot_table()`	`index`, `columns`, `values`, `aggfunc`

🎓 Pro Tips

Prefer vectorized operations over apply() for simple math (10-100x faster)
Chain transformations for clean code: df.rename().pipe(another_func)
Use pipe() for complex transformation pipelines
Always verify your melt/pivot operations with small test data first
pd.cut() is better than custom mapping for binning continuous data

Pandas Day 4 — Data Transformation: Apply, Map & Reshape

📌 Day 4: Data Transformation - Apply, Map & Reshape

Prerequisite Setup

1️⃣ Use apply() to normalize a column to 0-1 range

What is apply()?

Understanding Normalization (Min-Max Scaling)

Method 1: Normalize single column with apply()