MadiisAttendance
Data Transformation Techniques in Pandas 12 min read 2,326 words 151 sentences uni

Pandas Day 4 — Data Transformation: Apply, Map & Reshape

M Usman May 26, 2026
6 0 0 score
0%
Your Progress
0/46 sections
Reading Speed
0
words/min
Time Spent
00:00
Table of Contents

Quick Stats
Words: 2,326
Est. time: 12 min
Readability: Advanced

Word Cloud
snip scathelessly orthometry DANDYLING easterling politarch CRINATED selt graviers disamenity COGNIZABLY archly HYLARCHICAL taurodont realmless

📌 Day 4: Data Transformation - Apply, Map & Reshape

Prerequisite Setup

python
import pandas as pd
import numpy as np

# Create sample datasets for all examples
np.random.seed(42)

# Main dataset for apply() and map()
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT'],
    'Salary': [50000, 60000, 75000, 55000, 68000],
    'Age': [25, 30, 35, 28, 32],
    'Experience': [2, 5, 8, 3, 6],
    'Performance': [85, 92, 78, 88, 95]
})

print("Original DataFrame:")
print(df)
print("\n" + "="*60)

1️⃣ Use apply() to normalize a column to 0-1 range

What is apply()?

apply() applies a function along an axis of the DataFrame (rows or columns). It's extremely flexible for custom transformations.

Understanding Normalization (Min-Max Scaling)

Formula: (x - min) / (max - min) → converts any range to [0, 1]

Method 1: Normalize single column with apply()

python
# Normalize Salary to 0-1 range
def normalize_series(x):
    """Normalize a single value to 0-1 range"""
    min_val = df['Salary'].min()
    max_val = df['Salary'].max()
    return (x - min_val) / (max_val - min_val)

# Apply to Salary column
df['Salary_Normalized'] = df['Salary'].apply(normalize_series)

print("Normalized Salary (0-1 range):")
print(df[['Name', 'Salary', 'Salary_Normalized']])

Output:

text
    Name  Salary  Salary_Normalized
0   Alice   50000                0.0
1     Bob   60000                0.4
2  Charlie   75000                1.0
3   Diana   55000                0.2
4     Eve   68000                0.72

Method 2: More concise with lambda function

python
# Using lambda (one-liner)
min_sal = df['Salary'].min()
max_sal = df['Salary'].max()

df['Salary_Norm_Lambda'] = df['Salary'].apply(lambda x: (x - min_sal) / (max_sal - min_sal))
print("\nWith lambda function:")
print(df[['Name', 'Salary', 'Salary_Norm_Lambda']])

Method 3: Normalize multiple columns at once

python
# Normalize all numeric columns
def normalize_column(column):
    return (column - column.min()) / (column.max() - column.min())

# Apply to multiple columns
numeric_cols = ['Salary', 'Age', 'Experience', 'Performance']
df_normalized = df.copy()
df_normalized[numeric_cols] = df_normalized[numeric_cols].apply(normalize_column)

print("\nAll columns normalized (0-1 range):")
print(df_normalized[['Name'] + numeric_cols])

Method 4: Row-wise apply (axis=1)

python
# Create a composite score using row-wise operation
def calculate_composite_score(row):
    """Calculate weighted score from multiple columns"""
    salary_weight = 0.3
    experience_weight = 0.4
    performance_weight = 0.3
    
    # Normalize within row context (using global min/max)
    salary_norm = (row['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min())
    exp_norm = (row['Experience'] - df['Experience'].min()) / (df['Experience'].max() - df['Experience'].min())
    perf_norm = (row['Performance'] - df['Performance'].min()) / (df['Performance'].max() - df['Performance'].min())
    
    return (salary_weight * salary_norm + 
            experience_weight * exp_norm + 
            performance_weight * perf_norm)

df['Composite_Score'] = df.apply(calculate_composite_score, axis=1)
print("\nComposite scores (row-wise apply):")
print(df[['Name', 'Salary', 'Experience', 'Performance', 'Composite_Score']])

Method 5: Vectorized operation (FASTER than apply)

python
# Vectorized normalization (no apply needed - much faster for large data)
min_sal = df['Salary'].min()
max_sal = df['Salary'].max()
df['Salary_Vectorized'] = (df['Salary'] - min_sal) / (max_sal - min_sal)

print("\nVectorized normalization (faster):")
print(df[['Name', 'Salary', 'Salary_Vectorized']])

Performance comparison

python
# Create large dataset for performance test
large_df = pd.DataFrame({'values': np.random.randn(100000)})

# Using apply
%timeit large_df['values'].apply(lambda x: (x - x.min()) / (x.max() - x.min()))

# Using vectorized
%timeit (large_df['values'] - large_df['values'].min()) / (large_df['values'].max() - large_df['values'].min())

# Vectorized is 10-100x faster!

2️⃣ Map categorical values to numeric codes

What is map()?

map() replaces values in a Series using a dictionary or function.

Method 1: Simple mapping with dictionary

python
# Map Department to numeric codes
dept_mapping = {
    'Sales': 1,
    'IT': 2,
    'HR': 3
}

df['Dept_Code'] = df['Department'].map(dept_mapping)

print("Department to Numeric Mapping:")
print(df[['Name', 'Department', 'Dept_Code']])

Method 2: Bidirectional mapping

python
# Create reverse mapping for later use
reverse_mapping = {v: k for k, v in dept_mapping.items()}

# Map performance ratings
performance_mapping = {
    85: 'Good',
    92: 'Excellent',
    78: 'Average',
    88: 'Good',
    95: 'Excellent'
}

df['Performance_Rating'] = df['Performance'].map(performance_mapping)

print("\nPerformance numeric to text:")
print(df[['Name', 'Performance', 'Performance_Rating']])

Method 3: Map with function for complex logic

python
# Map Age to age groups using function
def age_group(age):
    if age < 28:
        return 'Young'
    elif age < 33:
        return 'Mid'
    else:
        return 'Senior'

df['Age_Group'] = df['Age'].map(age_group)

print("\nAge groups:")
print(df[['Name', 'Age', 'Age_Group']])

Method 4: Using replace() for multiple mappings

python
# Replace multiple values at once
df['Salary_Bracket'] = df['Salary'].replace({
    50000: 'Entry',
    55000: 'Junior',
    60000: 'Mid',
    68000: 'Senior',
    75000: 'Expert'
})

print("\nSalary brackets:")
print(df[['Name', 'Salary', 'Salary_Bracket']])

Method 5: Create categorical codes automatically

python
# Convert to categorical and get codes
df['Department_Category'] = pd.Categorical(df['Department'])
df['Dept_Auto_Code'] = df['Department_Category'].cat.codes

print("\nAutomatic categorical coding:")
print(df[['Name', 'Department', 'Dept_Auto_Code']])

Method 6: Map multiple columns at once

python
# Create a more complex mapping example
df_extended = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'],
    'Quality': ['High', 'Medium', 'Low', 'High'],
    'Region': ['North', 'South', 'East', 'West']
})

# Multiple mappings
quality_score = {'High': 3, 'Medium': 2, 'Low': 1}
region_code = {'North': 1, 'South': 2, 'East': 3, 'West': 4}

df_extended['Quality_Score'] = df_extended['Quality'].map(quality_score)
df_extended['Region_Code'] = df_extended['Region'].map(region_code)

print("\nMultiple column mapping:")
print(df_extended)

3️⃣ Rename multiple columns using dictionary

Method 1: Basic column renaming

python
# Create a fresh copy
df_renamed = df.copy()

# Rename using dictionary
column_mapping = {
    'Name': 'Employee_Name',
    'Dept_Code': 'Department_Code',
    'Performance_Rating': 'Rating'
}

df_renamed = df_renamed.rename(columns=column_mapping)
print("Renamed columns:")
print(df_renamed.columns.tolist())
print("\nFirst few rows:")
print(df_renamed.head())

Method 2: Rename using function (str methods)

python
# Convert all column names to uppercase
df_upper = df.rename(columns=str.upper)
print("\nUppercase columns:")
print(df_upper.columns.tolist())

# Convert to lowercase
df_lower = df.rename(columns=str.lower)
print("\nLowercase columns:")
print(df_lower.columns.tolist())

Method 3: Rename with string replacement

python
# Replace spaces with underscores
df_spaces = pd.DataFrame({
    'First Name': ['Alice', 'Bob'],
    'Last Name': ['Smith', 'Jones'],
    'Annual Salary': [50000, 60000]
})

print("\nOriginal (with spaces):")
print(df_spaces.columns.tolist())

# Remove spaces
df_clean = df_spaces.rename(columns=lambda x: x.replace(' ', '_'))
print("Cleaned:")
print(df_clean.columns.tolist())

Method 4: Rename with pattern matching

python
# Add prefix to all columns
df_prefixed = df.rename(columns=lambda x: f'emp_{x}')
print("\nWith prefix:")
print(df_prefixed.columns.tolist())

# Remove prefix
df_unprefixed = df_prefixed.rename(columns=lambda x: x.replace('emp_', ''))
print("Prefix removed:")
print(df_unprefixed.columns.tolist())

Method 5: Selective renaming with condition

python
# Rename only columns with certain condition
def rename_if_long(col):
    if len(col) > 8:
        return col[:8]  # Truncate long names
    return col

df_truncated = df.rename(columns=rename_if_long)
print("\nTruncated long names:")
print(df_truncated.columns.tolist())

4️⃣ Convert wide format to long using melt

What is melt()?

melt() unpivots data from wide format (many columns) to long format (fewer columns, more rows). Essential for data analysis and visualization.

Create a wide-format dataset

python
# Create sales data in wide format
sales_wide = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard'],
    'Q1_Sales': [1000, 500, 300],
    'Q2_Sales': [1200, 550, 320],
    'Q3_Sales': [1100, 600, 350],
    'Q4_Sales': [1300, 580, 340]
})

print("Wide Format Data (one row per product):")
print(sales_wide)

Method 1: Basic melt (all columns except identifier)

python
# Melt from wide to long
sales_long = pd.melt(
    sales_wide,
    id_vars=['Product'],  # Columns to keep as identifiers
    var_name='Quarter',    # Name for the variable column
    value_name='Sales'     # Name for the value column
)

print("\nLong Format Data (one row per product-quarter):")
print(sales_long)

Method 2: Melt specific columns

python
# Melt only specific columns
sales_partial = pd.melt(
    sales_wide,
    id_vars=['Product'],
    value_vars=['Q1_Sales', 'Q2_Sales'],  # Only melt these
    var_name='Quarter',
    value_name='Sales'
)

print("\nMelt only Q1 and Q2:")
print(sales_partial)

Method 3: Real-world example - Student scores

python
# Student scores in wide format
scores_wide = pd.DataFrame({
    'Student': ['Alice', 'Bob', 'Charlie'],
    'Math_Score': [85, 92, 78],
    'Science_Score': [88, 89, 82],
    'English_Score': [90, 85, 88]
})

print("\nStudent scores (wide):")
print(scores_wide)

# Melt to long format for easier analysis
scores_long = pd.melt(
    scores_wide,
    id_vars=['Student'],
    var_name='Subject',
    value_name='Score'
)

# Clean up subject names
scores_long['Subject'] = scores_long['Subject'].str.replace('_Score', '')

print("\nStudent scores (long format - better for plotting):")
print(scores_long)

# Now easy to group by subject
print("\nAverage score by subject:")
print(scores_long.groupby('Subject')['Score'].mean())

Method 4: Multiple ID variables

python
# Multi-dimensional data
multi_wide = pd.DataFrame({
    'Year': [2020, 2021, 2022],
    'City': ['NYC', 'LA', 'Chicago'],
    'Product_A': [100, 110, 120],
    'Product_B': [200, 210, 220]
})

print("\nMulti-dimensional wide data:")
print(multi_wide)

# Melt with multiple ID vars
multi_long = pd.melt(
    multi_wide,
    id_vars=['Year', 'City'],
    var_name='Product',
    value_name='Sales'
)

print("\nMelted with multiple IDs:")
print(multi_long)

Method 5: Handling missing values in melt

python
# Data with missing values
wide_with_na = pd.DataFrame({
    'Product': ['A', 'B'],
    'Store1': [100, np.nan],
    'Store2': [150, 200],
    'Store3': [np.nan, 250]
})

print("\nWide with missing values:")
print(wide_with_na)

# Melt and then handle missing
long_with_na = pd.melt(
    wide_with_na,
    id_vars=['Product'],
    var_name='Store',
    value_name='Sales'
)

print("\nLong format with NAs:")
print(long_with_na)

# Remove NA rows
long_clean = long_with_na.dropna()
print("\nAfter removing NAs:")
print(long_clean)

5️⃣ Create pivot table from long format data

What is pivot_table()?

pivot_table() creates a spreadsheet-style pivot table from long-format data. It's the inverse of melt().

Method 1: Basic pivot table

python
# Use the long format data from previous example
print("Original long format:")
print(sales_long.head())

# Pivot back to wide format
sales_pivoted = sales_long.pivot_table(
    index='Product',      # Rows
    columns='Quarter',    # Columns
    values='Sales',       # Values to aggregate
    aggfunc='sum'         # Aggregation function
)

print("\nPivoted back to wide format:")
print(sales_pivoted)

Method 2: Student scores pivot table

python
# Using the scores_long data
print("Original scores_long:")
print(scores_long)

# Create pivot table
score_pivot = scores_long.pivot_table(
    index='Student',
    columns='Subject',
    values='Score'
)

print("\nPivot table - Student scores by subject:")
print(score_pivot)

# Add row and column totals
score_pivot['Average'] = score_pivot.mean(axis=1)
print("\nWith student averages:")
print(score_pivot)

Method 3: Real-world sales analysis

python
# Create realistic sales data
sales_data = pd.DataFrame({
    'Date': pd.date_range('2024-01-01', periods=12, freq='M'),
    'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West', 'North', 'South', 'East', 'West'],
    'Sales': [100, 150, 120, 130, 110, 140, 130, 160, 105, 125, 115, 145]
})

print("\nSales data (long format):")
print(sales_data.head(10))

# Pivot by Region and Product
sales_by_region_product = sales_data.pivot_table(
    index='Region',
    columns='Product',
    values='Sales',
    aggfunc='mean'  # Average sales
)

print("\nAverage sales by Region and Product:")
print(sales_by_region_product)

# Pivot with multiple aggregations
sales_multi_agg = sales_data.pivot_table(
    index='Region',
    columns='Product',
    values='Sales',
    aggfunc=['mean', 'sum', 'count']
)

print("\nMultiple aggregations:")
print(sales_multi_agg)

Method 4: Hierarchical pivot tables

python
# Create more complex data
complex_data = pd.DataFrame({
    'Year': [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021],
    'Quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'],
    'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'],
    'Revenue': [100, 110, 120, 130, 140, 150, 160, 170]
})

print("\nComplex hierarchical data:")
print(complex_data)

# Multi-level pivot
complex_pivot = complex_data.pivot_table(
    index=['Year', 'Quarter'],
    columns=['Product', 'Region'],
    values='Revenue',
    aggfunc='sum'
)

print("\nHierarchical pivot table:")
print(complex_pivot)

Method 5: Pivot with margins (totals)

python
# Add total rows and columns
pivot_with_margins = sales_data.pivot_table(
    index='Region',
    columns='Product',
    values='Sales',
    aggfunc='sum',
    margins=True,           # Add totals
    margins_name='Total'    # Name for total row/col
)

print("\nPivot table with totals:")
print(pivot_with_margins)

Method 6: Fill missing values in pivot

python
# Data with missing combinations
incomplete_data = pd.DataFrame({
    'Product': ['A', 'A', 'B', 'C', 'C'],
    'Store': ['X', 'Y', 'X', 'X', 'Z'],
    'Sales': [100, 150, 200, 120, 180]
})

print("\nIncomplete data:")
print(incomplete_data)

# Pivot with fill value
pivot_filled = incomplete_data.pivot_table(
    index='Product',
    columns='Store',
    values='Sales',
    aggfunc='sum',
    fill_value=0  # Fill missing with 0
)

print("\nPivot with missing filled:")
print(pivot_filled)

🎯 Complete Hands-on Checklist

python
import pandas as pd
import numpy as np

# Create comprehensive dataset
df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT'],
    'Base_Salary': [50000, 60000, 75000, 55000, 68000],
    'Bonus': [5000, 6000, 7500, 5500, 6800],
    'Experience': [2, 5, 8, 3, 6],
    'Performance': ['A', 'B', 'C', 'B', 'A']
})

print("="*60)
print("1. APPLY - Normalize Base_Salary to 0-1")
print("="*60)
min_sal = df['Base_Salary'].min()
max_sal = df['Base_Salary'].max()
df['Salary_Norm'] = df['Base_Salary'].apply(lambda x: (x - min_sal) / (max_sal - min_sal))
print(df[['Employee', 'Base_Salary', 'Salary_Norm']])

print("\n" + "="*60)
print("2. MAP - Convert Performance to numeric codes")
print("="*60)
perf_mapping = {'A': 5, 'B': 3, 'C': 1}
df['Performance_Code'] = df['Performance'].map(perf_mapping)
print(df[['Employee', 'Performance', 'Performance_Code']])

print("\n" + "="*60)
print("3. RENAME - Multiple columns")
print("="*60)
df_renamed = df.rename(columns={
    'Employee': 'Emp_Name',
    'Base_Salary': 'Salary',
    'Performance_Code': 'Perf_Score'
})
print("New column names:", df_renamed.columns.tolist())

print("\n" + "="*60)
print("4. MELT - Convert wide to long")
print("="*60)
salary_wide = df[['Employee', 'Base_Salary', 'Bonus']]
salary_long = pd.melt(
    salary_wide,
    id_vars=['Employee'],
    var_name='Compensation_Type',
    value_name='Amount'
)
print(salary_long)

print("\n" + "="*60)
print("5. PIVOT_TABLE - Create summary")
print("="*60)
pivot_summary = df.pivot_table(
    index='Department',
    columns='Performance',
    values='Base_Salary',
    aggfunc='mean'
)
print(pivot_summary)

❌ Common Mistakes & Solutions

MistakeWhy it failsCorrect way
Using apply() when vectorized worksVery slow for large dataUse vectorized operations when possible
Forgetting axis=1 for row operationsApplies to columns by defaultdf.apply(func, axis=1) for rows
map() returns NaN for unmapped valuesMissing keys return NaNUse fillna() after mapping or ensure all keys mapped
melt() without id_varsMelts everythingSpecify which columns to keep as identifiers
Pivot without aggfuncError with duplicate entriesAlways specify aggfunc (usually 'sum' or 'mean')

📝 Practice Exercises

  1. Create a function to normalize Age to 0-1 range using apply()

  2. Map Education levels ['High School', 'Bachelor', 'Master', 'PhD'] to [1,2,3,4]

  3. Rename columns to snake_case from CamelCase

  4. Convert this wide data to long:

    python
    data = {'Student': ['A','B'], 'Math':[90,85], 'Science':[88,92]}
  5. Create a pivot table showing average sales by Region and Product category


✅ Summary of Day 4

OperationMethodKey Parameters
Apply functiondf.apply(func)axis=0/1
Map valuesseries.map(dict)Dictionary mapping
Rename columnsdf.rename(columns=dict)columns, inplace
Wide to longpd.melt()id_vars, var_name, value_name
Long to widedf.pivot_table()index, columns, values, aggfunc

🎓 Pro Tips

  1. Prefer vectorized operations over apply() for simple math (10-100x faster)

  2. Chain transformations for clean code: df.rename().pipe(another_func)

  3. Use pipe() for complex transformation pipelines

  4. Always verify your melt/pivot operations with small test data first

  5. pd.cut() is better than custom mapping for binning continuous data

Discussion (0)
Login to comment
Dictionary

Add New Word

Dictionary Words
My Notes
Highlights
Select text and click highlight to save
My Vocabulary
Quick Quiz
Settings
Reading Analytics
Today's reading: 0 min
Total read time: 0 min
Words learned: 0
Streak: 0 days
AI Summary

Generating summary...