📌 Day 4: Data Transformation - Apply, Map & Reshape
Prerequisite Setup
import pandas as pd import numpy as np # Create sample datasets for all examples np.random.seed(42) # Main dataset for apply() and map() df = pd.DataFrame({ 'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], 'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT'], 'Salary': [50000, 60000, 75000, 55000, 68000], 'Age': [25, 30, 35, 28, 32], 'Experience': [2, 5, 8, 3, 6], 'Performance': [85, 92, 78, 88, 95] }) print("Original DataFrame:") print(df) print("\n" + "="*60)
1️⃣ Use apply() to normalize a column to 0-1 range
What is apply()?
apply() applies a function along an axis of the DataFrame (rows or columns). It's extremely flexible for custom transformations.
Understanding Normalization (Min-Max Scaling)
Formula: (x - min) / (max - min) → converts any range to [0, 1]
Method 1: Normalize single column with apply()
# Normalize Salary to 0-1 range def normalize_series(x): """Normalize a single value to 0-1 range""" min_val = df['Salary'].min() max_val = df['Salary'].max() return (x - min_val) / (max_val - min_val) # Apply to Salary column df['Salary_Normalized'] = df['Salary'].apply(normalize_series) print("Normalized Salary (0-1 range):") print(df[['Name', 'Salary', 'Salary_Normalized']])
Output:
Name Salary Salary_Normalized 0 Alice 50000 0.0 1 Bob 60000 0.4 2 Charlie 75000 1.0 3 Diana 55000 0.2 4 Eve 68000 0.72
Method 2: More concise with lambda function
# Using lambda (one-liner) min_sal = df['Salary'].min() max_sal = df['Salary'].max() df['Salary_Norm_Lambda'] = df['Salary'].apply(lambda x: (x - min_sal) / (max_sal - min_sal)) print("\nWith lambda function:") print(df[['Name', 'Salary', 'Salary_Norm_Lambda']])
Method 3: Normalize multiple columns at once
# Normalize all numeric columns def normalize_column(column): return (column - column.min()) / (column.max() - column.min()) # Apply to multiple columns numeric_cols = ['Salary', 'Age', 'Experience', 'Performance'] df_normalized = df.copy() df_normalized[numeric_cols] = df_normalized[numeric_cols].apply(normalize_column) print("\nAll columns normalized (0-1 range):") print(df_normalized[['Name'] + numeric_cols])
Method 4: Row-wise apply (axis=1)
# Create a composite score using row-wise operation def calculate_composite_score(row): """Calculate weighted score from multiple columns""" salary_weight = 0.3 experience_weight = 0.4 performance_weight = 0.3 # Normalize within row context (using global min/max) salary_norm = (row['Salary'] - df['Salary'].min()) / (df['Salary'].max() - df['Salary'].min()) exp_norm = (row['Experience'] - df['Experience'].min()) / (df['Experience'].max() - df['Experience'].min()) perf_norm = (row['Performance'] - df['Performance'].min()) / (df['Performance'].max() - df['Performance'].min()) return (salary_weight * salary_norm + experience_weight * exp_norm + performance_weight * perf_norm) df['Composite_Score'] = df.apply(calculate_composite_score, axis=1) print("\nComposite scores (row-wise apply):") print(df[['Name', 'Salary', 'Experience', 'Performance', 'Composite_Score']])
Method 5: Vectorized operation (FASTER than apply)
# Vectorized normalization (no apply needed - much faster for large data) min_sal = df['Salary'].min() max_sal = df['Salary'].max() df['Salary_Vectorized'] = (df['Salary'] - min_sal) / (max_sal - min_sal) print("\nVectorized normalization (faster):") print(df[['Name', 'Salary', 'Salary_Vectorized']])
Performance comparison
# Create large dataset for performance test large_df = pd.DataFrame({'values': np.random.randn(100000)}) # Using apply %timeit large_df['values'].apply(lambda x: (x - x.min()) / (x.max() - x.min())) # Using vectorized %timeit (large_df['values'] - large_df['values'].min()) / (large_df['values'].max() - large_df['values'].min()) # Vectorized is 10-100x faster!
2️⃣ Map categorical values to numeric codes
What is map()?
map() replaces values in a Series using a dictionary or function.
Method 1: Simple mapping with dictionary
# Map Department to numeric codes dept_mapping = { 'Sales': 1, 'IT': 2, 'HR': 3 } df['Dept_Code'] = df['Department'].map(dept_mapping) print("Department to Numeric Mapping:") print(df[['Name', 'Department', 'Dept_Code']])
Method 2: Bidirectional mapping
# Create reverse mapping for later use reverse_mapping = {v: k for k, v in dept_mapping.items()} # Map performance ratings performance_mapping = { 85: 'Good', 92: 'Excellent', 78: 'Average', 88: 'Good', 95: 'Excellent' } df['Performance_Rating'] = df['Performance'].map(performance_mapping) print("\nPerformance numeric to text:") print(df[['Name', 'Performance', 'Performance_Rating']])
Method 3: Map with function for complex logic
# Map Age to age groups using function def age_group(age): if age < 28: return 'Young' elif age < 33: return 'Mid' else: return 'Senior' df['Age_Group'] = df['Age'].map(age_group) print("\nAge groups:") print(df[['Name', 'Age', 'Age_Group']])
Method 4: Using replace() for multiple mappings
# Replace multiple values at once df['Salary_Bracket'] = df['Salary'].replace({ 50000: 'Entry', 55000: 'Junior', 60000: 'Mid', 68000: 'Senior', 75000: 'Expert' }) print("\nSalary brackets:") print(df[['Name', 'Salary', 'Salary_Bracket']])
Method 5: Create categorical codes automatically
# Convert to categorical and get codes df['Department_Category'] = pd.Categorical(df['Department']) df['Dept_Auto_Code'] = df['Department_Category'].cat.codes print("\nAutomatic categorical coding:") print(df[['Name', 'Department', 'Dept_Auto_Code']])
Method 6: Map multiple columns at once
# Create a more complex mapping example df_extended = pd.DataFrame({ 'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor'], 'Quality': ['High', 'Medium', 'Low', 'High'], 'Region': ['North', 'South', 'East', 'West'] }) # Multiple mappings quality_score = {'High': 3, 'Medium': 2, 'Low': 1} region_code = {'North': 1, 'South': 2, 'East': 3, 'West': 4} df_extended['Quality_Score'] = df_extended['Quality'].map(quality_score) df_extended['Region_Code'] = df_extended['Region'].map(region_code) print("\nMultiple column mapping:") print(df_extended)
3️⃣ Rename multiple columns using dictionary
Method 1: Basic column renaming
# Create a fresh copy df_renamed = df.copy() # Rename using dictionary column_mapping = { 'Name': 'Employee_Name', 'Dept_Code': 'Department_Code', 'Performance_Rating': 'Rating' } df_renamed = df_renamed.rename(columns=column_mapping) print("Renamed columns:") print(df_renamed.columns.tolist()) print("\nFirst few rows:") print(df_renamed.head())
Method 2: Rename using function (str methods)
# Convert all column names to uppercase df_upper = df.rename(columns=str.upper) print("\nUppercase columns:") print(df_upper.columns.tolist()) # Convert to lowercase df_lower = df.rename(columns=str.lower) print("\nLowercase columns:") print(df_lower.columns.tolist())
Method 3: Rename with string replacement
# Replace spaces with underscores df_spaces = pd.DataFrame({ 'First Name': ['Alice', 'Bob'], 'Last Name': ['Smith', 'Jones'], 'Annual Salary': [50000, 60000] }) print("\nOriginal (with spaces):") print(df_spaces.columns.tolist()) # Remove spaces df_clean = df_spaces.rename(columns=lambda x: x.replace(' ', '_')) print("Cleaned:") print(df_clean.columns.tolist())
Method 4: Rename with pattern matching
# Add prefix to all columns df_prefixed = df.rename(columns=lambda x: f'emp_{x}') print("\nWith prefix:") print(df_prefixed.columns.tolist()) # Remove prefix df_unprefixed = df_prefixed.rename(columns=lambda x: x.replace('emp_', '')) print("Prefix removed:") print(df_unprefixed.columns.tolist())
Method 5: Selective renaming with condition
# Rename only columns with certain condition def rename_if_long(col): if len(col) > 8: return col[:8] # Truncate long names return col df_truncated = df.rename(columns=rename_if_long) print("\nTruncated long names:") print(df_truncated.columns.tolist())
4️⃣ Convert wide format to long using melt
What is melt()?
melt() unpivots data from wide format (many columns) to long format (fewer columns, more rows). Essential for data analysis and visualization.
Create a wide-format dataset
# Create sales data in wide format sales_wide = pd.DataFrame({ 'Product': ['Laptop', 'Mouse', 'Keyboard'], 'Q1_Sales': [1000, 500, 300], 'Q2_Sales': [1200, 550, 320], 'Q3_Sales': [1100, 600, 350], 'Q4_Sales': [1300, 580, 340] }) print("Wide Format Data (one row per product):") print(sales_wide)
Method 1: Basic melt (all columns except identifier)
# Melt from wide to long sales_long = pd.melt( sales_wide, id_vars=['Product'], # Columns to keep as identifiers var_name='Quarter', # Name for the variable column value_name='Sales' # Name for the value column ) print("\nLong Format Data (one row per product-quarter):") print(sales_long)
Method 2: Melt specific columns
# Melt only specific columns sales_partial = pd.melt( sales_wide, id_vars=['Product'], value_vars=['Q1_Sales', 'Q2_Sales'], # Only melt these var_name='Quarter', value_name='Sales' ) print("\nMelt only Q1 and Q2:") print(sales_partial)
Method 3: Real-world example - Student scores
# Student scores in wide format scores_wide = pd.DataFrame({ 'Student': ['Alice', 'Bob', 'Charlie'], 'Math_Score': [85, 92, 78], 'Science_Score': [88, 89, 82], 'English_Score': [90, 85, 88] }) print("\nStudent scores (wide):") print(scores_wide) # Melt to long format for easier analysis scores_long = pd.melt( scores_wide, id_vars=['Student'], var_name='Subject', value_name='Score' ) # Clean up subject names scores_long['Subject'] = scores_long['Subject'].str.replace('_Score', '') print("\nStudent scores (long format - better for plotting):") print(scores_long) # Now easy to group by subject print("\nAverage score by subject:") print(scores_long.groupby('Subject')['Score'].mean())
Method 4: Multiple ID variables
# Multi-dimensional data multi_wide = pd.DataFrame({ 'Year': [2020, 2021, 2022], 'City': ['NYC', 'LA', 'Chicago'], 'Product_A': [100, 110, 120], 'Product_B': [200, 210, 220] }) print("\nMulti-dimensional wide data:") print(multi_wide) # Melt with multiple ID vars multi_long = pd.melt( multi_wide, id_vars=['Year', 'City'], var_name='Product', value_name='Sales' ) print("\nMelted with multiple IDs:") print(multi_long)
Method 5: Handling missing values in melt
# Data with missing values wide_with_na = pd.DataFrame({ 'Product': ['A', 'B'], 'Store1': [100, np.nan], 'Store2': [150, 200], 'Store3': [np.nan, 250] }) print("\nWide with missing values:") print(wide_with_na) # Melt and then handle missing long_with_na = pd.melt( wide_with_na, id_vars=['Product'], var_name='Store', value_name='Sales' ) print("\nLong format with NAs:") print(long_with_na) # Remove NA rows long_clean = long_with_na.dropna() print("\nAfter removing NAs:") print(long_clean)
5️⃣ Create pivot table from long format data
What is pivot_table()?
pivot_table() creates a spreadsheet-style pivot table from long-format data. It's the inverse of melt().
Method 1: Basic pivot table
# Use the long format data from previous example print("Original long format:") print(sales_long.head()) # Pivot back to wide format sales_pivoted = sales_long.pivot_table( index='Product', # Rows columns='Quarter', # Columns values='Sales', # Values to aggregate aggfunc='sum' # Aggregation function ) print("\nPivoted back to wide format:") print(sales_pivoted)
Method 2: Student scores pivot table
# Using the scores_long data print("Original scores_long:") print(scores_long) # Create pivot table score_pivot = scores_long.pivot_table( index='Student', columns='Subject', values='Score' ) print("\nPivot table - Student scores by subject:") print(score_pivot) # Add row and column totals score_pivot['Average'] = score_pivot.mean(axis=1) print("\nWith student averages:") print(score_pivot)
Method 3: Real-world sales analysis
# Create realistic sales data sales_data = pd.DataFrame({ 'Date': pd.date_range('2024-01-01', periods=12, freq='M'), 'Product': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West', 'North', 'South', 'East', 'West'], 'Sales': [100, 150, 120, 130, 110, 140, 130, 160, 105, 125, 115, 145] }) print("\nSales data (long format):") print(sales_data.head(10)) # Pivot by Region and Product sales_by_region_product = sales_data.pivot_table( index='Region', columns='Product', values='Sales', aggfunc='mean' # Average sales ) print("\nAverage sales by Region and Product:") print(sales_by_region_product) # Pivot with multiple aggregations sales_multi_agg = sales_data.pivot_table( index='Region', columns='Product', values='Sales', aggfunc=['mean', 'sum', 'count'] ) print("\nMultiple aggregations:") print(sales_multi_agg)
Method 4: Hierarchical pivot tables
# Create more complex data complex_data = pd.DataFrame({ 'Year': [2020, 2020, 2021, 2021, 2020, 2020, 2021, 2021], 'Quarter': ['Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2', 'Q1', 'Q2'], 'Product': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'], 'Region': ['North', 'North', 'South', 'South', 'East', 'East', 'West', 'West'], 'Revenue': [100, 110, 120, 130, 140, 150, 160, 170] }) print("\nComplex hierarchical data:") print(complex_data) # Multi-level pivot complex_pivot = complex_data.pivot_table( index=['Year', 'Quarter'], columns=['Product', 'Region'], values='Revenue', aggfunc='sum' ) print("\nHierarchical pivot table:") print(complex_pivot)
Method 5: Pivot with margins (totals)
# Add total rows and columns pivot_with_margins = sales_data.pivot_table( index='Region', columns='Product', values='Sales', aggfunc='sum', margins=True, # Add totals margins_name='Total' # Name for total row/col ) print("\nPivot table with totals:") print(pivot_with_margins)
Method 6: Fill missing values in pivot
# Data with missing combinations incomplete_data = pd.DataFrame({ 'Product': ['A', 'A', 'B', 'C', 'C'], 'Store': ['X', 'Y', 'X', 'X', 'Z'], 'Sales': [100, 150, 200, 120, 180] }) print("\nIncomplete data:") print(incomplete_data) # Pivot with fill value pivot_filled = incomplete_data.pivot_table( index='Product', columns='Store', values='Sales', aggfunc='sum', fill_value=0 # Fill missing with 0 ) print("\nPivot with missing filled:") print(pivot_filled)
🎯 Complete Hands-on Checklist
import pandas as pd import numpy as np # Create comprehensive dataset df = pd.DataFrame({ 'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], 'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT'], 'Base_Salary': [50000, 60000, 75000, 55000, 68000], 'Bonus': [5000, 6000, 7500, 5500, 6800], 'Experience': [2, 5, 8, 3, 6], 'Performance': ['A', 'B', 'C', 'B', 'A'] }) print("="*60) print("1. APPLY - Normalize Base_Salary to 0-1") print("="*60) min_sal = df['Base_Salary'].min() max_sal = df['Base_Salary'].max() df['Salary_Norm'] = df['Base_Salary'].apply(lambda x: (x - min_sal) / (max_sal - min_sal)) print(df[['Employee', 'Base_Salary', 'Salary_Norm']]) print("\n" + "="*60) print("2. MAP - Convert Performance to numeric codes") print("="*60) perf_mapping = {'A': 5, 'B': 3, 'C': 1} df['Performance_Code'] = df['Performance'].map(perf_mapping) print(df[['Employee', 'Performance', 'Performance_Code']]) print("\n" + "="*60) print("3. RENAME - Multiple columns") print("="*60) df_renamed = df.rename(columns={ 'Employee': 'Emp_Name', 'Base_Salary': 'Salary', 'Performance_Code': 'Perf_Score' }) print("New column names:", df_renamed.columns.tolist()) print("\n" + "="*60) print("4. MELT - Convert wide to long") print("="*60) salary_wide = df[['Employee', 'Base_Salary', 'Bonus']] salary_long = pd.melt( salary_wide, id_vars=['Employee'], var_name='Compensation_Type', value_name='Amount' ) print(salary_long) print("\n" + "="*60) print("5. PIVOT_TABLE - Create summary") print("="*60) pivot_summary = df.pivot_table( index='Department', columns='Performance', values='Base_Salary', aggfunc='mean' ) print(pivot_summary)
❌ Common Mistakes & Solutions
| Mistake | Why it fails | Correct way |
|---|---|---|
Using apply() when vectorized works | Very slow for large data | Use vectorized operations when possible |
Forgetting axis=1 for row operations | Applies to columns by default | df.apply(func, axis=1) for rows |
map() returns NaN for unmapped values | Missing keys return NaN | Use fillna() after mapping or ensure all keys mapped |
melt() without id_vars | Melts everything | Specify which columns to keep as identifiers |
Pivot without aggfunc | Error with duplicate entries | Always specify aggfunc (usually 'sum' or 'mean') |
📝 Practice Exercises
Create a function to normalize Age to 0-1 range using
apply()Map Education levels ['High School', 'Bachelor', 'Master', 'PhD'] to [1,2,3,4]
Rename columns to snake_case from CamelCase
Convert this wide data to long:
pythondata = {'Student': ['A','B'], 'Math':[90,85], 'Science':[88,92]}
Create a pivot table showing average sales by Region and Product category
✅ Summary of Day 4
| Operation | Method | Key Parameters |
|---|---|---|
| Apply function | df.apply(func) | axis=0/1 |
| Map values | series.map(dict) | Dictionary mapping |
| Rename columns | df.rename(columns=dict) | columns, inplace |
| Wide to long | pd.melt() | id_vars, var_name, value_name |
| Long to wide | df.pivot_table() | index, columns, values, aggfunc |
🎓 Pro Tips
Prefer vectorized operations over
apply()for simple math (10-100x faster)Chain transformations for clean code:
df.rename().pipe(another_func)Use
pipe()for complex transformation pipelinesAlways verify your melt/pivot operations with small test data first
pd.cut()is better than custom mapping for binning continuous data
Discussion (0)