π Day 6: Merging, Joining & Concatenation
Prerequisite Setup
import pandas as pd import numpy as np # Create multiple datasets for different scenarios np.random.seed(42) # Dataset 1: Employee information employees = pd.DataFrame({ 'emp_id': [1, 2, 3, 4, 5, 6], 'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'], 'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Marketing'], 'salary': [50000, 60000, 75000, 55000, 68000, 62000] }) # Dataset 2: Employee contact info contact = pd.DataFrame({ 'emp_id': [1, 2, 3, 4, 5, 7], # Note: emp_id 7 doesn't exist in employees 'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'diana@email.com', 'eve@email.com', 'frank@email.com'], 'phone': ['123-4567', '234-5678', '345-6789', '456-7890', '567-8901', '678-9012'] }) # Dataset 3: Performance scores (different key name) performance = pd.DataFrame({ 'employee_id': [1, 2, 3, 4, 5, 6], 'score': [85, 92, 78, 88, 95, 82], 'rating': ['B', 'A', 'C', 'B', 'A', 'B'] }) # Dataset 4: Additional employees for concatenation new_employees = pd.DataFrame({ 'emp_id': [7, 8], 'name': ['Grace', 'Henry'], 'department': ['HR', 'IT'], 'salary': [59000, 71000] }) # Dataset 5: Project assignments (multiple keys) projects = pd.DataFrame({ 'emp_id': [1, 1, 2, 3, 4, 5, 5, 6], 'project': ['Alpha', 'Beta', 'Gamma', 'Alpha', 'Delta', 'Gamma', 'Epsilon', 'Beta'], 'hours': [40, 10, 35, 45, 30, 25, 15, 40] }) print("Employees Dataset:") print(employees) print("\n" + "="*60) print("Contact Dataset:") print(contact) print("\n" + "="*60) print("Performance Dataset:") print(performance) print("\n" + "="*60) print("New Employees Dataset:") print(new_employees) print("\n" + "="*60) print("Projects Dataset:") print(projects.head()) print("\n" + "="*60)
1οΈβ£ Vertically stack two DataFrames with concat
What is vertical concatenation?
Stacking rows from multiple DataFrames on top of each other (like SQL UNION ALL).
Basic vertical concatenation
# Stack employees and new_employees vertically all_employees = pd.concat([employees, new_employees], ignore_index=True) print("All Employees (vertically stacked):") print(all_employees) print(f"\nOriginal rows: {len(employees)} + {len(new_employees)} = {len(all_employees)}")
Handling different column names
# Create DataFrames with different columns df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}) df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]}) # Note: different columns print("\nDataFrame 1:") print(df1) print("\nDataFrame 2:") print(df2) # Default concat (fills missing with NaN) combined_default = pd.concat([df1, df2], ignore_index=True) print("\nDefault concat (fills missing with NaN):") print(combined_default) # Use join='inner' to keep only common columns combined_inner = pd.concat([df1, df2], ignore_index=True, join='inner') print("\nInner join (only common columns):") print(combined_inner)
Adding keys to identify source
# Add keys to know which DataFrame each row came from employees_with_key = pd.concat( [employees, new_employees], keys=['Original', 'New'], names=['Source'] ) print("\nConcatenation with keys:") print(employees_with_key) # Reset index to make it cleaner employees_with_key_reset = employees_with_key.reset_index(level=0) print("\nWith reset index:") print(employees_with_key_reset)
Handling index conflicts
# Create DataFrames with overlapping indices df_a = pd.DataFrame({'X': [1, 2]}, index=['a', 'b']) df_b = pd.DataFrame({'X': [3, 4]}, index=['a', 'c']) print("\nDataFrame A:") print(df_a) print("\nDataFrame B:") print(df_b) # Default - keeps all indices (may create duplicates) concat_default = pd.concat([df_a, df_b]) print("\nDefault concat (allows duplicate indices):") print(concat_default) # Verify if index is unique concat_verify = pd.concat([df_a, df_b], verify_integrity=True) # This would raise an error if indices overlap print("\nWith verify_integrity=True (would error on overlap):") # Uncomment to see error: pd.concat([df_a, df_b], verify_integrity=True)
Real-world example: Monthly sales data
# Sales data for different months jan_sales = pd.DataFrame({ 'date': pd.date_range('2024-01-01', periods=5, freq='D'), 'product': ['A', 'B', 'A', 'C', 'B'], 'sales': [100, 150, 120, 130, 110] }) feb_sales = pd.DataFrame({ 'date': pd.date_range('2024-02-01', periods=5, freq='D'), 'product': ['A', 'B', 'C', 'A', 'B'], 'sales': [110, 140, 125, 115, 135] }) # Stack monthly data all_sales = pd.concat([jan_sales, feb_sales], ignore_index=True) print("\nAll Sales Data (Jan + Feb):") print(all_sales) # Add month column using keys all_sales_with_month = pd.concat( [jan_sales, feb_sales], keys=['January', 'February'], names=['Month'] ).reset_index(level=0) print("\nSales with month identifier:") print(all_sales_with_month.head())
2οΈβ£ Horizontally concatenate along columns
Basic horizontal concatenation
# Create two DataFrames with same number of rows df_left = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']}) df_right = pd.DataFrame({'age': [25, 30, 35], 'city': ['NYC', 'LA', 'Chicago']}) print("Left DataFrame:") print(df_left) print("\nRight DataFrame:") print(df_right) # Horizontal concatenation (axis=1) combined_horizontal = pd.concat([df_left, df_right], axis=1) print("\nHorizontally concatenated:") print(combined_horizontal)
Handling different row counts
# DataFrames with different number of rows df_a = pd.DataFrame({'A': [1, 2, 3]}) df_b = pd.DataFrame({'B': [4, 5]}) # Only 2 rows print("\nDataFrame A (3 rows):") print(df_a) print("\nDataFrame B (2 rows):") print(df_b) # Default (outer join) - fills missing with NaN concat_horiz_outer = pd.concat([df_a, df_b], axis=1) print("\nHorizontal concat (outer join - default):") print(concat_horiz_outer) # Inner join - only keeps common rows concat_horiz_inner = pd.concat([df_a, df_b], axis=1, join='inner') print("\nHorizontal concat (inner join):") print(concat_horiz_inner)
Aligning by index
# Create DataFrames with meaningful indices sales_q1 = pd.DataFrame({ 'revenue': [1000, 1200, 1100], 'cost': [600, 700, 650] }, index=['Product A', 'Product B', 'Product C']) sales_q2 = pd.DataFrame({ 'revenue': [1150, 1300, 1050], 'cost': [650, 750, 620] }, index=['Product A', 'Product B', 'Product D']) # Different product print("\nQ1 Sales:") print(sales_q1) print("\nQ2 Sales:") print(sales_q2) # Horizontal concat aligns by index combined_quarters = pd.concat([sales_q1, sales_q2], axis=1, keys=['Q1', 'Q2']) print("\nCombined Quarters (aligned by product):") print(combined_quarters) # Calculate growth where data exists combined_quarters['Growth'] = ( combined_quarters[('Q2', 'revenue')] - combined_quarters[('Q1', 'revenue')] ) / combined_quarters[('Q1', 'revenue')] * 100 print("\nWith growth calculation:") print(combined_quarters)
Real-world: Combining features and target
# Features DataFrame features = pd.DataFrame({ 'customer_id': [1, 2, 3, 4, 5], 'age': [25, 30, 35, 28, 32], 'income': [50000, 60000, 75000, 55000, 68000], 'purchases': [10, 15, 8, 12, 20] }) # Target variable (what we want to predict) target = pd.DataFrame({ 'customer_id': [1, 2, 3, 4, 5], 'churned': [0, 1, 0, 0, 1] # 1 = churned, 0 = stayed }) print("\nFeatures (X):") print(features) print("\nTarget (y):") print(target) # Combine for modeling (horizontal concat) model_data = pd.concat([features, target[['churned']]], axis=1) print("\nCombined for modeling:") print(model_data)
3οΈβ£ Perform inner join on common key column
What is an Inner Join?
Returns only rows where keys match in BOTH DataFrames.
Basic inner join with merge()
# Inner join employees and contact on emp_id inner_join_result = pd.merge(employees, contact, on='emp_id', how='inner') print("Inner Join (employees β© contact):") print(inner_join_result) print(f"\nRows in result: {len(inner_join_result)}") print(f"Note: emp_id 6 and 7 are missing (no match in both)")
Inner join with different column names
# Join employees (emp_id) with performance (employee_id) inner_diff_keys = pd.merge( employees, performance, left_on='emp_id', right_on='employee_id', how='inner' ) print("\nInner join with different key names:") print(inner_diff_keys) # Drop the duplicate key column inner_diff_keys_clean = inner_diff_keys.drop('employee_id', axis=1) print("\nCleaned (removed duplicate key):") print(inner_diff_keys_clean)
Inner join with multiple matching keys
# Create department managers dataset dept_managers = pd.DataFrame({ 'department': ['Sales', 'IT', 'HR', 'Marketing'], 'manager': ['John', 'Sarah', 'Mike', 'Lisa'], 'location': ['Floor 1', 'Floor 2', 'Floor 1', 'Floor 3'] }) print("\nDepartment Managers:") print(dept_managers) # Inner join employees with managers on department emp_with_manager = pd.merge(employees, dept_managers, on='department', how='inner') print("\nEmployees with managers (inner join):") print(emp_with_manager)
Visualizing joins with small data
# Simple example to understand inner join df1 = pd.DataFrame({ 'key': ['A', 'B', 'C', 'D'], 'value1': [1, 2, 3, 4] }) df2 = pd.DataFrame({ 'key': ['B', 'C', 'E', 'F'], 'value2': [5, 6, 7, 8] }) print("\nDataFrame 1:") print(df1) print("\nDataFrame 2:") print(df2) # Inner join (only B and C appear in both) inner_demo = pd.merge(df1, df2, on='key', how='inner') print("\nInner Join Result (only matching keys B & C):") print(inner_demo)
4οΈβ£ Do left join and observe missing values
What is a Left Join?
Keeps ALL rows from left DataFrame, matches from right where available (fills NaN for missing).
Basic left join
# Left join employees (left) with contact (right) left_join_result = pd.merge(employees, contact, on='emp_id', how='left') print("Left Join (all employees, contact info where available):") print(left_join_result) print(f"\nMissing emails: {left_join_result['email'].isnull().sum()}")
Observing missing values
# Identify which employees have missing contact info missing_contact = left_join_result[left_join_result['email'].isnull()] print("\nEmployees missing contact info:") print(missing_contact[['emp_id', 'name', 'department']]) # Count missing values per column print(f"\nMissing values summary:") print(left_join_result[['email', 'phone']].isnull().sum())
Right join (opposite of left join)
# Right join (all contact records, employee info where available) right_join_result = pd.merge(employees, contact, on='emp_id', how='right') print("\nRight Join (all contact records):") print(right_join_result) # Note: emp_id 7 has no employee info
Left join with different column names
# Left join employees with performance (different column names) left_join_diff = pd.merge( employees, performance, left_on='emp_id', right_on='employee_id', how='left' ) print("\nLeft join with different key names:") print(left_join_diff) # Notice employee_id column exists, can drop it left_join_clean = left_join_diff.drop('employee_id', axis=1) print("\nCleaned left join:") print(left_join_clean)
Handling missing values after left join
# Fill missing performance scores with default value left_join_clean['score'] = left_join_clean['score'].fillna(0) left_join_clean['rating'] = left_join_clean['rating'].fillna('No Rating') print("\nAfter filling missing values:") print(left_join_clean[['name', 'score', 'rating']]) # Or flag missing values left_join_clean['has_performance'] = ~left_join_clean['score'].isnull()
5οΈβ£ Merge on multiple columns with different names
Merging on multiple columns
# Create datasets with composite keys sales_targets = pd.DataFrame({ 'year': [2023, 2023, 2024, 2024], 'quarter': [1, 2, 1, 2], 'target': [100000, 110000, 120000, 130000] }) actual_sales = pd.DataFrame({ 'year': [2023, 2023, 2024, 2024, 2024], 'quarter': [1, 2, 1, 2, 2], 'region': ['North', 'South', 'East', 'West', 'North'], 'sales': [95000, 105000, 125000, 128000, 102000] }) print("Sales Targets (multiple keys):") print(sales_targets) print("\nActual Sales:") print(actual_sales) # Merge on multiple columns (year and quarter) sales_comparison = pd.merge( actual_sales, sales_targets, on=['year', 'quarter'], how='left' ) print("\nSales vs Targets (merged on year & quarter):") print(sales_comparison) # Calculate performance sales_comparison['performance_vs_target'] = ( (sales_comparison['sales'] - sales_comparison['target']) / sales_comparison['target'] * 100 ).round(2) print("\nWith performance calculation:") print(sales_comparison)
Merging with different column names on multiple keys
# Dataset with different column names employee_details = pd.DataFrame({ 'emp_id': [1, 2, 3, 4, 5], 'emp_name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], 'dept_code': ['SAL', 'IT', 'SAL', 'HR', 'IT'] }) department_info = pd.DataFrame({ 'dept_name': ['Sales', 'IT', 'HR', 'Marketing'], 'dept_code': ['SAL', 'IT', 'HR', 'MKT'], 'floor': [1, 2, 1, 3], 'manager': ['John', 'Sarah', 'Mike', 'Lisa'] }) print("\nEmployee Details:") print(employee_details) print("\nDepartment Info:") print(department_info) # Merge on dept_code (single column first - simple) emp_with_dept = pd.merge(employee_details, department_info, on='dept_code', how='left') print("\nEmployees with department info:") print(emp_with_dept)
Complex merge with multiple conditions
# Create employee assignments assignments = pd.DataFrame({ 'emp_id': [1, 1, 2, 3, 4, 5], 'project_id': ['P1', 'P2', 'P1', 'P3', 'P2', 'P1'], 'role': ['Lead', 'Contributor', 'Contributor', 'Lead', 'Contributor', 'Lead'] }) project_details = pd.DataFrame({ 'project_id': ['P1', 'P2', 'P3', 'P4'], 'project_name': ['Alpha', 'Beta', 'Gamma', 'Delta'], 'budget': [100000, 75000, 90000, 80000], 'lead_dept': ['Sales', 'IT', 'Sales', 'HR'] }) print("\nEmployee Assignments:") print(assignments) print("\nProject Details:") print(project_details) # Merge on multiple keys where available full_assignments = pd.merge( assignments, project_details, on='project_id', how='left' ) print("\nFull assignment details:") print(full_assignments)
Using indicator to track merge source
# Add _merge column to see where data came from merged_with_indicator = pd.merge( employees, contact, on='emp_id', how='outer', indicator=True # Adds _merge column ) print("\nMerge with indicator:") print(merged_with_indicator) # Analyze merge results print("\nMerge summary:") print(merged_with_indicator['_merge'].value_counts()) # 'both' = in both DataFrames # 'left_only' = only in employees # 'right_only' = only in contact
π― Complete Hands-on Checklist
import pandas as pd import numpy as np # Create sample datasets customers = pd.DataFrame({ 'customer_id': [1, 2, 3, 4, 5], 'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'], 'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA'] }) orders = pd.DataFrame({ 'order_id': [101, 102, 103, 104, 105], 'customer_id': [1, 2, 1, 3, 6], # Note: customer 6 doesn't exist 'amount': [250, 100, 300, 150, 200], 'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05'] }) products = pd.DataFrame({ 'product_id': [1001, 1002, 1003], 'product_name': ['Laptop', 'Mouse', 'Keyboard'], 'price': [1000, 25, 75] }) order_items = pd.DataFrame({ 'order_id': [101, 101, 102, 103, 104, 104], 'product_id': [1001, 1002, 1002, 1003, 1001, 1003], 'quantity': [1, 2, 1, 1, 1, 1] }) print("="*60) print("1. VERTICAL CONCATENATION - Stack customers") print("="*60) more_customers = pd.DataFrame({ 'customer_id': [6, 7], 'name': ['Frank', 'Grace'], 'city': ['Boston', 'Seattle'] }) all_customers = pd.concat([customers, more_customers], ignore_index=True) print(all_customers) print("\n" + "="*60) print("2. HORIZONTAL CONCATENATION - Add product info") print("="*60) products_with_price = pd.concat([products, products[['price']] * 1.1], axis=1) products_with_price.columns = ['product_id', 'product_name', 'price', 'price_with_tax'] print(products_with_price) print("\n" + "="*60) print("3. INNER JOIN - Customers with orders") print("="*60) customers_with_orders = pd.merge(customers, orders, on='customer_id', how='inner') print(customers_with_orders) print("\n" + "="*60) print("4. LEFT JOIN - All customers with orders") print("="*60) all_customers_orders = pd.merge(customers, orders, on='customer_id', how='left') print(all_customers_orders) print(f"\nCustomers with no orders: {all_customers_orders['order_id'].isnull().sum()}") print("\n" + "="*60) print("5. MULTI-COLUMN MERGE - Orders with products") print("="*60) full_orders = pd.merge( order_items, orders[['order_id', 'customer_id']], on='order_id', how='left' ) full_orders = pd.merge(full_orders, products, on='product_id', how='left') full_orders['total'] = full_orders['quantity'] * full_orders['price'] print(full_orders)
β Common Mistakes & Solutions
| Mistake | Why it fails | Correct way |
|---|---|---|
pd.concat([df1, df2]) when columns differ | Creates NaN for missing columns | Use join='inner' or handle NaN |
Forgetting axis=1 for horizontal concat | Defaults to vertical (axis=0) | Specify axis=1 |
Using merge() without specifying how | Defaults to inner join | Explicitly state how='left'/'right'/'outer'/'inner' |
Merging on different column names without left_on/right_on | KeyError | Use left_on and right_on |
| Not checking for duplicate column names | _x, _y suffixes added | Use suffixes parameter or drop columns |
π Practice Exercises
Create two DataFrames with employee Q1 and Q2 bonuses and concatenate them vertically.
Join the employees dataset with itself to find employees who have the same department (self-join).
Perform a full outer join on customers and orders, then fill missing order amounts with 0.
Merge three datasets: employees, departments, and salaries on appropriate keys.
Use
indicator=Trueto analyze which customers have orders and which don't.
β Summary of Day 6
| Operation | Method | Key Parameters | SQL Equivalent |
|---|---|---|---|
| Vertical concat | pd.concat([df1, df2]) | axis=0, ignore_index | UNION ALL |
| Horizontal concat | pd.concat([df1, df2], axis=1) | axis=1, join | - |
| Inner join | pd.merge(df1, df2, how='inner') | on, left_on, right_on | INNER JOIN |
| Left join | pd.merge(df1, df2, how='left') | on, indicator | LEFT JOIN |
| Right join | pd.merge(df1, df2, how='right') | on | RIGHT JOIN |
| Outer join | pd.merge(df1, df2, how='outer') | on | FULL OUTER JOIN |
π Pro Tips
Always check data types before merging - keys should be same type
Use
validateparameter to check merge cardinality:validate='one_to_one'- ensures unique keysvalidate='many_to_one'- left keys unique, right can duplicate
Set
copy=Falsefor large DataFrames to improve performanceUse
pd.concat([df1, df2], keys=['src1', 'src2'])to track originChain merges for multiple datasets:
df1.merge(df2).merge(df3)
π Advanced Example: Complex Data Pipeline
# Complete data integration pipeline result = ( employees .merge(contact, on='emp_id', how='left', suffixes=('', '_contact')) .merge(performance, left_on='emp_id', right_on='employee_id', how='left') .merge( projects.groupby('emp_id').agg( total_hours=('hours', 'sum'), num_projects=('project', 'count') ).reset_index(), on='emp_id', how='left' ) .fillna({'score': 0, 'rating': 'No Rating', 'total_hours': 0, 'num_projects': 0}) ) print("\nComplete Employee Data Mart:") print(result.round(2))
Discussion (0)