Pandas Day 6: concat(), merge() & SQL-style Joins

Table of Contents

📌 Day 6: Merging, Joining & Concatenati… H2

Prerequisite Setup H3

1️⃣ Vertically stack two DataFrames with con… H2

What is vertical concatenation? H3

Basic vertical concatenation H3

Handling different column names H3

Adding keys to identify source H3

Handling index conflicts H3

Real-world example: Monthly sales data H3

2️⃣ Horizontally concatenate along columns H2

Basic horizontal concatenation H3

Handling different row counts H3

Aligning by index H3

Real-world: Combining features and target H3

3️⃣ Perform inner join on common key column H2

What is an Inner Join? H3

Basic inner join with merge() H3

Inner join with different column names H3

Inner join with multiple matching keys H3

Visualizing joins with small data H3

4️⃣ Do left join and observe missing values H2

What is a Left Join? H3

Basic left join H3

Observing missing values H3

Right join (opposite of left join) H3

Left join with different column names H3

Handling missing values after left join H3

5️⃣ Merge on multiple columns with different… H2

Merging on multiple columns H3

Merging with different column names on multi… H3

Complex merge with multiple conditions H3

Using indicator to track merge source H3

🎯 Complete Hands-on Checklist H2

❌ Common Mistakes & Solutions H2

📝 Practice Exercises H2

✅ Summary of Day 6 H2

🎓 Pro Tips H2

🚀 Advanced Example: Complex Data Pipeline H2

Quick Stats

Words: 2,488

Est. time: 13 min

Readability: Advanced

Word Cloud

fenestral crare ROXBURGH contemplatist soterial reauthentication searched potash IMPENETRABLENESS moidores pupilate subsalt apporter ANTILOPINE PUNCTILIO

📌 Day 6: Merging, Joining & Concatenation

Prerequisite Setup

import pandas as pd
import numpy as np

# Create multiple datasets for different scenarios
np.random.seed(42)

# Dataset 1: Employee information
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 6],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Marketing'],
    'salary': [50000, 60000, 75000, 55000, 68000, 62000]
})

# Dataset 2: Employee contact info
contact = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 7],  # Note: emp_id 7 doesn't exist in employees
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 
              'diana@email.com', 'eve@email.com', 'frank@email.com'],
    'phone': ['123-4567', '234-5678', '345-6789', '456-7890', '567-8901', '678-9012']
})

# Dataset 3: Performance scores (different key name)
performance = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5, 6],
    'score': [85, 92, 78, 88, 95, 82],
    'rating': ['B', 'A', 'C', 'B', 'A', 'B']
})

# Dataset 4: Additional employees for concatenation
new_employees = pd.DataFrame({
    'emp_id': [7, 8],
    'name': ['Grace', 'Henry'],
    'department': ['HR', 'IT'],
    'salary': [59000, 71000]
})

# Dataset 5: Project assignments (multiple keys)
projects = pd.DataFrame({
    'emp_id': [1, 1, 2, 3, 4, 5, 5, 6],
    'project': ['Alpha', 'Beta', 'Gamma', 'Alpha', 'Delta', 'Gamma', 'Epsilon', 'Beta'],
    'hours': [40, 10, 35, 45, 30, 25, 15, 40]
})

print("Employees Dataset:")
print(employees)
print("\n" + "="*60)

print("Contact Dataset:")
print(contact)
print("\n" + "="*60)

print("Performance Dataset:")
print(performance)
print("\n" + "="*60)

print("New Employees Dataset:")
print(new_employees)
print("\n" + "="*60)

print("Projects Dataset:")
print(projects.head())
print("\n" + "="*60)

1️⃣ Vertically stack two DataFrames with concat

What is vertical concatenation?

Stacking rows from multiple DataFrames on top of each other (like SQL UNION ALL).

Basic vertical concatenation

# Stack employees and new_employees vertically
all_employees = pd.concat([employees, new_employees], ignore_index=True)
print("All Employees (vertically stacked):")
print(all_employees)
print(f"\nOriginal rows: {len(employees)} + {len(new_employees)} = {len(all_employees)}")

Handling different column names

# Create DataFrames with different columns
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})  # Note: different columns

print("\nDataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Default concat (fills missing with NaN)
combined_default = pd.concat([df1, df2], ignore_index=True)
print("\nDefault concat (fills missing with NaN):")
print(combined_default)

# Use join='inner' to keep only common columns
combined_inner = pd.concat([df1, df2], ignore_index=True, join='inner')
print("\nInner join (only common columns):")
print(combined_inner)

Adding keys to identify source

# Add keys to know which DataFrame each row came from
employees_with_key = pd.concat(
    [employees, new_employees], 
    keys=['Original', 'New'],
    names=['Source']
)
print("\nConcatenation with keys:")
print(employees_with_key)

# Reset index to make it cleaner
employees_with_key_reset = employees_with_key.reset_index(level=0)
print("\nWith reset index:")
print(employees_with_key_reset)

Handling index conflicts

# Create DataFrames with overlapping indices
df_a = pd.DataFrame({'X': [1, 2]}, index=['a', 'b'])
df_b = pd.DataFrame({'X': [3, 4]}, index=['a', 'c'])

print("\nDataFrame A:")
print(df_a)
print("\nDataFrame B:")
print(df_b)

# Default - keeps all indices (may create duplicates)
concat_default = pd.concat([df_a, df_b])
print("\nDefault concat (allows duplicate indices):")
print(concat_default)

# Verify if index is unique
concat_verify = pd.concat([df_a, df_b], verify_integrity=True)
# This would raise an error if indices overlap
print("\nWith verify_integrity=True (would error on overlap):")
# Uncomment to see error: pd.concat([df_a, df_b], verify_integrity=True)

Real-world example: Monthly sales data

# Sales data for different months
jan_sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5, freq='D'),
    'product': ['A', 'B', 'A', 'C', 'B'],
    'sales': [100, 150, 120, 130, 110]
})

feb_sales = pd.DataFrame({
    'date': pd.date_range('2024-02-01', periods=5, freq='D'),
    'product': ['A', 'B', 'C', 'A', 'B'],
    'sales': [110, 140, 125, 115, 135]
})

# Stack monthly data
all_sales = pd.concat([jan_sales, feb_sales], ignore_index=True)
print("\nAll Sales Data (Jan + Feb):")
print(all_sales)

# Add month column using keys
all_sales_with_month = pd.concat(
    [jan_sales, feb_sales], 
    keys=['January', 'February'],
    names=['Month']
).reset_index(level=0)
print("\nSales with month identifier:")
print(all_sales_with_month.head())

2️⃣ Horizontally concatenate along columns

Basic horizontal concatenation

# Create two DataFrames with same number of rows
df_left = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df_right = pd.DataFrame({'age': [25, 30, 35], 'city': ['NYC', 'LA', 'Chicago']})

print("Left DataFrame:")
print(df_left)
print("\nRight DataFrame:")
print(df_right)

# Horizontal concatenation (axis=1)
combined_horizontal = pd.concat([df_left, df_right], axis=1)
print("\nHorizontally concatenated:")
print(combined_horizontal)

Handling different row counts

# DataFrames with different number of rows
df_a = pd.DataFrame({'A': [1, 2, 3]})
df_b = pd.DataFrame({'B': [4, 5]})  # Only 2 rows

print("\nDataFrame A (3 rows):")
print(df_a)
print("\nDataFrame B (2 rows):")
print(df_b)

# Default (outer join) - fills missing with NaN
concat_horiz_outer = pd.concat([df_a, df_b], axis=1)
print("\nHorizontal concat (outer join - default):")
print(concat_horiz_outer)

# Inner join - only keeps common rows
concat_horiz_inner = pd.concat([df_a, df_b], axis=1, join='inner')
print("\nHorizontal concat (inner join):")
print(concat_horiz_inner)

Aligning by index

# Create DataFrames with meaningful indices
sales_q1 = pd.DataFrame({
    'revenue': [1000, 1200, 1100],
    'cost': [600, 700, 650]
}, index=['Product A', 'Product B', 'Product C'])

sales_q2 = pd.DataFrame({
    'revenue': [1150, 1300, 1050],
    'cost': [650, 750, 620]
}, index=['Product A', 'Product B', 'Product D'])  # Different product

print("\nQ1 Sales:")
print(sales_q1)
print("\nQ2 Sales:")
print(sales_q2)

# Horizontal concat aligns by index
combined_quarters = pd.concat([sales_q1, sales_q2], axis=1, keys=['Q1', 'Q2'])
print("\nCombined Quarters (aligned by product):")
print(combined_quarters)

# Calculate growth where data exists
combined_quarters['Growth'] = (
    combined_quarters[('Q2', 'revenue')] - combined_quarters[('Q1', 'revenue')]
) / combined_quarters[('Q1', 'revenue')] * 100
print("\nWith growth calculation:")
print(combined_quarters)

Real-world: Combining features and target

# Features DataFrame
features = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 28, 32],
    'income': [50000, 60000, 75000, 55000, 68000],
    'purchases': [10, 15, 8, 12, 20]
})

# Target variable (what we want to predict)
target = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'churned': [0, 1, 0, 0, 1]  # 1 = churned, 0 = stayed
})

print("\nFeatures (X):")
print(features)
print("\nTarget (y):")
print(target)

# Combine for modeling (horizontal concat)
model_data = pd.concat([features, target[['churned']]], axis=1)
print("\nCombined for modeling:")
print(model_data)

3️⃣ Perform inner join on common key column

What is an Inner Join?

Returns only rows where keys match in BOTH DataFrames.

Basic inner join with merge()

# Inner join employees and contact on emp_id
inner_join_result = pd.merge(employees, contact, on='emp_id', how='inner')
print("Inner Join (employees ∩ contact):")
print(inner_join_result)
print(f"\nRows in result: {len(inner_join_result)}")
print(f"Note: emp_id 6 and 7 are missing (no match in both)")

Inner join with different column names

# Join employees (emp_id) with performance (employee_id)
inner_diff_keys = pd.merge(
    employees, 
    performance, 
    left_on='emp_id', 
    right_on='employee_id', 
    how='inner'
)
print("\nInner join with different key names:")
print(inner_diff_keys)

# Drop the duplicate key column
inner_diff_keys_clean = inner_diff_keys.drop('employee_id', axis=1)
print("\nCleaned (removed duplicate key):")
print(inner_diff_keys_clean)

Inner join with multiple matching keys

# Create department managers dataset
dept_managers = pd.DataFrame({
    'department': ['Sales', 'IT', 'HR', 'Marketing'],
    'manager': ['John', 'Sarah', 'Mike', 'Lisa'],
    'location': ['Floor 1', 'Floor 2', 'Floor 1', 'Floor 3']
})

print("\nDepartment Managers:")
print(dept_managers)

# Inner join employees with managers on department
emp_with_manager = pd.merge(employees, dept_managers, on='department', how='inner')
print("\nEmployees with managers (inner join):")
print(emp_with_manager)

Visualizing joins with small data

# Simple example to understand inner join
df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value1': [1, 2, 3, 4]
})

df2 = pd.DataFrame({
    'key': ['B', 'C', 'E', 'F'],
    'value2': [5, 6, 7, 8]
})

print("\nDataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Inner join (only B and C appear in both)
inner_demo = pd.merge(df1, df2, on='key', how='inner')
print("\nInner Join Result (only matching keys B & C):")
print(inner_demo)

4️⃣ Do left join and observe missing values

What is a Left Join?

Keeps ALL rows from left DataFrame, matches from right where available (fills NaN for missing).

Basic left join

# Left join employees (left) with contact (right)
left_join_result = pd.merge(employees, contact, on='emp_id', how='left')
print("Left Join (all employees, contact info where available):")
print(left_join_result)
print(f"\nMissing emails: {left_join_result['email'].isnull().sum()}")

Observing missing values

# Identify which employees have missing contact info
missing_contact = left_join_result[left_join_result['email'].isnull()]
print("\nEmployees missing contact info:")
print(missing_contact[['emp_id', 'name', 'department']])

# Count missing values per column
print(f"\nMissing values summary:")
print(left_join_result[['email', 'phone']].isnull().sum())

Right join (opposite of left join)

# Right join (all contact records, employee info where available)
right_join_result = pd.merge(employees, contact, on='emp_id', how='right')
print("\nRight Join (all contact records):")
print(right_join_result)
# Note: emp_id 7 has no employee info

Left join with different column names

# Left join employees with performance (different column names)
left_join_diff = pd.merge(
    employees, 
    performance, 
    left_on='emp_id', 
    right_on='employee_id', 
    how='left'
)
print("\nLeft join with different key names:")
print(left_join_diff)

# Notice employee_id column exists, can drop it
left_join_clean = left_join_diff.drop('employee_id', axis=1)
print("\nCleaned left join:")
print(left_join_clean)

Handling missing values after left join

# Fill missing performance scores with default value
left_join_clean['score'] = left_join_clean['score'].fillna(0)
left_join_clean['rating'] = left_join_clean['rating'].fillna('No Rating')

print("\nAfter filling missing values:")
print(left_join_clean[['name', 'score', 'rating']])

# Or flag missing values
left_join_clean['has_performance'] = ~left_join_clean['score'].isnull()

5️⃣ Merge on multiple columns with different names

Merging on multiple columns

# Create datasets with composite keys
sales_targets = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': [1, 2, 1, 2],
    'target': [100000, 110000, 120000, 130000]
})

actual_sales = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024, 2024],
    'quarter': [1, 2, 1, 2, 2],
    'region': ['North', 'South', 'East', 'West', 'North'],
    'sales': [95000, 105000, 125000, 128000, 102000]
})

print("Sales Targets (multiple keys):")
print(sales_targets)
print("\nActual Sales:")
print(actual_sales)

# Merge on multiple columns (year and quarter)
sales_comparison = pd.merge(
    actual_sales, 
    sales_targets, 
    on=['year', 'quarter'], 
    how='left'
)
print("\nSales vs Targets (merged on year & quarter):")
print(sales_comparison)

# Calculate performance
sales_comparison['performance_vs_target'] = (
    (sales_comparison['sales'] - sales_comparison['target']) / sales_comparison['target'] * 100
).round(2)
print("\nWith performance calculation:")
print(sales_comparison)

Merging with different column names on multiple keys

# Dataset with different column names
employee_details = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'emp_name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'dept_code': ['SAL', 'IT', 'SAL', 'HR', 'IT']
})

department_info = pd.DataFrame({
    'dept_name': ['Sales', 'IT', 'HR', 'Marketing'],
    'dept_code': ['SAL', 'IT', 'HR', 'MKT'],
    'floor': [1, 2, 1, 3],
    'manager': ['John', 'Sarah', 'Mike', 'Lisa']
})

print("\nEmployee Details:")
print(employee_details)
print("\nDepartment Info:")
print(department_info)

# Merge on dept_code (single column first - simple)
emp_with_dept = pd.merge(employee_details, department_info, on='dept_code', how='left')
print("\nEmployees with department info:")
print(emp_with_dept)

Complex merge with multiple conditions

# Create employee assignments
assignments = pd.DataFrame({
    'emp_id': [1, 1, 2, 3, 4, 5],
    'project_id': ['P1', 'P2', 'P1', 'P3', 'P2', 'P1'],
    'role': ['Lead', 'Contributor', 'Contributor', 'Lead', 'Contributor', 'Lead']
})

project_details = pd.DataFrame({
    'project_id': ['P1', 'P2', 'P3', 'P4'],
    'project_name': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'budget': [100000, 75000, 90000, 80000],
    'lead_dept': ['Sales', 'IT', 'Sales', 'HR']
})

print("\nEmployee Assignments:")
print(assignments)
print("\nProject Details:")
print(project_details)

# Merge on multiple keys where available
full_assignments = pd.merge(
    assignments, 
    project_details, 
    on='project_id', 
    how='left'
)
print("\nFull assignment details:")
print(full_assignments)

Using indicator to track merge source

# Add _merge column to see where data came from
merged_with_indicator = pd.merge(
    employees, 
    contact, 
    on='emp_id', 
    how='outer',
    indicator=True  # Adds _merge column
)
print("\nMerge with indicator:")
print(merged_with_indicator)

# Analyze merge results
print("\nMerge summary:")
print(merged_with_indicator['_merge'].value_counts())
# 'both' = in both DataFrames
# 'left_only' = only in employees
# 'right_only' = only in contact

🎯 Complete Hands-on Checklist

import pandas as pd
import numpy as np

# Create sample datasets
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 6],  # Note: customer 6 doesn't exist
    'amount': [250, 100, 300, 150, 200],
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05']
})

products = pd.DataFrame({
    'product_id': [1001, 1002, 1003],
    'product_name': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [1000, 25, 75]
})

order_items = pd.DataFrame({
    'order_id': [101, 101, 102, 103, 104, 104],
    'product_id': [1001, 1002, 1002, 1003, 1001, 1003],
    'quantity': [1, 2, 1, 1, 1, 1]
})

print("="*60)
print("1. VERTICAL CONCATENATION - Stack customers")
print("="*60)
more_customers = pd.DataFrame({
    'customer_id': [6, 7],
    'name': ['Frank', 'Grace'],
    'city': ['Boston', 'Seattle']
})
all_customers = pd.concat([customers, more_customers], ignore_index=True)
print(all_customers)

print("\n" + "="*60)
print("2. HORIZONTAL CONCATENATION - Add product info")
print("="*60)
products_with_price = pd.concat([products, products[['price']] * 1.1], axis=1)
products_with_price.columns = ['product_id', 'product_name', 'price', 'price_with_tax']
print(products_with_price)

print("\n" + "="*60)
print("3. INNER JOIN - Customers with orders")
print("="*60)
customers_with_orders = pd.merge(customers, orders, on='customer_id', how='inner')
print(customers_with_orders)

print("\n" + "="*60)
print("4. LEFT JOIN - All customers with orders")
print("="*60)
all_customers_orders = pd.merge(customers, orders, on='customer_id', how='left')
print(all_customers_orders)
print(f"\nCustomers with no orders: {all_customers_orders['order_id'].isnull().sum()}")

print("\n" + "="*60)
print("5. MULTI-COLUMN MERGE - Orders with products")
print("="*60)
full_orders = pd.merge(
    order_items, 
    orders[['order_id', 'customer_id']], 
    on='order_id', 
    how='left'
)
full_orders = pd.merge(full_orders, products, on='product_id', how='left')
full_orders['total'] = full_orders['quantity'] * full_orders['price']
print(full_orders)

❌ Common Mistakes & Solutions

Mistake	Why it fails	Correct way
`pd.concat([df1, df2])` when columns differ	Creates NaN for missing columns	Use `join='inner'` or handle NaN
Forgetting `axis=1` for horizontal concat	Defaults to vertical (axis=0)	Specify `axis=1`
Using `merge()` without specifying `how`	Defaults to inner join	Explicitly state `how='left'/'right'/'outer'/'inner'`
Merging on different column names without `left_on/right_on`	KeyError	Use `left_on` and `right_on`
Not checking for duplicate column names	_x, _y suffixes added	Use `suffixes` parameter or drop columns

📝 Practice Exercises

Create two DataFrames with employee Q1 and Q2 bonuses and concatenate them vertically.
Join the employees dataset with itself to find employees who have the same department (self-join).
Perform a full outer join on customers and orders, then fill missing order amounts with 0.
Merge three datasets: employees, departments, and salaries on appropriate keys.
Use indicator=True to analyze which customers have orders and which don't.

✅ Summary of Day 6

Operation	Method	Key Parameters	SQL Equivalent
Vertical concat	`pd.concat([df1, df2])`	`axis=0`, `ignore_index`	`UNION ALL`
Horizontal concat	`pd.concat([df1, df2], axis=1)`	`axis=1`, `join`	-
Inner join	`pd.merge(df1, df2, how='inner')`	`on`, `left_on`, `right_on`	`INNER JOIN`
Left join	`pd.merge(df1, df2, how='left')`	`on`, `indicator`	`LEFT JOIN`
Right join	`pd.merge(df1, df2, how='right')`	`on`	`RIGHT JOIN`
Outer join	`pd.merge(df1, df2, how='outer')`	`on`	`FULL OUTER JOIN`

🎓 Pro Tips

Always check data types before merging - keys should be same type
Use validate parameter to check merge cardinality:
- validate='one_to_one' - ensures unique keys
- validate='many_to_one' - left keys unique, right can duplicate
Set copy=False for large DataFrames to improve performance
Use pd.concat([df1, df2], keys=['src1', 'src2']) to track origin
Chain merges for multiple datasets: df1.merge(df2).merge(df3)

🚀 Advanced Example: Complex Data Pipeline

# Complete data integration pipeline
result = (
    employees
    .merge(contact, on='emp_id', how='left', suffixes=('', '_contact'))
    .merge(performance, left_on='emp_id', right_on='employee_id', how='left')
    .merge(
        projects.groupby('emp_id').agg(
            total_hours=('hours', 'sum'),
            num_projects=('project', 'count')
        ).reset_index(),
        on='emp_id',
        how='left'
    )
    .fillna({'score': 0, 'rating': 'No Rating', 'total_hours': 0, 'num_projects': 0})
)

print("\nComplete Employee Data Mart:")
print(result.round(2))

Pandas Day 6 — Merging, Joining & Concatenation