MadiisAttendance
Combining DataFrames in Pandas 13 min read 2,488 words 128 sentences uni

Pandas Day 6 β€” Merging, Joining & Concatenation

M Usman May 26, 2026
12 0 0 score
0%
Your Progress
0/38 sections
Reading Speed
0
words/min
Time Spent
00:00
Table of Contents

Quick Stats
Words: 2,488
Est. time: 13 min
Readability: Advanced

Word Cloud
fenestral crare ROXBURGH contemplatist soterial reauthentication searched potash IMPENETRABLENESS moidores pupilate subsalt apporter ANTILOPINE PUNCTILIO

πŸ“Œ Day 6: Merging, Joining & Concatenation

Prerequisite Setup

python
import pandas as pd
import numpy as np

# Create multiple datasets for different scenarios
np.random.seed(42)

# Dataset 1: Employee information
employees = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 6],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank'],
    'department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Marketing'],
    'salary': [50000, 60000, 75000, 55000, 68000, 62000]
})

# Dataset 2: Employee contact info
contact = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5, 7],  # Note: emp_id 7 doesn't exist in employees
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 
              'diana@email.com', 'eve@email.com', 'frank@email.com'],
    'phone': ['123-4567', '234-5678', '345-6789', '456-7890', '567-8901', '678-9012']
})

# Dataset 3: Performance scores (different key name)
performance = pd.DataFrame({
    'employee_id': [1, 2, 3, 4, 5, 6],
    'score': [85, 92, 78, 88, 95, 82],
    'rating': ['B', 'A', 'C', 'B', 'A', 'B']
})

# Dataset 4: Additional employees for concatenation
new_employees = pd.DataFrame({
    'emp_id': [7, 8],
    'name': ['Grace', 'Henry'],
    'department': ['HR', 'IT'],
    'salary': [59000, 71000]
})

# Dataset 5: Project assignments (multiple keys)
projects = pd.DataFrame({
    'emp_id': [1, 1, 2, 3, 4, 5, 5, 6],
    'project': ['Alpha', 'Beta', 'Gamma', 'Alpha', 'Delta', 'Gamma', 'Epsilon', 'Beta'],
    'hours': [40, 10, 35, 45, 30, 25, 15, 40]
})

print("Employees Dataset:")
print(employees)
print("\n" + "="*60)

print("Contact Dataset:")
print(contact)
print("\n" + "="*60)

print("Performance Dataset:")
print(performance)
print("\n" + "="*60)

print("New Employees Dataset:")
print(new_employees)
print("\n" + "="*60)

print("Projects Dataset:")
print(projects.head())
print("\n" + "="*60)

1️⃣ Vertically stack two DataFrames with concat

What is vertical concatenation?

Stacking rows from multiple DataFrames on top of each other (like SQL UNION ALL).

Basic vertical concatenation

python
# Stack employees and new_employees vertically
all_employees = pd.concat([employees, new_employees], ignore_index=True)
print("All Employees (vertically stacked):")
print(all_employees)
print(f"\nOriginal rows: {len(employees)} + {len(new_employees)} = {len(all_employees)}")

Handling different column names

python
# Create DataFrames with different columns
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'C': [7, 8]})  # Note: different columns

print("\nDataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Default concat (fills missing with NaN)
combined_default = pd.concat([df1, df2], ignore_index=True)
print("\nDefault concat (fills missing with NaN):")
print(combined_default)

# Use join='inner' to keep only common columns
combined_inner = pd.concat([df1, df2], ignore_index=True, join='inner')
print("\nInner join (only common columns):")
print(combined_inner)

Adding keys to identify source

python
# Add keys to know which DataFrame each row came from
employees_with_key = pd.concat(
    [employees, new_employees], 
    keys=['Original', 'New'],
    names=['Source']
)
print("\nConcatenation with keys:")
print(employees_with_key)

# Reset index to make it cleaner
employees_with_key_reset = employees_with_key.reset_index(level=0)
print("\nWith reset index:")
print(employees_with_key_reset)

Handling index conflicts

python
# Create DataFrames with overlapping indices
df_a = pd.DataFrame({'X': [1, 2]}, index=['a', 'b'])
df_b = pd.DataFrame({'X': [3, 4]}, index=['a', 'c'])

print("\nDataFrame A:")
print(df_a)
print("\nDataFrame B:")
print(df_b)

# Default - keeps all indices (may create duplicates)
concat_default = pd.concat([df_a, df_b])
print("\nDefault concat (allows duplicate indices):")
print(concat_default)

# Verify if index is unique
concat_verify = pd.concat([df_a, df_b], verify_integrity=True)
# This would raise an error if indices overlap
print("\nWith verify_integrity=True (would error on overlap):")
# Uncomment to see error: pd.concat([df_a, df_b], verify_integrity=True)

Real-world example: Monthly sales data

python
# Sales data for different months
jan_sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=5, freq='D'),
    'product': ['A', 'B', 'A', 'C', 'B'],
    'sales': [100, 150, 120, 130, 110]
})

feb_sales = pd.DataFrame({
    'date': pd.date_range('2024-02-01', periods=5, freq='D'),
    'product': ['A', 'B', 'C', 'A', 'B'],
    'sales': [110, 140, 125, 115, 135]
})

# Stack monthly data
all_sales = pd.concat([jan_sales, feb_sales], ignore_index=True)
print("\nAll Sales Data (Jan + Feb):")
print(all_sales)

# Add month column using keys
all_sales_with_month = pd.concat(
    [jan_sales, feb_sales], 
    keys=['January', 'February'],
    names=['Month']
).reset_index(level=0)
print("\nSales with month identifier:")
print(all_sales_with_month.head())

2️⃣ Horizontally concatenate along columns

Basic horizontal concatenation

python
# Create two DataFrames with same number of rows
df_left = pd.DataFrame({'id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
df_right = pd.DataFrame({'age': [25, 30, 35], 'city': ['NYC', 'LA', 'Chicago']})

print("Left DataFrame:")
print(df_left)
print("\nRight DataFrame:")
print(df_right)

# Horizontal concatenation (axis=1)
combined_horizontal = pd.concat([df_left, df_right], axis=1)
print("\nHorizontally concatenated:")
print(combined_horizontal)

Handling different row counts

python
# DataFrames with different number of rows
df_a = pd.DataFrame({'A': [1, 2, 3]})
df_b = pd.DataFrame({'B': [4, 5]})  # Only 2 rows

print("\nDataFrame A (3 rows):")
print(df_a)
print("\nDataFrame B (2 rows):")
print(df_b)

# Default (outer join) - fills missing with NaN
concat_horiz_outer = pd.concat([df_a, df_b], axis=1)
print("\nHorizontal concat (outer join - default):")
print(concat_horiz_outer)

# Inner join - only keeps common rows
concat_horiz_inner = pd.concat([df_a, df_b], axis=1, join='inner')
print("\nHorizontal concat (inner join):")
print(concat_horiz_inner)

Aligning by index

python
# Create DataFrames with meaningful indices
sales_q1 = pd.DataFrame({
    'revenue': [1000, 1200, 1100],
    'cost': [600, 700, 650]
}, index=['Product A', 'Product B', 'Product C'])

sales_q2 = pd.DataFrame({
    'revenue': [1150, 1300, 1050],
    'cost': [650, 750, 620]
}, index=['Product A', 'Product B', 'Product D'])  # Different product

print("\nQ1 Sales:")
print(sales_q1)
print("\nQ2 Sales:")
print(sales_q2)

# Horizontal concat aligns by index
combined_quarters = pd.concat([sales_q1, sales_q2], axis=1, keys=['Q1', 'Q2'])
print("\nCombined Quarters (aligned by product):")
print(combined_quarters)

# Calculate growth where data exists
combined_quarters['Growth'] = (
    combined_quarters[('Q2', 'revenue')] - combined_quarters[('Q1', 'revenue')]
) / combined_quarters[('Q1', 'revenue')] * 100
print("\nWith growth calculation:")
print(combined_quarters)

Real-world: Combining features and target

python
# Features DataFrame
features = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'age': [25, 30, 35, 28, 32],
    'income': [50000, 60000, 75000, 55000, 68000],
    'purchases': [10, 15, 8, 12, 20]
})

# Target variable (what we want to predict)
target = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'churned': [0, 1, 0, 0, 1]  # 1 = churned, 0 = stayed
})

print("\nFeatures (X):")
print(features)
print("\nTarget (y):")
print(target)

# Combine for modeling (horizontal concat)
model_data = pd.concat([features, target[['churned']]], axis=1)
print("\nCombined for modeling:")
print(model_data)

3️⃣ Perform inner join on common key column

What is an Inner Join?

Returns only rows where keys match in BOTH DataFrames.

Basic inner join with merge()

python
# Inner join employees and contact on emp_id
inner_join_result = pd.merge(employees, contact, on='emp_id', how='inner')
print("Inner Join (employees ∩ contact):")
print(inner_join_result)
print(f"\nRows in result: {len(inner_join_result)}")
print(f"Note: emp_id 6 and 7 are missing (no match in both)")

Inner join with different column names

python
# Join employees (emp_id) with performance (employee_id)
inner_diff_keys = pd.merge(
    employees, 
    performance, 
    left_on='emp_id', 
    right_on='employee_id', 
    how='inner'
)
print("\nInner join with different key names:")
print(inner_diff_keys)

# Drop the duplicate key column
inner_diff_keys_clean = inner_diff_keys.drop('employee_id', axis=1)
print("\nCleaned (removed duplicate key):")
print(inner_diff_keys_clean)

Inner join with multiple matching keys

python
# Create department managers dataset
dept_managers = pd.DataFrame({
    'department': ['Sales', 'IT', 'HR', 'Marketing'],
    'manager': ['John', 'Sarah', 'Mike', 'Lisa'],
    'location': ['Floor 1', 'Floor 2', 'Floor 1', 'Floor 3']
})

print("\nDepartment Managers:")
print(dept_managers)

# Inner join employees with managers on department
emp_with_manager = pd.merge(employees, dept_managers, on='department', how='inner')
print("\nEmployees with managers (inner join):")
print(emp_with_manager)

Visualizing joins with small data

python
# Simple example to understand inner join
df1 = pd.DataFrame({
    'key': ['A', 'B', 'C', 'D'],
    'value1': [1, 2, 3, 4]
})

df2 = pd.DataFrame({
    'key': ['B', 'C', 'E', 'F'],
    'value2': [5, 6, 7, 8]
})

print("\nDataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)

# Inner join (only B and C appear in both)
inner_demo = pd.merge(df1, df2, on='key', how='inner')
print("\nInner Join Result (only matching keys B & C):")
print(inner_demo)

4️⃣ Do left join and observe missing values

What is a Left Join?

Keeps ALL rows from left DataFrame, matches from right where available (fills NaN for missing).

Basic left join

python
# Left join employees (left) with contact (right)
left_join_result = pd.merge(employees, contact, on='emp_id', how='left')
print("Left Join (all employees, contact info where available):")
print(left_join_result)
print(f"\nMissing emails: {left_join_result['email'].isnull().sum()}")

Observing missing values

python
# Identify which employees have missing contact info
missing_contact = left_join_result[left_join_result['email'].isnull()]
print("\nEmployees missing contact info:")
print(missing_contact[['emp_id', 'name', 'department']])

# Count missing values per column
print(f"\nMissing values summary:")
print(left_join_result[['email', 'phone']].isnull().sum())

Right join (opposite of left join)

python
# Right join (all contact records, employee info where available)
right_join_result = pd.merge(employees, contact, on='emp_id', how='right')
print("\nRight Join (all contact records):")
print(right_join_result)
# Note: emp_id 7 has no employee info

Left join with different column names

python
# Left join employees with performance (different column names)
left_join_diff = pd.merge(
    employees, 
    performance, 
    left_on='emp_id', 
    right_on='employee_id', 
    how='left'
)
print("\nLeft join with different key names:")
print(left_join_diff)

# Notice employee_id column exists, can drop it
left_join_clean = left_join_diff.drop('employee_id', axis=1)
print("\nCleaned left join:")
print(left_join_clean)

Handling missing values after left join

python
# Fill missing performance scores with default value
left_join_clean['score'] = left_join_clean['score'].fillna(0)
left_join_clean['rating'] = left_join_clean['rating'].fillna('No Rating')

print("\nAfter filling missing values:")
print(left_join_clean[['name', 'score', 'rating']])

# Or flag missing values
left_join_clean['has_performance'] = ~left_join_clean['score'].isnull()

5️⃣ Merge on multiple columns with different names

Merging on multiple columns

python
# Create datasets with composite keys
sales_targets = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024],
    'quarter': [1, 2, 1, 2],
    'target': [100000, 110000, 120000, 130000]
})

actual_sales = pd.DataFrame({
    'year': [2023, 2023, 2024, 2024, 2024],
    'quarter': [1, 2, 1, 2, 2],
    'region': ['North', 'South', 'East', 'West', 'North'],
    'sales': [95000, 105000, 125000, 128000, 102000]
})

print("Sales Targets (multiple keys):")
print(sales_targets)
print("\nActual Sales:")
print(actual_sales)

# Merge on multiple columns (year and quarter)
sales_comparison = pd.merge(
    actual_sales, 
    sales_targets, 
    on=['year', 'quarter'], 
    how='left'
)
print("\nSales vs Targets (merged on year & quarter):")
print(sales_comparison)

# Calculate performance
sales_comparison['performance_vs_target'] = (
    (sales_comparison['sales'] - sales_comparison['target']) / sales_comparison['target'] * 100
).round(2)
print("\nWith performance calculation:")
print(sales_comparison)

Merging with different column names on multiple keys

python
# Dataset with different column names
employee_details = pd.DataFrame({
    'emp_id': [1, 2, 3, 4, 5],
    'emp_name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'dept_code': ['SAL', 'IT', 'SAL', 'HR', 'IT']
})

department_info = pd.DataFrame({
    'dept_name': ['Sales', 'IT', 'HR', 'Marketing'],
    'dept_code': ['SAL', 'IT', 'HR', 'MKT'],
    'floor': [1, 2, 1, 3],
    'manager': ['John', 'Sarah', 'Mike', 'Lisa']
})

print("\nEmployee Details:")
print(employee_details)
print("\nDepartment Info:")
print(department_info)

# Merge on dept_code (single column first - simple)
emp_with_dept = pd.merge(employee_details, department_info, on='dept_code', how='left')
print("\nEmployees with department info:")
print(emp_with_dept)

Complex merge with multiple conditions

python
# Create employee assignments
assignments = pd.DataFrame({
    'emp_id': [1, 1, 2, 3, 4, 5],
    'project_id': ['P1', 'P2', 'P1', 'P3', 'P2', 'P1'],
    'role': ['Lead', 'Contributor', 'Contributor', 'Lead', 'Contributor', 'Lead']
})

project_details = pd.DataFrame({
    'project_id': ['P1', 'P2', 'P3', 'P4'],
    'project_name': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'budget': [100000, 75000, 90000, 80000],
    'lead_dept': ['Sales', 'IT', 'Sales', 'HR']
})

print("\nEmployee Assignments:")
print(assignments)
print("\nProject Details:")
print(project_details)

# Merge on multiple keys where available
full_assignments = pd.merge(
    assignments, 
    project_details, 
    on='project_id', 
    how='left'
)
print("\nFull assignment details:")
print(full_assignments)

Using indicator to track merge source

python
# Add _merge column to see where data came from
merged_with_indicator = pd.merge(
    employees, 
    contact, 
    on='emp_id', 
    how='outer',
    indicator=True  # Adds _merge column
)
print("\nMerge with indicator:")
print(merged_with_indicator)

# Analyze merge results
print("\nMerge summary:")
print(merged_with_indicator['_merge'].value_counts())
# 'both' = in both DataFrames
# 'left_only' = only in employees
# 'right_only' = only in contact

🎯 Complete Hands-on Checklist

python
import pandas as pd
import numpy as np

# Create sample datasets
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA']
})

orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': [1, 2, 1, 3, 6],  # Note: customer 6 doesn't exist
    'amount': [250, 100, 300, 150, 200],
    'date': ['2024-01-01', '2024-01-02', '2024-01-03', '2024-01-04', '2024-01-05']
})

products = pd.DataFrame({
    'product_id': [1001, 1002, 1003],
    'product_name': ['Laptop', 'Mouse', 'Keyboard'],
    'price': [1000, 25, 75]
})

order_items = pd.DataFrame({
    'order_id': [101, 101, 102, 103, 104, 104],
    'product_id': [1001, 1002, 1002, 1003, 1001, 1003],
    'quantity': [1, 2, 1, 1, 1, 1]
})

print("="*60)
print("1. VERTICAL CONCATENATION - Stack customers")
print("="*60)
more_customers = pd.DataFrame({
    'customer_id': [6, 7],
    'name': ['Frank', 'Grace'],
    'city': ['Boston', 'Seattle']
})
all_customers = pd.concat([customers, more_customers], ignore_index=True)
print(all_customers)

print("\n" + "="*60)
print("2. HORIZONTAL CONCATENATION - Add product info")
print("="*60)
products_with_price = pd.concat([products, products[['price']] * 1.1], axis=1)
products_with_price.columns = ['product_id', 'product_name', 'price', 'price_with_tax']
print(products_with_price)

print("\n" + "="*60)
print("3. INNER JOIN - Customers with orders")
print("="*60)
customers_with_orders = pd.merge(customers, orders, on='customer_id', how='inner')
print(customers_with_orders)

print("\n" + "="*60)
print("4. LEFT JOIN - All customers with orders")
print("="*60)
all_customers_orders = pd.merge(customers, orders, on='customer_id', how='left')
print(all_customers_orders)
print(f"\nCustomers with no orders: {all_customers_orders['order_id'].isnull().sum()}")

print("\n" + "="*60)
print("5. MULTI-COLUMN MERGE - Orders with products")
print("="*60)
full_orders = pd.merge(
    order_items, 
    orders[['order_id', 'customer_id']], 
    on='order_id', 
    how='left'
)
full_orders = pd.merge(full_orders, products, on='product_id', how='left')
full_orders['total'] = full_orders['quantity'] * full_orders['price']
print(full_orders)

❌ Common Mistakes & Solutions

MistakeWhy it failsCorrect way
pd.concat([df1, df2]) when columns differCreates NaN for missing columnsUse join='inner' or handle NaN
Forgetting axis=1 for horizontal concatDefaults to vertical (axis=0)Specify axis=1
Using merge() without specifying howDefaults to inner joinExplicitly state how='left'/'right'/'outer'/'inner'
Merging on different column names without left_on/right_onKeyErrorUse left_on and right_on
Not checking for duplicate column names_x, _y suffixes addedUse suffixes parameter or drop columns

πŸ“ Practice Exercises

  1. Create two DataFrames with employee Q1 and Q2 bonuses and concatenate them vertically.

  2. Join the employees dataset with itself to find employees who have the same department (self-join).

  3. Perform a full outer join on customers and orders, then fill missing order amounts with 0.

  4. Merge three datasets: employees, departments, and salaries on appropriate keys.

  5. Use indicator=True to analyze which customers have orders and which don't.


βœ… Summary of Day 6

OperationMethodKey ParametersSQL Equivalent
Vertical concatpd.concat([df1, df2])axis=0, ignore_indexUNION ALL
Horizontal concatpd.concat([df1, df2], axis=1)axis=1, join-
Inner joinpd.merge(df1, df2, how='inner')on, left_on, right_onINNER JOIN
Left joinpd.merge(df1, df2, how='left')on, indicatorLEFT JOIN
Right joinpd.merge(df1, df2, how='right')onRIGHT JOIN
Outer joinpd.merge(df1, df2, how='outer')onFULL OUTER JOIN

πŸŽ“ Pro Tips

  1. Always check data types before merging - keys should be same type

  2. Use validate parameter to check merge cardinality:

    • validate='one_to_one' - ensures unique keys

    • validate='many_to_one' - left keys unique, right can duplicate

  3. Set copy=False for large DataFrames to improve performance

  4. Use pd.concat([df1, df2], keys=['src1', 'src2']) to track origin

  5. Chain merges for multiple datasets: df1.merge(df2).merge(df3)


πŸš€ Advanced Example: Complex Data Pipeline

python
# Complete data integration pipeline
result = (
    employees
    .merge(contact, on='emp_id', how='left', suffixes=('', '_contact'))
    .merge(performance, left_on='emp_id', right_on='employee_id', how='left')
    .merge(
        projects.groupby('emp_id').agg(
            total_hours=('hours', 'sum'),
            num_projects=('project', 'count')
        ).reset_index(),
        on='emp_id',
        how='left'
    )
    .fillna({'score': 0, 'rating': 'No Rating', 'total_hours': 0, 'num_projects': 0})
)

print("\nComplete Employee Data Mart:")
print(result.round(2))
Discussion (0)
Login to comment
Dictionary

Add New Word

Dictionary Words
My Notes
Highlights
Select text and click highlight to save
My Vocabulary
Quick Quiz
Settings
Reading Analytics
Today's reading: 0 min
Total read time: 0 min
Words learned: 0
Streak: 0 days
AI Summary

Generating summary...