Pandas Day 3: Data Cleaning — Missing Values, Duplicates & Outlier Capping

Table of Contents

📌 Day 3: Data Cleaning - Missing Values &am… H2

Prerequisite Setup (Dirty Dataset) H3

1️⃣ Count missing values per column using is… H2

Understanding missing values in pandas H3

Basic missing value detection H3

Percentage of missing values H3

Visual missing data summary H3

Check if ANY missing values exist H3

2️⃣ Fill missing numeric values with mean/me… H2

Why fill missing values? H3

Calculate statistics first H3

Method 1: Fill with mean H3

Method 2: Fill with median (better for skewe… H3

Method 3: Fill different columns with differ… H3

Method 4: Forward fill or backward fill H3

When to use mean vs median vs mode H3

3️⃣ Drop rows with any missing values H2

Method 1: Drop rows with ANY missing value H3

Method 2: Drop rows with ALL missing values H3

Method 3: Drop rows with missing values in s… H3

Method 4: Threshold-based dropping H3

4️⃣ Remove duplicate rows and keep first/last H2

Create duplicates first H3

Detect duplicates H3

Remove duplicates - keep first occurrence H3

Remove duplicates - keep last occurrence H3

Remove duplicates - remove all duplicates H3

Find duplicates based on specific columns H3

5️⃣ Replace outliers using quantile caps H2

What are outliers? H3

Detect outliers first H3

Method 1: Cap outliers at percentiles H3

Method 2: Cap using IQR method (1.5*IQR rule) H3

Method 3: Winsorization (replace outliers wi… H3

Practical example: Handle multiple columns w… H3

🎯 Complete Hands-on Checklist (Run this!) H2

❌ Common Mistakes & Solutions H2

📝 Practice Exercises H2

✅ Summary of Day 3 H2

🎓 Pro Tips for Real World H2

Quick Stats

Words: 2,136

Est. time: 11 min

Readability: Advanced

Word Cloud

DIALYZED supersimplify seeking sobriety reuplift permeations unslaked accordingly prankier quintar phonemicizing prognosticate unrealism FLAP-MOUTHED jaquima

📌 Day 3: Data Cleaning - Missing Values & Duplicates

Prerequisite Setup (Dirty Dataset)

import pandas as pd
import numpy as np

# Create a messy dataset with missing values, duplicates, and outliers
np.random.seed(42)  # for reproducibility

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'],
    'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR'],
    'Salary': [50000, 60000, np.nan, 55000, 48000, 80000, 65000, np.nan, 52000, 70000],
    'Age': [25, 30, 35, 28, 22, 40, 33, 29, np.nan, 45],
    'Experience': [2, 5, 8, 3, 1, 12, 6, 4, 2, 15],
    'Performance_Score': [85, 92, 78, 88, 95, 999, 82, 87, 90, 200]  # Outliers!
})

print("Original DataFrame (with issues):")
print(df)
print("\n" + "="*60)

Issues in this dataset:

Missing values (NaN) in Salary (row 2, 7) and Age (row 8)
Outliers in Performance_Score (999, 200)
We'll add duplicates later

1️⃣ Count missing values per column using `isnull().sum()`

Understanding missing values in pandas

Missing values are represented as NaN (Not a Number) or None
isnull() returns True for missing values, False otherwise
sum() counts True as 1, False as 0

Basic missing value detection

# Detect missing values (True/False mask)
print(df.isnull())
print("\n" + "="*60)

# Count missing per column
missing_count = df.isnull().sum()
print("Missing values per column:")
print(missing_count)

Output:

Missing values per column:
Employee              0
Department            0
Salary                2
Age                   1
Experience            0
Performance_Score     0
dtype: int64

Percentage of missing values

# Percentage of missing values
missing_percent = (df.isnull().sum() / len(df)) * 100
print("\nMissing values percentage:")
print(missing_percent)

Output:

Missing values percentage:
Employee              0.0
Department            0.0
Salary               20.0
Age                  10.0
Experience            0.0
Performance_Score     0.0
dtype: float64

Visual missing data summary

# Create a missing data summary DataFrame
missing_summary = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percent': (df.isnull().sum() / len(df)) * 100
})
print("\nMissing Data Summary:")
print(missing_summary)

Check if ANY missing values exist

# Check if there's any missing value in entire DataFrame
print("\nAny missing values in entire DataFrame?", df.isnull().any().any())
print("Any missing values in Salary?", df['Salary'].isnull().any())
print("Rows with ANY missing values:", df.isnull().any(axis=1).sum())

2️⃣ Fill missing numeric values with mean/median

Why fill missing values?

Deleting rows loses data
Many ML algorithms don't accept NaN
Mean/median preserves dataset size

Calculate statistics first

# Calculate mean and median before filling
salary_mean = df['Salary'].mean()
salary_median = df['Salary'].median()
age_mean = df['Age'].mean()
age_median = df['Age'].median()

print(f"Salary Mean: ${salary_mean:,.0f}")
print(f"Salary Median: ${salary_median:,.0f}")
print(f"Age Mean: {age_mean:.1f}")
print(f"Age Median: {age_median:.1f}")

Method 1: Fill with mean

# Create a copy to preserve original
df_filled_mean = df.copy()

# Fill missing Salary with mean
df_filled_mean['Salary'] = df_filled_mean['Salary'].fillna(salary_mean)
df_filled_mean['Age'] = df_filled_mean['Age'].fillna(age_mean)

print("\nDataFrame after filling with MEAN:")
print(df_filled_mean[['Employee', 'Salary', 'Age']])

Method 2: Fill with median (better for skewed data)

# Create another copy
df_filled_median = df.copy()

# Fill missing Salary with median
df_filled_median['Salary'] = df_filled_median['Salary'].fillna(salary_median)
df_filled_median['Age'] = df_filled_median['Age'].fillna(age_median)

print("\nDataFrame after filling with MEDIAN:")
print(df_filled_median[['Employee', 'Salary', 'Age']])

Method 3: Fill different columns with different strategies

df_filled_mixed = df.copy()

# Fill specific columns with different values
fill_values = {
    'Salary': df['Salary'].median(),  # Use median for Salary
    'Age': df['Age'].mean()           # Use mean for Age
}

df_filled_mixed = df_filled_mixed.fillna(fill_values)
print("\nMixed filling strategy:")
print(df_filled_mixed[['Employee', 'Salary', 'Age']])

Method 4: Forward fill or backward fill

# Create data with time series example
time_df = pd.DataFrame({
    'Day': [1, 2, 3, 4, 5],
    'Sales': [100, np.nan, np.nan, 150, 200]
})

print("\nOriginal time series:")
print(time_df)

# Forward fill (use previous valid value)
print("\nForward fill:")
print(time_df.fillna(method='ffill'))

# Backward fill (use next valid value)
print("\nBackward fill:")
print(time_df.fillna(method='bfill'))

When to use mean vs median vs mode

# Mode for categorical data (if needed)
department_mode = df['Department'].mode()[0]
print(f"\nMode of Department: {department_mode}")

3️⃣ Drop rows with any missing values

Method 1: Drop rows with ANY missing value

# Create a clean copy
df_dropped_any = df.copy()

# Drop rows with at least one missing value
df_clean_any = df_dropped_any.dropna()
print(f"Original rows: {len(df)}")
print(f"Rows after dropping ANY missing: {len(df_clean_any)}")
print("\nCleaned DataFrame (no missing values):")
print(df_clean_any)

Method 2: Drop rows with ALL missing values

# Create DataFrame with rows that are all NaN
df_test = pd.DataFrame([
    [1, 2, 3],
    [np.nan, np.nan, np.nan],  # All missing
    [4, np.nan, 6],
    [np.nan, np.nan, np.nan]   # All missing
])

print("\nTest DataFrame:")
print(df_test)

# Drop rows where ALL values are missing
print("\nDrop rows with ALL missing:")
print(df_test.dropna(how='all'))

Method 3: Drop rows with missing values in specific columns

# Drop rows where Salary is missing (critical column)
df_dropped_salary = df.dropna(subset=['Salary'])
print(f"\nRows after dropping missing Salary: {len(df_dropped_salary)}")
print(df_dropped_salary[['Employee', 'Salary']])

# Drop rows where Age OR Salary is missing
df_dropped_both = df.dropna(subset=['Age', 'Salary'])
print(f"\nRows after dropping missing Age OR Salary: {len(df_dropped_both)}")

Method 4: Threshold-based dropping

# Keep rows with at least 4 non-missing values
df_threshold = df.dropna(thresh=4)  # thresh = minimum number of non-NA values
print(f"\nRows after threshold dropping (need 4 non-missing): {len(df_threshold)}")

4️⃣ Remove duplicate rows and keep first/last

Create duplicates first

# Add duplicate rows to our dataset
df_with_dupes = df.copy()

# Create duplicate of first row
duplicate_row = df_with_dupes.iloc[[0]]
df_with_dupes = pd.concat([df_with_dupes, duplicate_row], ignore_index=True)

# Create duplicate of last row
duplicate_last = df_with_dupes.iloc[[-2]]
df_with_dupes = pd.concat([df_with_dupes, duplicate_last], ignore_index=True)

print("\n" + "="*60)
print("DataFrame WITH duplicates:")
print(df_with_dupes)
print(f"\nTotal rows (with duplicates): {len(df_with_dupes)}")

Detect duplicates

# Check for duplicate rows (all columns)
print("\nDuplicate rows (all columns):")
print(df_with_dupes.duplicated())

# Count duplicates
print(f"\nNumber of duplicate rows: {df_with_dupes.duplicated().sum()}")

# Show duplicate rows
print("\nDuplicate rows (displayed):")
print(df_with_dupes[df_with_dupes.duplicated(keep=False)])

Remove duplicates - keep first occurrence

# Remove duplicates, keep first occurrence
df_no_dupes_first = df_with_dupes.drop_duplicates(keep='first')
print(f"\nAfter removing duplicates (keep='first'): {len(df_no_dupes_first)} rows")

Remove duplicates - keep last occurrence

# Remove duplicates, keep last occurrence
df_no_dupes_last = df_with_dupes.drop_duplicates(keep='last')
print(f"After removing duplicates (keep='last'): {len(df_no_dupes_last)} rows")

Remove duplicates - remove all duplicates

# Remove all rows that are duplicates (keep none)
df_no_dupes_none = df_with_dupes.drop_duplicates(keep=False)
print(f"After removing ALL duplicates (keep=False): {len(df_no_dupes_none)} rows")

Find duplicates based on specific columns

# Add some semi-duplicates
df_test = pd.DataFrame({
    'ID': [1, 2, 3, 4, 1, 2],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Bob'],
    'Score': [85, 92, 78, 88, 85, 95]
})

print("\n" + "="*60)
print("Semi-duplicate example:")
print(df_test)

# Find duplicates based only on ID and Name
print("\nDuplicates based on ID and Name columns:")
print(df_test[df_test.duplicated(subset=['ID', 'Name'], keep=False)])

# Remove duplicates based on specific columns
df_clean_semi = df_test.drop_duplicates(subset=['ID', 'Name'], keep='first')
print("\nAfter removing duplicates based on ID and Name:")
print(df_clean_semi)

5️⃣ Replace outliers using quantile caps

What are outliers?

Outliers are extreme values that can skew analysis. Common approach: Cap at 1st and 99th percentiles.

Detect outliers first

# Use our original dataset with outliers in Performance_Score
print("\n" + "="*60)
print("Outlier Detection in Performance_Score:")
print(df[['Employee', 'Performance_Score']])

# Calculate percentiles
Q1 = df['Performance_Score'].quantile(0.25)
Q3 = df['Performance_Score'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"\nQ1 (25th percentile): {Q1}")
print(f"Q3 (75th percentile): {Q3}")
print(f"IQR: {IQR}")
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")

# Identify outliers
outliers = df[(df['Performance_Score'] < lower_bound) | (df['Performance_Score'] > upper_bound)]
print(f"\nOutliers detected:")
print(outliers[['Employee', 'Performance_Score']])

Method 1: Cap outliers at percentiles

# Create a copy
df_capped = df.copy()

# Define percentile caps (e.g., cap at 5th and 95th percentile)
lower_cap = df_capped['Performance_Score'].quantile(0.05)
upper_cap = df_capped['Performance_Score'].quantile(0.95)

print(f"\nLower cap (5th percentile): {lower_cap}")
print(f"Upper cap (95th percentile): {upper_cap}")

# Apply caps
df_capped['Performance_Score_Capped'] = df_capped['Performance_Score'].clip(
    lower=lower_cap, 
    upper=upper_cap
)

print("\nBefore vs After capping:")
comparison = df_capped[['Employee', 'Performance_Score', 'Performance_Score_Capped']]
print(comparison)

Method 2: Cap using IQR method (1.5*IQR rule)

df_iqr_capped = df.copy()

# Calculate bounds using IQR method
Q1 = df_iqr_capped['Performance_Score'].quantile(0.25)
Q3 = df_iqr_capped['Performance_Score'].quantile(0.75)
IQR = Q3 - Q1

lower_bound_iqr = Q1 - 1.5 * IQR
upper_bound_iqr = Q3 + 1.5 * IQR

# Cap outliers
df_iqr_capped['Performance_Score_IQR_Capped'] = df_iqr_capped['Performance_Score'].clip(
    lower=lower_bound_iqr,
    upper=upper_bound_iqr
)

print("\nIQR Method Capping:")
print(f"Lower bound: {lower_bound_iqr}")
print(f"Upper bound: {upper_bound_iqr}")
print("\nResults:")
print(df_iqr_capped[['Employee', 'Performance_Score', 'Performance_Score_IQR_Capped']])

Method 3: Winsorization (replace outliers with nearest non-outlier)

# Winsorization at 5% level (cap top and bottom 5%)
def winsorize(series, limits=(0.05, 0.05)):
    """Replace outliers with nearest non-outlier values"""
    lower = series.quantile(limits[0])
    upper = series.quantile(1 - limits[1])
    return series.clip(lower=lower, upper=upper)

df_winsorized = df.copy()
df_winsorized['Performance_Score_Winsorized'] = winsorize(df_winsorized['Performance_Score'])

print("\nWinsorized (5% capping):")
print(df_winsorized[['Employee', 'Performance_Score', 'Performance_Score_Winsorized']])

Practical example: Handle multiple columns with outliers

# Create dataset with multiple outlier columns
df_multi = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Price': [100, 150, 120, 1000, 110],  # 1000 is outlier
    'Quantity': [50, 60, 55, 500, 52]      # 500 is outlier
})

print("\n" + "="*60)
print("Multiple Column Outlier Treatment:")
print(df_multi)

# Function to cap outliers in a column
def cap_outliers(series, lower_percentile=0.01, upper_percentile=0.99):
    lower = series.quantile(lower_percentile)
    upper = series.quantile(upper_percentile)
    return series.clip(lower=lower, upper=upper)

# Apply to numeric columns
numeric_cols = df_multi.select_dtypes(include=['int64', 'float64']).columns
df_cleaned = df_multi.copy()

for col in numeric_cols:
    df_cleaned[f'{col}_capped'] = cap_outliers(df_multi[col])

print("\nBefore vs After:")
print(df_cleaned)

🎯 Complete Hands-on Checklist (Run this!)

import pandas as pd
import numpy as np

# Load messy dataset
df_messy = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 1, 2, 6, 7, 8],
    'Name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie', 'John', 'Jane', 'David', 'Emma', 'Frank'],
    'Age': [25, 30, np.nan, 28, 35, 25, 30, 40, 29, 32],
    'Salary': [50000, 60000, 55000, np.nan, 70000, 50000, 60000, 120000, 58000, 62000],
    'Score': [85, 92, 78, 88, 95, 85, 92, 98, 87, 300]  # Has outlier
})

print("="*60)
print("STEP 1: Count missing values")
print("="*60)
print(df_messy.isnull().sum())
print(f"\nPercentage missing:\n{(df_messy.isnull().sum() / len(df_messy)) * 100}")

print("\n" + "="*60)
print("STEP 2: Fill missing values")
print("="*60)
df_filled = df_messy.copy()
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].median())
df_filled['Salary'] = df_filled['Salary'].fillna(df_filled['Salary'].mean())
print("After filling:")
print(df_filled[['ID', 'Age', 'Salary']])

print("\n" + "="*60)
print("STEP 3: Drop rows with missing values")
print("="*60)
df_dropped = df_messy.dropna()
print(f"Original: {len(df_messy)} rows, After drop: {len(df_dropped)} rows")

print("\n" + "="*60)
print("STEP 4: Remove duplicates")
print("="*60)
print(f"Duplicate count: {df_messy.duplicated().sum()}")
df_unique = df_messy.drop_duplicates(keep='first')
print(f"After removing duplicates: {len(df_unique)} rows")

print("\n" + "="*60)
print("STEP 5: Replace outliers")
print("="*60)
print("Original Score column stats:")
print(df_messy['Score'].describe())

# Cap at 95th percentile
cap = df_messy['Score'].quantile(0.95)
df_messy['Score_Capped'] = df_messy['Score'].clip(upper=cap)
print(f"\nCapped at 95th percentile ({cap}):")
print(df_messy[['Score', 'Score_Capped']])

❌ Common Mistakes & Solutions

Mistake	Why it fails	Correct way
`df.fillna(df.mean())`	Inplace operation not specified	`df.fillna(df.mean(), inplace=True)` or assign to new variable
Using mean when data has outliers	Mean is skewed by outliers	Use median for skewed distributions
`df.dropna()` without assigning	Returns new DataFrame, doesn't modify original	`df_clean = df.dropna()` or `df.dropna(inplace=True)`
Confusing `duplicated()` and `drop_duplicates()`	One detects, one removes	`df.duplicated()` returns boolean mask; `df.drop_duplicates()` removes
Capping before analyzing distribution	Might hide important patterns	Always visualize before capping

📝 Practice Exercises

Load a dataset with missing values and calculate the percentage of missing values in each column.
Fill missing values differently: numeric with median, categorical with mode.
Remove rows where more than 2 columns have missing values.
Find and remove duplicate rows based on specific columns (e.g., 'ID' and 'Name').
Cap outliers in a column at the 1st and 99th percentiles.

✅ Summary of Day 3

Task	Method	Key Parameters
Count missing	`df.isnull().sum()`	None
Fill missing	`df.fillna(value)`	`value`, `method='ffill'`
Drop missing	`df.dropna()`	`how='any'/'all'`, `subset`
Find duplicates	`df.duplicated()`	`keep='first'/'last'/False`
Remove duplicates	`df.drop_duplicates()`	`keep`, `subset`
Cap outliers	`df['col'].clip(lower, upper)`	`lower`, `upper`

🎓 Pro Tips for Real World

Always check missing data pattern - Is it random or systematic?
Document your cleaning steps - Future you will thank you
Keep original data - Work on copies, never modify original
Visualize before/after - Use histograms to see outlier impact
Consider domain knowledge - Sometimes 999 is a valid code, not outlier

Pandas Day 3 — Data Cleaning: Missing Values & Duplicates