MadiisAttendance
Data Cleaning & Preprocessing in Pandas 11 min read 2,136 words 175 sentences uni

Pandas Day 3 β€” Data Cleaning: Missing Values & Duplicates

M Usman May 26, 2026
6 0 0 score
0%
Your Progress
0/40 sections
Reading Speed
0
words/min
Time Spent
00:00
Table of Contents

Quick Stats
Words: 2,136
Est. time: 11 min
Readability: Advanced

Word Cloud
DIALYZED supersimplify seeking sobriety reuplift permeations unslaked accordingly prankier quintar phonemicizing prognosticate unrealism FLAP-MOUTHED jaquima

πŸ“Œ Day 3: Data Cleaning - Missing Values & Duplicates

Prerequisite Setup (Dirty Dataset)

python
import pandas as pd
import numpy as np

# Create a messy dataset with missing values, duplicates, and outliers
np.random.seed(42)  # for reproducibility

df = pd.DataFrame({
    'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'],
    'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR'],
    'Salary': [50000, 60000, np.nan, 55000, 48000, 80000, 65000, np.nan, 52000, 70000],
    'Age': [25, 30, 35, 28, 22, 40, 33, 29, np.nan, 45],
    'Experience': [2, 5, 8, 3, 1, 12, 6, 4, 2, 15],
    'Performance_Score': [85, 92, 78, 88, 95, 999, 82, 87, 90, 200]  # Outliers!
})

print("Original DataFrame (with issues):")
print(df)
print("\n" + "="*60)

Issues in this dataset:

  • Missing values (NaN) in Salary (row 2, 7) and Age (row 8)

  • Outliers in Performance_Score (999, 200)

  • We'll add duplicates later


1️⃣ Count missing values per column using isnull().sum()

Understanding missing values in pandas

  • Missing values are represented as NaN (Not a Number) or None

  • isnull() returns True for missing values, False otherwise

  • sum() counts True as 1, False as 0

Basic missing value detection

python
# Detect missing values (True/False mask)
print(df.isnull())
print("\n" + "="*60)

# Count missing per column
missing_count = df.isnull().sum()
print("Missing values per column:")
print(missing_count)

Output:

text
Missing values per column:
Employee              0
Department            0
Salary                2
Age                   1
Experience            0
Performance_Score     0
dtype: int64

Percentage of missing values

python
# Percentage of missing values
missing_percent = (df.isnull().sum() / len(df)) * 100
print("\nMissing values percentage:")
print(missing_percent)

Output:

text
Missing values percentage:
Employee              0.0
Department            0.0
Salary               20.0
Age                  10.0
Experience            0.0
Performance_Score     0.0
dtype: float64

Visual missing data summary

python
# Create a missing data summary DataFrame
missing_summary = pd.DataFrame({
    'Column': df.columns,
    'Missing_Count': df.isnull().sum(),
    'Missing_Percent': (df.isnull().sum() / len(df)) * 100
})
print("\nMissing Data Summary:")
print(missing_summary)

Check if ANY missing values exist

python
# Check if there's any missing value in entire DataFrame
print("\nAny missing values in entire DataFrame?", df.isnull().any().any())
print("Any missing values in Salary?", df['Salary'].isnull().any())
print("Rows with ANY missing values:", df.isnull().any(axis=1).sum())

2️⃣ Fill missing numeric values with mean/median

Why fill missing values?

  • Deleting rows loses data

  • Many ML algorithms don't accept NaN

  • Mean/median preserves dataset size

Calculate statistics first

python
# Calculate mean and median before filling
salary_mean = df['Salary'].mean()
salary_median = df['Salary'].median()
age_mean = df['Age'].mean()
age_median = df['Age'].median()

print(f"Salary Mean: ${salary_mean:,.0f}")
print(f"Salary Median: ${salary_median:,.0f}")
print(f"Age Mean: {age_mean:.1f}")
print(f"Age Median: {age_median:.1f}")

Method 1: Fill with mean

python
# Create a copy to preserve original
df_filled_mean = df.copy()

# Fill missing Salary with mean
df_filled_mean['Salary'] = df_filled_mean['Salary'].fillna(salary_mean)
df_filled_mean['Age'] = df_filled_mean['Age'].fillna(age_mean)

print("\nDataFrame after filling with MEAN:")
print(df_filled_mean[['Employee', 'Salary', 'Age']])

Method 2: Fill with median (better for skewed data)

python
# Create another copy
df_filled_median = df.copy()

# Fill missing Salary with median
df_filled_median['Salary'] = df_filled_median['Salary'].fillna(salary_median)
df_filled_median['Age'] = df_filled_median['Age'].fillna(age_median)

print("\nDataFrame after filling with MEDIAN:")
print(df_filled_median[['Employee', 'Salary', 'Age']])

Method 3: Fill different columns with different strategies

python
df_filled_mixed = df.copy()

# Fill specific columns with different values
fill_values = {
    'Salary': df['Salary'].median(),  # Use median for Salary
    'Age': df['Age'].mean()           # Use mean for Age
}

df_filled_mixed = df_filled_mixed.fillna(fill_values)
print("\nMixed filling strategy:")
print(df_filled_mixed[['Employee', 'Salary', 'Age']])

Method 4: Forward fill or backward fill

python
# Create data with time series example
time_df = pd.DataFrame({
    'Day': [1, 2, 3, 4, 5],
    'Sales': [100, np.nan, np.nan, 150, 200]
})

print("\nOriginal time series:")
print(time_df)

# Forward fill (use previous valid value)
print("\nForward fill:")
print(time_df.fillna(method='ffill'))

# Backward fill (use next valid value)
print("\nBackward fill:")
print(time_df.fillna(method='bfill'))

When to use mean vs median vs mode

python
# Mode for categorical data (if needed)
department_mode = df['Department'].mode()[0]
print(f"\nMode of Department: {department_mode}")

3️⃣ Drop rows with any missing values

Method 1: Drop rows with ANY missing value

python
# Create a clean copy
df_dropped_any = df.copy()

# Drop rows with at least one missing value
df_clean_any = df_dropped_any.dropna()
print(f"Original rows: {len(df)}")
print(f"Rows after dropping ANY missing: {len(df_clean_any)}")
print("\nCleaned DataFrame (no missing values):")
print(df_clean_any)

Method 2: Drop rows with ALL missing values

python
# Create DataFrame with rows that are all NaN
df_test = pd.DataFrame([
    [1, 2, 3],
    [np.nan, np.nan, np.nan],  # All missing
    [4, np.nan, 6],
    [np.nan, np.nan, np.nan]   # All missing
])

print("\nTest DataFrame:")
print(df_test)

# Drop rows where ALL values are missing
print("\nDrop rows with ALL missing:")
print(df_test.dropna(how='all'))

Method 3: Drop rows with missing values in specific columns

python
# Drop rows where Salary is missing (critical column)
df_dropped_salary = df.dropna(subset=['Salary'])
print(f"\nRows after dropping missing Salary: {len(df_dropped_salary)}")
print(df_dropped_salary[['Employee', 'Salary']])

# Drop rows where Age OR Salary is missing
df_dropped_both = df.dropna(subset=['Age', 'Salary'])
print(f"\nRows after dropping missing Age OR Salary: {len(df_dropped_both)}")

Method 4: Threshold-based dropping

python
# Keep rows with at least 4 non-missing values
df_threshold = df.dropna(thresh=4)  # thresh = minimum number of non-NA values
print(f"\nRows after threshold dropping (need 4 non-missing): {len(df_threshold)}")

4️⃣ Remove duplicate rows and keep first/last

Create duplicates first

python
# Add duplicate rows to our dataset
df_with_dupes = df.copy()

# Create duplicate of first row
duplicate_row = df_with_dupes.iloc[[0]]
df_with_dupes = pd.concat([df_with_dupes, duplicate_row], ignore_index=True)

# Create duplicate of last row
duplicate_last = df_with_dupes.iloc[[-2]]
df_with_dupes = pd.concat([df_with_dupes, duplicate_last], ignore_index=True)

print("\n" + "="*60)
print("DataFrame WITH duplicates:")
print(df_with_dupes)
print(f"\nTotal rows (with duplicates): {len(df_with_dupes)}")

Detect duplicates

python
# Check for duplicate rows (all columns)
print("\nDuplicate rows (all columns):")
print(df_with_dupes.duplicated())

# Count duplicates
print(f"\nNumber of duplicate rows: {df_with_dupes.duplicated().sum()}")

# Show duplicate rows
print("\nDuplicate rows (displayed):")
print(df_with_dupes[df_with_dupes.duplicated(keep=False)])

Remove duplicates - keep first occurrence

python
# Remove duplicates, keep first occurrence
df_no_dupes_first = df_with_dupes.drop_duplicates(keep='first')
print(f"\nAfter removing duplicates (keep='first'): {len(df_no_dupes_first)} rows")

Remove duplicates - keep last occurrence

python
# Remove duplicates, keep last occurrence
df_no_dupes_last = df_with_dupes.drop_duplicates(keep='last')
print(f"After removing duplicates (keep='last'): {len(df_no_dupes_last)} rows")

Remove duplicates - remove all duplicates

python
# Remove all rows that are duplicates (keep none)
df_no_dupes_none = df_with_dupes.drop_duplicates(keep=False)
print(f"After removing ALL duplicates (keep=False): {len(df_no_dupes_none)} rows")

Find duplicates based on specific columns

python
# Add some semi-duplicates
df_test = pd.DataFrame({
    'ID': [1, 2, 3, 4, 1, 2],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Bob'],
    'Score': [85, 92, 78, 88, 85, 95]
})

print("\n" + "="*60)
print("Semi-duplicate example:")
print(df_test)

# Find duplicates based only on ID and Name
print("\nDuplicates based on ID and Name columns:")
print(df_test[df_test.duplicated(subset=['ID', 'Name'], keep=False)])

# Remove duplicates based on specific columns
df_clean_semi = df_test.drop_duplicates(subset=['ID', 'Name'], keep='first')
print("\nAfter removing duplicates based on ID and Name:")
print(df_clean_semi)

5️⃣ Replace outliers using quantile caps

What are outliers?

Outliers are extreme values that can skew analysis. Common approach: Cap at 1st and 99th percentiles.

Detect outliers first

python
# Use our original dataset with outliers in Performance_Score
print("\n" + "="*60)
print("Outlier Detection in Performance_Score:")
print(df[['Employee', 'Performance_Score']])

# Calculate percentiles
Q1 = df['Performance_Score'].quantile(0.25)
Q3 = df['Performance_Score'].quantile(0.75)
IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"\nQ1 (25th percentile): {Q1}")
print(f"Q3 (75th percentile): {Q3}")
print(f"IQR: {IQR}")
print(f"Lower bound: {lower_bound}")
print(f"Upper bound: {upper_bound}")

# Identify outliers
outliers = df[(df['Performance_Score'] < lower_bound) | (df['Performance_Score'] > upper_bound)]
print(f"\nOutliers detected:")
print(outliers[['Employee', 'Performance_Score']])

Method 1: Cap outliers at percentiles

python
# Create a copy
df_capped = df.copy()

# Define percentile caps (e.g., cap at 5th and 95th percentile)
lower_cap = df_capped['Performance_Score'].quantile(0.05)
upper_cap = df_capped['Performance_Score'].quantile(0.95)

print(f"\nLower cap (5th percentile): {lower_cap}")
print(f"Upper cap (95th percentile): {upper_cap}")

# Apply caps
df_capped['Performance_Score_Capped'] = df_capped['Performance_Score'].clip(
    lower=lower_cap, 
    upper=upper_cap
)

print("\nBefore vs After capping:")
comparison = df_capped[['Employee', 'Performance_Score', 'Performance_Score_Capped']]
print(comparison)

Method 2: Cap using IQR method (1.5*IQR rule)

python
df_iqr_capped = df.copy()

# Calculate bounds using IQR method
Q1 = df_iqr_capped['Performance_Score'].quantile(0.25)
Q3 = df_iqr_capped['Performance_Score'].quantile(0.75)
IQR = Q3 - Q1

lower_bound_iqr = Q1 - 1.5 * IQR
upper_bound_iqr = Q3 + 1.5 * IQR

# Cap outliers
df_iqr_capped['Performance_Score_IQR_Capped'] = df_iqr_capped['Performance_Score'].clip(
    lower=lower_bound_iqr,
    upper=upper_bound_iqr
)

print("\nIQR Method Capping:")
print(f"Lower bound: {lower_bound_iqr}")
print(f"Upper bound: {upper_bound_iqr}")
print("\nResults:")
print(df_iqr_capped[['Employee', 'Performance_Score', 'Performance_Score_IQR_Capped']])

Method 3: Winsorization (replace outliers with nearest non-outlier)

python
# Winsorization at 5% level (cap top and bottom 5%)
def winsorize(series, limits=(0.05, 0.05)):
    """Replace outliers with nearest non-outlier values"""
    lower = series.quantile(limits[0])
    upper = series.quantile(1 - limits[1])
    return series.clip(lower=lower, upper=upper)

df_winsorized = df.copy()
df_winsorized['Performance_Score_Winsorized'] = winsorize(df_winsorized['Performance_Score'])

print("\nWinsorized (5% capping):")
print(df_winsorized[['Employee', 'Performance_Score', 'Performance_Score_Winsorized']])

Practical example: Handle multiple columns with outliers

python
# Create dataset with multiple outlier columns
df_multi = pd.DataFrame({
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Price': [100, 150, 120, 1000, 110],  # 1000 is outlier
    'Quantity': [50, 60, 55, 500, 52]      # 500 is outlier
})

print("\n" + "="*60)
print("Multiple Column Outlier Treatment:")
print(df_multi)

# Function to cap outliers in a column
def cap_outliers(series, lower_percentile=0.01, upper_percentile=0.99):
    lower = series.quantile(lower_percentile)
    upper = series.quantile(upper_percentile)
    return series.clip(lower=lower, upper=upper)

# Apply to numeric columns
numeric_cols = df_multi.select_dtypes(include=['int64', 'float64']).columns
df_cleaned = df_multi.copy()

for col in numeric_cols:
    df_cleaned[f'{col}_capped'] = cap_outliers(df_multi[col])

print("\nBefore vs After:")
print(df_cleaned)

🎯 Complete Hands-on Checklist (Run this!)

python
import pandas as pd
import numpy as np

# Load messy dataset
df_messy = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5, 1, 2, 6, 7, 8],
    'Name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie', 'John', 'Jane', 'David', 'Emma', 'Frank'],
    'Age': [25, 30, np.nan, 28, 35, 25, 30, 40, 29, 32],
    'Salary': [50000, 60000, 55000, np.nan, 70000, 50000, 60000, 120000, 58000, 62000],
    'Score': [85, 92, 78, 88, 95, 85, 92, 98, 87, 300]  # Has outlier
})

print("="*60)
print("STEP 1: Count missing values")
print("="*60)
print(df_messy.isnull().sum())
print(f"\nPercentage missing:\n{(df_messy.isnull().sum() / len(df_messy)) * 100}")

print("\n" + "="*60)
print("STEP 2: Fill missing values")
print("="*60)
df_filled = df_messy.copy()
df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].median())
df_filled['Salary'] = df_filled['Salary'].fillna(df_filled['Salary'].mean())
print("After filling:")
print(df_filled[['ID', 'Age', 'Salary']])

print("\n" + "="*60)
print("STEP 3: Drop rows with missing values")
print("="*60)
df_dropped = df_messy.dropna()
print(f"Original: {len(df_messy)} rows, After drop: {len(df_dropped)} rows")

print("\n" + "="*60)
print("STEP 4: Remove duplicates")
print("="*60)
print(f"Duplicate count: {df_messy.duplicated().sum()}")
df_unique = df_messy.drop_duplicates(keep='first')
print(f"After removing duplicates: {len(df_unique)} rows")

print("\n" + "="*60)
print("STEP 5: Replace outliers")
print("="*60)
print("Original Score column stats:")
print(df_messy['Score'].describe())

# Cap at 95th percentile
cap = df_messy['Score'].quantile(0.95)
df_messy['Score_Capped'] = df_messy['Score'].clip(upper=cap)
print(f"\nCapped at 95th percentile ({cap}):")
print(df_messy[['Score', 'Score_Capped']])

❌ Common Mistakes & Solutions

MistakeWhy it failsCorrect way
df.fillna(df.mean())Inplace operation not specifieddf.fillna(df.mean(), inplace=True) or assign to new variable
Using mean when data has outliersMean is skewed by outliersUse median for skewed distributions
df.dropna() without assigningReturns new DataFrame, doesn't modify originaldf_clean = df.dropna() or df.dropna(inplace=True)
Confusing duplicated() and drop_duplicates()One detects, one removesdf.duplicated() returns boolean mask; df.drop_duplicates() removes
Capping before analyzing distributionMight hide important patternsAlways visualize before capping

πŸ“ Practice Exercises

  1. Load a dataset with missing values and calculate the percentage of missing values in each column.

  2. Fill missing values differently: numeric with median, categorical with mode.

  3. Remove rows where more than 2 columns have missing values.

  4. Find and remove duplicate rows based on specific columns (e.g., 'ID' and 'Name').

  5. Cap outliers in a column at the 1st and 99th percentiles.


βœ… Summary of Day 3

TaskMethodKey Parameters
Count missingdf.isnull().sum()None
Fill missingdf.fillna(value)value, method='ffill'
Drop missingdf.dropna()how='any'/'all', subset
Find duplicatesdf.duplicated()keep='first'/'last'/False
Remove duplicatesdf.drop_duplicates()keep, subset
Cap outliersdf['col'].clip(lower, upper)lower, upper

πŸŽ“ Pro Tips for Real World

  1. Always check missing data pattern - Is it random or systematic?

  2. Document your cleaning steps - Future you will thank you

  3. Keep original data - Work on copies, never modify original

  4. Visualize before/after - Use histograms to see outlier impact

  5. Consider domain knowledge - Sometimes 999 is a valid code, not outlier

Discussion (0)
Login to comment
Dictionary

Add New Word

Dictionary Words
My Notes
Highlights
Select text and click highlight to save
My Vocabulary
Quick Quiz
Settings
Reading Analytics
Today's reading: 0 min
Total read time: 0 min
Words learned: 0
Streak: 0 days
AI Summary

Generating summary...