π Day 3: Data Cleaning - Missing Values & Duplicates
Prerequisite Setup (Dirty Dataset)
import pandas as pd import numpy as np # Create a messy dataset with missing values, duplicates, and outliers np.random.seed(42) # for reproducibility df = pd.DataFrame({ 'Employee': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry', 'Ivy', 'Jack'], 'Department': ['Sales', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR', 'IT', 'Sales', 'HR'], 'Salary': [50000, 60000, np.nan, 55000, 48000, 80000, 65000, np.nan, 52000, 70000], 'Age': [25, 30, 35, 28, 22, 40, 33, 29, np.nan, 45], 'Experience': [2, 5, 8, 3, 1, 12, 6, 4, 2, 15], 'Performance_Score': [85, 92, 78, 88, 95, 999, 82, 87, 90, 200] # Outliers! }) print("Original DataFrame (with issues):") print(df) print("\n" + "="*60)
Issues in this dataset:
Missing values (NaN) in Salary (row 2, 7) and Age (row 8)
Outliers in Performance_Score (999, 200)
We'll add duplicates later
1οΈβ£ Count missing values per column using isnull().sum()
Understanding missing values in pandas
Missing values are represented as
NaN(Not a Number) orNoneisnull()returns True for missing values, False otherwisesum()counts True as 1, False as 0
Basic missing value detection
# Detect missing values (True/False mask) print(df.isnull()) print("\n" + "="*60) # Count missing per column missing_count = df.isnull().sum() print("Missing values per column:") print(missing_count)
Output:
Missing values per column: Employee 0 Department 0 Salary 2 Age 1 Experience 0 Performance_Score 0 dtype: int64
Percentage of missing values
# Percentage of missing values missing_percent = (df.isnull().sum() / len(df)) * 100 print("\nMissing values percentage:") print(missing_percent)
Output:
Missing values percentage: Employee 0.0 Department 0.0 Salary 20.0 Age 10.0 Experience 0.0 Performance_Score 0.0 dtype: float64
Visual missing data summary
# Create a missing data summary DataFrame missing_summary = pd.DataFrame({ 'Column': df.columns, 'Missing_Count': df.isnull().sum(), 'Missing_Percent': (df.isnull().sum() / len(df)) * 100 }) print("\nMissing Data Summary:") print(missing_summary)
Check if ANY missing values exist
# Check if there's any missing value in entire DataFrame print("\nAny missing values in entire DataFrame?", df.isnull().any().any()) print("Any missing values in Salary?", df['Salary'].isnull().any()) print("Rows with ANY missing values:", df.isnull().any(axis=1).sum())
2οΈβ£ Fill missing numeric values with mean/median
Why fill missing values?
Deleting rows loses data
Many ML algorithms don't accept NaN
Mean/median preserves dataset size
Calculate statistics first
# Calculate mean and median before filling salary_mean = df['Salary'].mean() salary_median = df['Salary'].median() age_mean = df['Age'].mean() age_median = df['Age'].median() print(f"Salary Mean: ${salary_mean:,.0f}") print(f"Salary Median: ${salary_median:,.0f}") print(f"Age Mean: {age_mean:.1f}") print(f"Age Median: {age_median:.1f}")
Method 1: Fill with mean
# Create a copy to preserve original df_filled_mean = df.copy() # Fill missing Salary with mean df_filled_mean['Salary'] = df_filled_mean['Salary'].fillna(salary_mean) df_filled_mean['Age'] = df_filled_mean['Age'].fillna(age_mean) print("\nDataFrame after filling with MEAN:") print(df_filled_mean[['Employee', 'Salary', 'Age']])
Method 2: Fill with median (better for skewed data)
# Create another copy df_filled_median = df.copy() # Fill missing Salary with median df_filled_median['Salary'] = df_filled_median['Salary'].fillna(salary_median) df_filled_median['Age'] = df_filled_median['Age'].fillna(age_median) print("\nDataFrame after filling with MEDIAN:") print(df_filled_median[['Employee', 'Salary', 'Age']])
Method 3: Fill different columns with different strategies
df_filled_mixed = df.copy() # Fill specific columns with different values fill_values = { 'Salary': df['Salary'].median(), # Use median for Salary 'Age': df['Age'].mean() # Use mean for Age } df_filled_mixed = df_filled_mixed.fillna(fill_values) print("\nMixed filling strategy:") print(df_filled_mixed[['Employee', 'Salary', 'Age']])
Method 4: Forward fill or backward fill
# Create data with time series example time_df = pd.DataFrame({ 'Day': [1, 2, 3, 4, 5], 'Sales': [100, np.nan, np.nan, 150, 200] }) print("\nOriginal time series:") print(time_df) # Forward fill (use previous valid value) print("\nForward fill:") print(time_df.fillna(method='ffill')) # Backward fill (use next valid value) print("\nBackward fill:") print(time_df.fillna(method='bfill'))
When to use mean vs median vs mode
# Mode for categorical data (if needed) department_mode = df['Department'].mode()[0] print(f"\nMode of Department: {department_mode}")
3οΈβ£ Drop rows with any missing values
Method 1: Drop rows with ANY missing value
# Create a clean copy df_dropped_any = df.copy() # Drop rows with at least one missing value df_clean_any = df_dropped_any.dropna() print(f"Original rows: {len(df)}") print(f"Rows after dropping ANY missing: {len(df_clean_any)}") print("\nCleaned DataFrame (no missing values):") print(df_clean_any)
Method 2: Drop rows with ALL missing values
# Create DataFrame with rows that are all NaN df_test = pd.DataFrame([ [1, 2, 3], [np.nan, np.nan, np.nan], # All missing [4, np.nan, 6], [np.nan, np.nan, np.nan] # All missing ]) print("\nTest DataFrame:") print(df_test) # Drop rows where ALL values are missing print("\nDrop rows with ALL missing:") print(df_test.dropna(how='all'))
Method 3: Drop rows with missing values in specific columns
# Drop rows where Salary is missing (critical column) df_dropped_salary = df.dropna(subset=['Salary']) print(f"\nRows after dropping missing Salary: {len(df_dropped_salary)}") print(df_dropped_salary[['Employee', 'Salary']]) # Drop rows where Age OR Salary is missing df_dropped_both = df.dropna(subset=['Age', 'Salary']) print(f"\nRows after dropping missing Age OR Salary: {len(df_dropped_both)}")
Method 4: Threshold-based dropping
# Keep rows with at least 4 non-missing values df_threshold = df.dropna(thresh=4) # thresh = minimum number of non-NA values print(f"\nRows after threshold dropping (need 4 non-missing): {len(df_threshold)}")
4οΈβ£ Remove duplicate rows and keep first/last
Create duplicates first
# Add duplicate rows to our dataset df_with_dupes = df.copy() # Create duplicate of first row duplicate_row = df_with_dupes.iloc[[0]] df_with_dupes = pd.concat([df_with_dupes, duplicate_row], ignore_index=True) # Create duplicate of last row duplicate_last = df_with_dupes.iloc[[-2]] df_with_dupes = pd.concat([df_with_dupes, duplicate_last], ignore_index=True) print("\n" + "="*60) print("DataFrame WITH duplicates:") print(df_with_dupes) print(f"\nTotal rows (with duplicates): {len(df_with_dupes)}")
Detect duplicates
# Check for duplicate rows (all columns) print("\nDuplicate rows (all columns):") print(df_with_dupes.duplicated()) # Count duplicates print(f"\nNumber of duplicate rows: {df_with_dupes.duplicated().sum()}") # Show duplicate rows print("\nDuplicate rows (displayed):") print(df_with_dupes[df_with_dupes.duplicated(keep=False)])
Remove duplicates - keep first occurrence
# Remove duplicates, keep first occurrence df_no_dupes_first = df_with_dupes.drop_duplicates(keep='first') print(f"\nAfter removing duplicates (keep='first'): {len(df_no_dupes_first)} rows")
Remove duplicates - keep last occurrence
# Remove duplicates, keep last occurrence df_no_dupes_last = df_with_dupes.drop_duplicates(keep='last') print(f"After removing duplicates (keep='last'): {len(df_no_dupes_last)} rows")
Remove duplicates - remove all duplicates
# Remove all rows that are duplicates (keep none) df_no_dupes_none = df_with_dupes.drop_duplicates(keep=False) print(f"After removing ALL duplicates (keep=False): {len(df_no_dupes_none)} rows")
Find duplicates based on specific columns
# Add some semi-duplicates df_test = pd.DataFrame({ 'ID': [1, 2, 3, 4, 1, 2], 'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Alice', 'Bob'], 'Score': [85, 92, 78, 88, 85, 95] }) print("\n" + "="*60) print("Semi-duplicate example:") print(df_test) # Find duplicates based only on ID and Name print("\nDuplicates based on ID and Name columns:") print(df_test[df_test.duplicated(subset=['ID', 'Name'], keep=False)]) # Remove duplicates based on specific columns df_clean_semi = df_test.drop_duplicates(subset=['ID', 'Name'], keep='first') print("\nAfter removing duplicates based on ID and Name:") print(df_clean_semi)
5οΈβ£ Replace outliers using quantile caps
What are outliers?
Outliers are extreme values that can skew analysis. Common approach: Cap at 1st and 99th percentiles.
Detect outliers first
# Use our original dataset with outliers in Performance_Score print("\n" + "="*60) print("Outlier Detection in Performance_Score:") print(df[['Employee', 'Performance_Score']]) # Calculate percentiles Q1 = df['Performance_Score'].quantile(0.25) Q3 = df['Performance_Score'].quantile(0.75) IQR = Q3 - Q1 lower_bound = Q1 - 1.5 * IQR upper_bound = Q3 + 1.5 * IQR print(f"\nQ1 (25th percentile): {Q1}") print(f"Q3 (75th percentile): {Q3}") print(f"IQR: {IQR}") print(f"Lower bound: {lower_bound}") print(f"Upper bound: {upper_bound}") # Identify outliers outliers = df[(df['Performance_Score'] < lower_bound) | (df['Performance_Score'] > upper_bound)] print(f"\nOutliers detected:") print(outliers[['Employee', 'Performance_Score']])
Method 1: Cap outliers at percentiles
# Create a copy df_capped = df.copy() # Define percentile caps (e.g., cap at 5th and 95th percentile) lower_cap = df_capped['Performance_Score'].quantile(0.05) upper_cap = df_capped['Performance_Score'].quantile(0.95) print(f"\nLower cap (5th percentile): {lower_cap}") print(f"Upper cap (95th percentile): {upper_cap}") # Apply caps df_capped['Performance_Score_Capped'] = df_capped['Performance_Score'].clip( lower=lower_cap, upper=upper_cap ) print("\nBefore vs After capping:") comparison = df_capped[['Employee', 'Performance_Score', 'Performance_Score_Capped']] print(comparison)
Method 2: Cap using IQR method (1.5*IQR rule)
df_iqr_capped = df.copy() # Calculate bounds using IQR method Q1 = df_iqr_capped['Performance_Score'].quantile(0.25) Q3 = df_iqr_capped['Performance_Score'].quantile(0.75) IQR = Q3 - Q1 lower_bound_iqr = Q1 - 1.5 * IQR upper_bound_iqr = Q3 + 1.5 * IQR # Cap outliers df_iqr_capped['Performance_Score_IQR_Capped'] = df_iqr_capped['Performance_Score'].clip( lower=lower_bound_iqr, upper=upper_bound_iqr ) print("\nIQR Method Capping:") print(f"Lower bound: {lower_bound_iqr}") print(f"Upper bound: {upper_bound_iqr}") print("\nResults:") print(df_iqr_capped[['Employee', 'Performance_Score', 'Performance_Score_IQR_Capped']])
Method 3: Winsorization (replace outliers with nearest non-outlier)
# Winsorization at 5% level (cap top and bottom 5%) def winsorize(series, limits=(0.05, 0.05)): """Replace outliers with nearest non-outlier values""" lower = series.quantile(limits[0]) upper = series.quantile(1 - limits[1]) return series.clip(lower=lower, upper=upper) df_winsorized = df.copy() df_winsorized['Performance_Score_Winsorized'] = winsorize(df_winsorized['Performance_Score']) print("\nWinsorized (5% capping):") print(df_winsorized[['Employee', 'Performance_Score', 'Performance_Score_Winsorized']])
Practical example: Handle multiple columns with outliers
# Create dataset with multiple outlier columns df_multi = pd.DataFrame({ 'Product': ['A', 'B', 'C', 'D', 'E'], 'Price': [100, 150, 120, 1000, 110], # 1000 is outlier 'Quantity': [50, 60, 55, 500, 52] # 500 is outlier }) print("\n" + "="*60) print("Multiple Column Outlier Treatment:") print(df_multi) # Function to cap outliers in a column def cap_outliers(series, lower_percentile=0.01, upper_percentile=0.99): lower = series.quantile(lower_percentile) upper = series.quantile(upper_percentile) return series.clip(lower=lower, upper=upper) # Apply to numeric columns numeric_cols = df_multi.select_dtypes(include=['int64', 'float64']).columns df_cleaned = df_multi.copy() for col in numeric_cols: df_cleaned[f'{col}_capped'] = cap_outliers(df_multi[col]) print("\nBefore vs After:") print(df_cleaned)
π― Complete Hands-on Checklist (Run this!)
import pandas as pd import numpy as np # Load messy dataset df_messy = pd.DataFrame({ 'ID': [1, 2, 3, 4, 5, 1, 2, 6, 7, 8], 'Name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie', 'John', 'Jane', 'David', 'Emma', 'Frank'], 'Age': [25, 30, np.nan, 28, 35, 25, 30, 40, 29, 32], 'Salary': [50000, 60000, 55000, np.nan, 70000, 50000, 60000, 120000, 58000, 62000], 'Score': [85, 92, 78, 88, 95, 85, 92, 98, 87, 300] # Has outlier }) print("="*60) print("STEP 1: Count missing values") print("="*60) print(df_messy.isnull().sum()) print(f"\nPercentage missing:\n{(df_messy.isnull().sum() / len(df_messy)) * 100}") print("\n" + "="*60) print("STEP 2: Fill missing values") print("="*60) df_filled = df_messy.copy() df_filled['Age'] = df_filled['Age'].fillna(df_filled['Age'].median()) df_filled['Salary'] = df_filled['Salary'].fillna(df_filled['Salary'].mean()) print("After filling:") print(df_filled[['ID', 'Age', 'Salary']]) print("\n" + "="*60) print("STEP 3: Drop rows with missing values") print("="*60) df_dropped = df_messy.dropna() print(f"Original: {len(df_messy)} rows, After drop: {len(df_dropped)} rows") print("\n" + "="*60) print("STEP 4: Remove duplicates") print("="*60) print(f"Duplicate count: {df_messy.duplicated().sum()}") df_unique = df_messy.drop_duplicates(keep='first') print(f"After removing duplicates: {len(df_unique)} rows") print("\n" + "="*60) print("STEP 5: Replace outliers") print("="*60) print("Original Score column stats:") print(df_messy['Score'].describe()) # Cap at 95th percentile cap = df_messy['Score'].quantile(0.95) df_messy['Score_Capped'] = df_messy['Score'].clip(upper=cap) print(f"\nCapped at 95th percentile ({cap}):") print(df_messy[['Score', 'Score_Capped']])
β Common Mistakes & Solutions
| Mistake | Why it fails | Correct way |
|---|---|---|
df.fillna(df.mean()) | Inplace operation not specified | df.fillna(df.mean(), inplace=True) or assign to new variable |
| Using mean when data has outliers | Mean is skewed by outliers | Use median for skewed distributions |
df.dropna() without assigning | Returns new DataFrame, doesn't modify original | df_clean = df.dropna() or df.dropna(inplace=True) |
Confusing duplicated() and drop_duplicates() | One detects, one removes | df.duplicated() returns boolean mask; df.drop_duplicates() removes |
| Capping before analyzing distribution | Might hide important patterns | Always visualize before capping |
π Practice Exercises
Load a dataset with missing values and calculate the percentage of missing values in each column.
Fill missing values differently: numeric with median, categorical with mode.
Remove rows where more than 2 columns have missing values.
Find and remove duplicate rows based on specific columns (e.g., 'ID' and 'Name').
Cap outliers in a column at the 1st and 99th percentiles.
β Summary of Day 3
| Task | Method | Key Parameters |
|---|---|---|
| Count missing | df.isnull().sum() | None |
| Fill missing | df.fillna(value) | value, method='ffill' |
| Drop missing | df.dropna() | how='any'/'all', subset |
| Find duplicates | df.duplicated() | keep='first'/'last'/False |
| Remove duplicates | df.drop_duplicates() | keep, subset |
| Cap outliers | df['col'].clip(lower, upper) | lower, upper |
π Pro Tips for Real World
Always check missing data pattern - Is it random or systematic?
Document your cleaning steps - Future you will thank you
Keep original data - Work on copies, never modify original
Visualize before/after - Use histograms to see outlier impact
Consider domain knowledge - Sometimes 999 is a valid code, not outlier
Discussion (0)