π Day 2: Data Selection & Filtering
Prerequisite Setup (run this first)
python
import pandas as pd import numpy as np # Create sample dataset for all examples df = pd.DataFrame({ 'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'], 'Age': [25, 30, 35, 28, 22, 40, 33, 29], 'City': ['NYC', 'LA', 'Chicago', 'NYC', 'LA', 'Chicago', 'NYC', 'LA'], 'Salary': [50000, 60000, 70000, 55000, 48000, 80000, 65000, 52000], 'Experience': [2, 5, 8, 3, 1, 12, 6, 4] }, index=['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']) print("Original DataFrame:") print(df)
1οΈβ£ Select single/multiple columns using bracket notation
Single column (returns Series)
python
# Method 1: df['column_name'] names = df['Name'] print(names) print(type(names)) # <class 'pandas.core.series.Series'>
Output:
text
A Alice B Bob C Charlie D Diana E Eve F Frank G Grace H Henry Name: Name, dtype: object
Multiple columns (returns DataFrame)
python
# Method: df[['col1', 'col2']] - Note the double brackets! subset = df[['Name', 'Age', 'Salary']] print(subset)
Output:
text
Name Age Salary A Alice 25 50000 B Bob 30 60000 C Charlie 35 70000 D Diana 28 55000 E Eve 22 48000 F Frank 40 80000 G Grace 33 65000 H Henry 29 52000
Why double brackets?
Single bracket
[]returns SeriesDouble brackets
[[]]returns DataFramedf['Name']vsdf[['Name']]β first is Series, second is 1-column DataFrame
Quick tip: Select columns by data type
python
# Select only numeric columns numeric_cols = df.select_dtypes(include=['int64', 'float64']) print(numeric_cols)
2οΈβ£ Use .loc to filter rows by label range
What is .loc?
Label-based indexing β uses index labels (row names/numbers) and column names.
Basic syntax: df.loc[row_labels, column_labels]
Select specific rows by label
python
# Single row print(df.loc['C']) # Returns Series # Multiple rows print(df.loc[['A', 'C', 'E']]) # Returns DataFrame
Label range (INCLUSIVE of both ends!)
python
# Rows from 'B' to 'E' (includes both B and E) print(df.loc['B':'E'])
Output:
text
Name Age City Salary Experience B Bob 30 LA 60000 5 C Charlie 35 Chicago 70000 8 D Diana 28 NYC 55000 3 E Eve 22 LA 48000 1
Select specific rows and columns
python
# Rows 'A' to 'D', only 'Name' and 'Salary' columns print(df.loc['A':'D', ['Name', 'Salary']])
Output:
text
Name Salary A Alice 50000 B Bob 60000 C Charlie 70000 D Diana 55000
All rows, specific columns
python
print(df.loc[:, ['Name', 'City']]) # : means all rows
3οΈβ£ Use .iloc to slice rows and columns by position
What is .iloc?
Integer position-based indexing β works like Python list slicing.
Basic syntax: df.iloc[row_positions, column_positions]
Select by integer positions
python
# First 3 rows (positions 0,1,2) print(df.iloc[0:3]) # Note: end exclusive (like list slicing)
Output (rows 0,1,2):
text
Name Age City Salary Experience A Alice 25 NYC 50000 2 B Bob 30 LA 60000 5 C Charlie 35 Chicago 70000 8
Row and column slicing
python
# Rows 2-5 (positions 2 to 4 inclusive), columns 0-2 (Name, Age, City) print(df.iloc[2:5, 0:3])
Output:
text
Name Age City C Charlie 35 Chicago D Diana 28 NYC E Eve 22 LA
Specific row and column positions
python
# Row at position 3, column at position 4 print(df.iloc[3, 4]) # Output: 3 (Experience of Diana) # Multiple specific rows and columns print(df.iloc[[0, 2, 5], [0, 3]]) # Rows 0,2,5 and columns 0,3
All rows, specific column positions
python
# All rows, columns 1 and 3 (Age and Salary) print(df.iloc[:, [1, 3]])
4οΈβ£ Filter DataFrame with multiple conditions (&, |)
IMPORTANT: Use & for AND, | for OR, and wrap each condition in ()
Single condition (review)
python
# Age > 30 older = df[df['Age'] > 30] print(older)
Multiple conditions with AND (&)
python
# Age between 25 AND 35, AND Salary > 55000 filtered = df[(df['Age'] >= 25) & (df['Age'] <= 35) & (df['Salary'] > 55000)] print(filtered)
Output:
text
Name Age City Salary Experience B Bob 30 LA 60000 5 C Charlie 35 Chicago 70000 8 G Grace 33 NYC 65000 6
Multiple conditions with OR (|)
python
# Either from NYC OR from LA nyc_or_la = df[(df['City'] == 'NYC') | (df['City'] == 'LA')] print(nyc_or_la)
Complex combination (AND + OR)
python
# (Age > 30 AND Salary > 60000) OR (Experience > 5) complex_filter = df[((df['Age'] > 30) & (df['Salary'] > 60000)) | (df['Experience'] > 5)] print(complex_filter)
5οΈβ£ Use .isin() and .between() for complex filters
.isin() β Check if values are in a list
python
# Filter cities in a list cities = ['NYC', 'Chicago'] filtered_cities = df[df['City'].isin(cities)] print(filtered_cities)
Output:
text
Name Age City Salary Experience A Alice 25 NYC 50000 2 C Charlie 35 Chicago 70000 8 D Diana 28 NYC 55000 3 G Grace 33 NYC 65000 6
Inverse with ~ (NOT)
python
# All cities EXCEPT NYC and LA not_nyc_la = df[~df['City'].isin(['NYC', 'LA'])] print(not_nyc_la) # Only Chicago
.between() β Range check (inclusive)
python
# Age between 25 and 35 (inclusive) age_filter = df[df['Age'].between(25, 35)] print(age_filter) # Salary between 50000 and 65000 salary_filter = df[df['Salary'].between(50000, 65000)] print(salary_filter[['Name', 'Salary']])
Combining .isin() and .between()
python
# Age 25-35 AND City is NYC or LA result = df[ df['Age'].between(25, 35) & df['City'].isin(['NYC', 'LA']) ] print(result)
π― Complete Hands-on Checklist (Run this)
python
import pandas as pd import numpy as np # Create dataset df = pd.DataFrame({ 'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Desk', 'Chair', 'Lamp'], 'Category': ['Electronics', 'Accessories', 'Accessories', 'Electronics', 'Furniture', 'Furniture', 'Accessories'], 'Price': [1200, 25, 75, 300, 450, 200, 35], 'Stock': [10, 100, 50, 15, 5, 20, 60], 'Rating': [4.5, 4.2, 4.3, 4.7, 4.1, 4.4, 4.0] }, index=['P1', 'P2', 'P3', 'P4', 'P5', 'P6', 'P7']) print("="*50) print("1. Column Selection") print("="*50) print("Single column (Price):") print(df['Price']) print("\nMultiple columns (Product, Price):") print(df[['Product', 'Price']]) print("\n" + "="*50) print("2. .loc Label-based Filtering") print("="*50) print("Rows P2 to P4:") print(df.loc['P2':'P4']) print("\nRows P1 to P3, columns Product and Rating:") print(df.loc['P1':'P3', ['Product', 'Rating']]) print("\n" + "="*50) print("3. .iloc Position-based Filtering") print("="*50) print("First 3 rows:") print(df.iloc[0:3]) print("\nRows 2-4, columns 0 and 2:") print(df.iloc[1:4, [0, 2]]) print("\n" + "="*50) print("4. Multiple Conditions (&, |)") print("="*50) print("Price > 100 AND Stock < 30:") print(df[(df['Price'] > 100) & (df['Stock'] < 30)]) print("\n" + "="*50) print("5. .isin() and .between()") print("="*50) print("Category in [Electronics, Furniture]:") print(df[df['Category'].isin(['Electronics', 'Furniture'])]) print("\nPrice between 50 and 400:") print(df[df['Price'].between(50, 400)])
β Common Mistakes & Solutions
| Mistake | Why it fails | Correct way |
|---|---|---|
df[df['Age'] > 30 & df['Salary'] > 50000] | Missing parentheses | df[(df['Age'] > 30) & (df['Salary'] > 50000)] |
df.loc[1:3] when index is not numeric | Label vs position confusion | Use df.iloc[1:3] for positions |
df['Age', 'Name'] | Wrong syntax for multiple columns | df[['Age', 'Name']] |
df[df['City'] == 'NYC' or 'LA'] | or doesn't work with Series | df[df['City'].isin(['NYC', 'LA'])] |
df.loc['A':'D', 'Name'] | Works but returns Series | Use df.loc['A':'D', ['Name']] for DataFrame |
π Practice Exercises
From the products DataFrame, select all rows where:
Price > 100 AND Stock > 10
Use
.ilocto get rows 3-5 and columns 1, 3, 4Filter products where Category is 'Accessories' using
.isin()Use
.between()to find products with Rating between 4.2 and 4.6Combine conditions: (Price < 100 OR Category == 'Electronics') AND Stock > 20
β Summary of Day 2
| Method | Purpose | Example |
|---|---|---|
df['col'] | Single column | df['Age'] |
df[['col1','col2']] | Multiple columns | df[['Name','Age']] |
.loc[] | Label-based selection | df.loc['A':'D', 'Name'] |
.iloc[] | Position-based selection | df.iloc[0:3, 0:2] |
&, | | Multiple conditions | df[(df['Age']>25) & (df['Salary']<60000)] |
.isin() | Match any in list | df[df['City'].isin(['NYC','LA'])] |
.between() | Range filter | df[df['Age'].between(25,35)] |
Discussion (0)