No sign-in needed - English narration - Safe for all school ages
Meet Kavita โ Class 9, Chennai
Kavita downloaded a dataset of school exam results from data.gov.in โ 2,400 students from Tamil Nadu. Excited, she loaded it into Colab and immediately ran her scikit-learn classifier from last lesson. Accuracy: 51%. Worse than random guessing.
Confused, she looked at the raw data. Some rows had "N/A" for marks. One row had marks of 999. Three students appeared twice. City names included "Chennai", "chennai", and "CHENNAI" โ all meant the same thing. Her data was messy. After two hours of cleaning, accuracy jumped to 86%. Today you'll learn exactly what she did โ and why data cleaning is where real AI work happens.
Why This Matters
Garbage In, Garbage Out
Data scientists say "80% of AI work is data preparation." The quality of your model is directly limited by the quality of your data. A perfect algorithm on dirty data will fail. A simple algorithm on clean data will surprise you.
Famous example: In 2016, an Amazon hiring AI was found to penalise resumes that mentioned the word "women's" (e.g., "women's chess club"). The model learned bias from 10 years of historically male-dominated hiring data. Dirty training data โ biased model.
Part 1
The Five Most Common Data Problems
โ
Missing Values
Blank cells or NaN (Not a Number). Happens when data wasn't recorded, wasn't available, or was accidentally deleted.
Fix: Fill with mean/median/mode, or drop rows with too many missing values
๐
Duplicates
Same row appears twice or more. Happens from data entry errors, merging multiple sources, or form resubmission.
Fix: df.drop_duplicates()
๐ฏ
Outliers
Values that are impossibly large or small. Age = 999, marks = -5, income = 0 rupees.
Fix: Filter rows outside a reasonable range, or cap at a percentile
๐ค
Inconsistent Categories
"Mumbai", "mumbai", "MUMBAI" all mean the same thing but are treated as 3 separate categories by a model.
Fix: df['city'].str.lower().str.strip()
๐ข
Wrong Data Types
Age stored as text "25" (string) not 25 (integer). Date stored as "15 Aug 2023" not a proper datetime object.
Fix: df['age'] = df['age'].astype(int)
student_id
marks
city
grade
passed
S001
72
Hyderabad
B
Yes
S002
NaN
Chennai
C
Yes
S003
999
chennai
A
Yes
S002
45
Chennai
C
No
S005
88
MUMBAI
A
Yes
Red values above show the problems: NaN (missing), 999 (outlier), "chennai" โ "Chennai" (inconsistent), S002 appears twice (duplicate).
Part 2
Handling Missing Values
There are three main strategies for dealing with NaN values:
Drop the row: Use when only a few rows have missing values and you have lots of data. Risk: lose information.
Fill with mean/median: Replace NaN with the average (mean) or middle value (median) of that column. Good for numerical columns.
Fill with mode: Replace NaN with the most common value. Good for categorical columns.
import pandas as pd
import numpy as np
# Create sample data with missing values
data = {'marks': [72, np.nan, 88, 45, np.nan, 93],
'city': ['Mumbai', np.nan, 'Delhi', 'Mumbai', 'Chennai', np.nan],
'passed': [1, 1, 1, 0, 0, 1]}
df = pd.DataFrame(data)
print("Missing values before:\n", df.isnull().sum())
# Strategy 1: Drop rows where ANY column has NaN
df_dropped = df.dropna()
print(f"Rows after drop: {len(df_dropped)}") # Loses rows!
# Strategy 2: Fill numerical with median
df['marks'] = df['marks'].fillna(df['marks'].median())
# Strategy 3: Fill categorical with mode (most common value)
df['city'] = df['city'].fillna(df['city'].mode()[0])
print("Missing values after:\n", df.isnull().sum())
print(df)
Which to use? Use median (not mean) when you have outliers โ e.g., if one student has marks=999, the mean is distorted but the median is not. For "city", use mode (most common city).
Part 3
Removing Duplicates and Fixing Text
# Removing duplicate rows
df = df.drop_duplicates()
print(f"Rows after removing duplicates: {len(df)}")
# Fixing inconsistent city names
df['city'] = df['city'].str.lower().str.strip()
# Now 'Chennai', 'CHENNAI', 'chennai' all become 'chennai'
# Check unique values after fixing
print(df['city'].unique())
# Fix wrong data types (age stored as string "25")
df['age'] = df['age'].astype(int)
# Handle outliers: marks should be between 0 and 100
df = df[df['marks'].between(0, 100)]
print(f"Rows after removing outliers: {len(df)}")
Part 4
Encoding Categorical Data
Machine learning models understand numbers, not text. The city column ("mumbai", "delhi", "chennai") must be converted to numbers before training. There are two common approaches:
Label Encoding
Assign each category a number: delhi=0, chennai=1, mumbai=2. Simple, but implies a fake order (Mumbai > Delhi).
Use only when categories have a natural order (e.g., Low/Medium/High)
One-Hot Encoding
Creates a new binary (0/1) column for each category. "city_mumbai", "city_delhi", "city_chennai". No fake ordering.
Rule of thumb: If a column has more than 30โ40% missing values, consider dropping that column entirely rather than filling it โ there's too little data to trust. Document every cleaning decision you make so you can reproduce or reverse it.
๐งช Check Your Understanding โ Lesson 3 Quiz
1. You have 1,000 rows and 3 rows have missing "marks" values. What is the BEST strategy?
a) Drop those 3 rows (very small loss of data)
b) Fill all missing values with the number 0
c) Delete the entire "marks" column
d) Fill with median marks from the column
2. Why is it better to fill missing numerical values with the MEDIAN rather than the MEAN when outliers exist?
a) Median is always larger than mean
b) Mean is harder to calculate
c) Median is not affected by extreme values (outliers), so it better represents typical data
d) Median works only for categorical data
3. You have a "city" column with "Delhi", "delhi", "DELHI". What is the correct pandas fix?
a) df['city'] = df['city'].astype(int)
b) df['city'] = df['city'].str.lower().str.strip()
c) df.drop_duplicates()
d) df['city'].fillna('unknown')
4. One-hot encoding is preferred over label encoding for city names because:
a) It uses less memory
b) Label encoding would imply a fake order (Delhi=0 < Mumbai=1) that doesn't really exist
c) One-hot encoding is faster to compute
d) Label encoding does not work on strings
5. A marks column contains a value of 999. This is most likely:
a) A perfectly valid mark
b) A missing-value code used by the data source instead of NaN
c) An ordinal value
d) A binary value
6. Which pandas command removes rows that are exact duplicates?
a) df.dropna()
b) df.reset_index()
c) df.drop_duplicates()
d) df.fillna(0)
7. If 45% of a column's values are missing, the best action is usually:
a) Fill all missing values with 0
b) Drop the entire column โ too little data to trust
c) Fill with the maximum value
d) Leave the NaN values as-is for the model to handle
8. After cleaning, how do you verify your data has zero missing values?