Lesson 3 — Cleaning and Preparing Data | Class 9

Meet Kavita — Class 9, Chennai

Kavita downloaded a dataset of school exam results from data.gov.in — 2,400 students from Tamil Nadu. Excited, she loaded it into Colab and immediately ran her scikit-learn classifier from last lesson. Accuracy: 51%. Worse than random guessing.

Confused, she looked at the raw data. Some rows had "N/A" for marks. One row had marks of 999. Three students appeared twice. City names included "Chennai", "chennai", and "CHENNAI" — all meant the same thing. Her data was messy. After two hours of cleaning, accuracy jumped to 86%. Today you'll learn exactly what she did — and why data cleaning is where real AI work happens.

Why This Matters

Garbage In, Garbage Out

Data scientists say "80% of AI work is data preparation." The quality of your model is directly limited by the quality of your data. A perfect algorithm on dirty data will fail. A simple algorithm on clean data will surprise you.

Famous example: In 2016, an Amazon hiring AI was found to penalise resumes that mentioned the word "women's" (e.g., "women's chess club"). The model learned bias from 10 years of historically male-dominated hiring data. Dirty training data → biased model.

Part 1

The Five Most Common Data Problems

❓

Missing Values

Blank cells or NaN (Not a Number). Happens when data wasn't recorded, wasn't available, or was accidentally deleted.

Fix: Fill with mean/median/mode, or drop rows with too many missing values

🔁

Duplicates

Same row appears twice or more. Happens from data entry errors, merging multiple sources, or form resubmission.

Fix: df.drop_duplicates()

🎯

Outliers

Values that are impossibly large or small. Age = 999, marks = -5, income = 0 rupees.

Fix: Filter rows outside a reasonable range, or cap at a percentile

🔤

Inconsistent Categories

"Mumbai", "mumbai", "MUMBAI" all mean the same thing but are treated as 3 separate categories by a model.

Fix: df['city'].str.lower().str.strip()

🔢

Wrong Data Types

Age stored as text "25" (string) not 25 (integer). Date stored as "15 Aug 2023" not a proper datetime object.

Fix: df['age'] = df['age'].astype(int)

student_id	marks	city	grade	passed
S001	72	Hyderabad	B	Yes
S002	NaN	Chennai	C	Yes
S003	999	chennai	A	Yes
S002	45	Chennai	C	No
S005	88	MUMBAI	A	Yes

Red values above show the problems: NaN (missing), 999 (outlier), "chennai" ≠ "Chennai" (inconsistent), S002 appears twice (duplicate).

Part 2

Handling Missing Values

There are three main strategies for dealing with NaN values:

Drop the row: Use when only a few rows have missing values and you have lots of data. Risk: lose information.
Fill with mean/median: Replace NaN with the average (mean) or middle value (median) of that column. Good for numerical columns.
Fill with mode: Replace NaN with the most common value. Good for categorical columns.

import pandas as pd
import numpy as np

# Create sample data with missing values
data = {'marks': [72, np.nan, 88, 45, np.nan, 93],
        'city': ['Mumbai', np.nan, 'Delhi', 'Mumbai', 'Chennai', np.nan],
        'passed': [1, 1, 1, 0, 0, 1]}
df = pd.DataFrame(data)

print("Missing values before:\n", df.isnull().sum())

# Strategy 1: Drop rows where ANY column has NaN
df_dropped = df.dropna()
print(f"Rows after drop: {len(df_dropped)}")  # Loses rows!

# Strategy 2: Fill numerical with median
df['marks'] = df['marks'].fillna(df['marks'].median())

# Strategy 3: Fill categorical with mode (most common value)
df['city'] = df['city'].fillna(df['city'].mode()[0])

print("Missing values after:\n", df.isnull().sum())
print(df)

Which to use? Use median (not mean) when you have outliers — e.g., if one student has marks=999, the mean is distorted but the median is not. For "city", use mode (most common city).

Part 3

Removing Duplicates and Fixing Text

# Removing duplicate rows
df = df.drop_duplicates()
print(f"Rows after removing duplicates: {len(df)}")

# Fixing inconsistent city names
df['city'] = df['city'].str.lower().str.strip()
# Now 'Chennai', 'CHENNAI', 'chennai' all become 'chennai'

# Check unique values after fixing
print(df['city'].unique())

# Fix wrong data types (age stored as string "25")
df['age'] = df['age'].astype(int)

# Handle outliers: marks should be between 0 and 100
df = df[df['marks'].between(0, 100)]
print(f"Rows after removing outliers: {len(df)}")

Part 4

Encoding Categorical Data

Machine learning models understand numbers, not text. The city column ("mumbai", "delhi", "chennai") must be converted to numbers before training. There are two common approaches:

Label Encoding

Assign each category a number: delhi=0, chennai=1, mumbai=2. Simple, but implies a fake order (Mumbai > Delhi).

Use only when categories have a natural order (e.g., Low/Medium/High)

One-Hot Encoding

Creates a new binary (0/1) column for each category. "city_mumbai", "city_delhi", "city_chennai". No fake ordering.

Use for most categorical columns

# One-hot encoding with pandas
import pandas as pd

df = pd.DataFrame({'city': ['mumbai', 'delhi', 'chennai', 'mumbai']})

df_encoded = pd.get_dummies(df, columns=['city'])
print(df_encoded)
# Output:
#    city_chennai  city_delhi  city_mumbai
# 0             0           0            1
# 1             0           1            0
# 2             1           0            0
# 3             0           0            1

Part 5

The Full Data Cleaning Pipeline

Real data cleaning follows a checklist in order:

Step 1

Load & Inspect

df.head(), df.shape, df.dtypes, df.describe()

Step 2

Find Missing

df.isnull().sum() — count NaNs per column

Step 3

Remove Duplicates

df.drop_duplicates()

Step 4

Fix Text & Types

.str.lower(), .astype(int/float)

Step 5

Handle Missing

fillna(median) or dropna()

Step 6

Remove Outliers

.between() or IQR filter

Step 7

Encode Categories

pd.get_dummies() or LabelEncoder

Step 8

Verify & Save

Check shapes, isnull().sum()=0, save clean CSV

Rule of thumb: If a column has more than 30–40% missing values, consider dropping that column entirely rather than filling it — there's too little data to trust. Document every cleaning decision you make so you can reproduce or reverse it.

🧪 Check Your Understanding — Lesson 3 Quiz

1. You have 1,000 rows and 3 rows have missing "marks" values. What is the BEST strategy?

a) Drop those 3 rows (very small loss of data)

b) Fill all missing values with the number 0

c) Delete the entire "marks" column

d) Fill with median marks from the column

2. Why is it better to fill missing numerical values with the MEDIAN rather than the MEAN when outliers exist?

a) Median is always larger than mean

b) Mean is harder to calculate

c) Median is not affected by extreme values (outliers), so it better represents typical data

d) Median works only for categorical data

3. You have a "city" column with "Delhi", "delhi", "DELHI". What is the correct pandas fix?

a) df['city'] = df['city'].astype(int)

b) df['city'] = df['city'].str.lower().str.strip()

c) df.drop_duplicates()

d) df['city'].fillna('unknown')

4. One-hot encoding is preferred over label encoding for city names because:

a) It uses less memory

b) Label encoding would imply a fake order (Delhi=0 < Mumbai=1) that doesn't really exist

c) One-hot encoding is faster to compute

d) Label encoding does not work on strings

5. A marks column contains a value of 999. This is most likely:

a) A perfectly valid mark

b) A missing-value code used by the data source instead of NaN

c) An ordinal value

d) A binary value

6. Which pandas command removes rows that are exact duplicates?

a) df.dropna()

b) df.reset_index()

c) df.drop_duplicates()

d) df.fillna(0)

7. If 45% of a column's values are missing, the best action is usually:

a) Fill all missing values with 0

b) Drop the entire column — too little data to trust

c) Fill with the maximum value

d) Leave the NaN values as-is for the model to handle

8. After cleaning, how do you verify your data has zero missing values?

a) df.shape

b) df.describe()

c) df.isnull().sum()

d) df.head()

← Lesson 2: Datasets Lesson 4: Build a Classifier →

Cleaning and Preparing Data 🧹

Class 9 Lesson 3 - Cleaning and Preparing Data

🧪 Check Your Understanding — Lesson 3 Quiz