Lesson 4 — Build Your First Classifier | Class 9

Meet Rohan — Class 9, Hyderabad

Rohan's sister works at a bank. She told him that the bank uses AI to decide whether to approve or reject a loan application. "It looks at income, job type, age, and past repayments — and decides in seconds," she said. Rohan asked: "But how does it decide?"

She smiled. "It uses a classifier. You've learned about data, you've cleaned data — now today you'll actually build one. Three years ago this took a PhD. Today you can do it in 20 lines of Python." Let's build it.

The Big Picture

What Is Classification?

Classification is teaching an AI to sort examples into categories. Input: a set of features. Output: one category from a list. Examples:

Email → Spam / Not Spam
Loan application → Approve / Reject
Symptom list → Disease A / Disease B / Healthy
Student data → Pass / Fail
Image → Cat / Dog / Bird

When there are exactly 2 output categories (Pass/Fail, Yes/No) it is called binary classification. When there are 3 or more categories (cats/dogs/birds) it is called multiclass classification.
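
The binary/multiclass rule above can be checked by counting the distinct categories in a list of labels. The sketch below is only an illustration; the function name classification_type is made up for this lesson and is not part of any library.

```python
def classification_type(labels):
    """Return 'binary' for exactly 2 distinct categories, 'multiclass' for 3 or more."""
    n_classes = len(set(labels))  # set() keeps each category only once
    if n_classes < 2:
        raise ValueError("A classifier needs at least 2 categories to choose between")
    return "binary" if n_classes == 2 else "multiclass"

print(classification_type(["Spam", "Not Spam", "Spam"]))          # binary
print(classification_type(["Cat", "Dog", "Bird", "Dog", "Cat"]))  # multiclass
```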

Part 1

Train/Test Split — The Most Important Rule

Before training any model, you must split your data into two groups:

Training Data (80%): The model learns from this. It sees the questions AND the answers.

Test Data (20%): The model is tested here. It only sees the questions, not the answers.

📝 Analogy: The Exam System

Imagine a student studies from 4 textbooks (training data). The exam question paper is a brand-new paper the student has never seen (test data). If we tested on the same questions the student studied, every student would score 100% — that tells us nothing about real ability. The train/test split does the same thing: it tests the model on new, unseen data to measure real performance.

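Never train and test on the same data. If you do, your accuracy will be artificially high, because the model has already seen the answers and may simply have memorised them. Letting test information reach the model during training is a form of data leakage, and it is one of the most common mistakes beginners make.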
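
To see why this rule matters, here is a small sketch that scores the same model twice: once on the rows it trained on and once on rows it never saw. The dataset is synthetic (generated with scikit-learn's make_classification, which is an assumption made for illustration and not part of the lesson's data), and some labels are deliberately randomised so that memorising cannot help on new rows.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration: 300 rows, 5 features.
# flip_y=0.2 assigns about 20% of the labels at random (label noise).
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           n_redundant=0, flip_y=0.2, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# No max_depth: the tree may keep growing until it fits the training rows
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # scored on rows it already saw
test_acc = model.score(X_test, y_test)     # scored on rows it never saw

print(f"Accuracy on training data: {train_acc:.2%}")
print(f"Accuracy on test data:     {test_acc:.2%}")
```

The training score should look far better than the test score. Only the test score says anything about how the model would behave on new data.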

Part 2

Decision Tree: How It Works

A Decision Tree is the easiest classifier to understand. It makes predictions by asking a series of yes/no questions — exactly like a flowchart:

Loan Approval Decision Tree (simplified)

Income > ₹30,000/month?
  YES ✓ → Past defaults?
            No → APPROVE
            Yes → REJECT
  NO ✗ → REJECT
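
The simplified loan tree above is just nested yes/no questions, so it can be written by hand as if/else statements. This is a sketch of that one flowchart only; the function name loan_decision and its arguments are made up for illustration, and a real bank's rules would be far more detailed.

```python
def loan_decision(monthly_income, has_past_defaults):
    """Follow the simplified loan tree: income first, then past defaults."""
    if monthly_income > 30000:      # Question 1: Income > ₹30,000/month?
        if has_past_defaults:       # Question 2: Past defaults?
            return "REJECT"
        return "APPROVE"
    return "REJECT"                 # Income too low: no further questions asked

print(loan_decision(45000, False))  # APPROVE
print(loan_decision(45000, True))   # REJECT
print(loan_decision(20000, False))  # REJECT
```

The difference with machine learning is that nobody types these rules in: the model works out the questions and thresholds from the training data.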

The model learns which questions to ask (which features to check) and what thresholds to use (income > ₹30,000) by finding the splits that best separate the training data into correct classes.
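
One common way to score how well a split separates the classes is Gini impurity, which is the default criterion of scikit-learn's DecisionTreeClassifier. The sketch below is a simplified illustration of the idea, not scikit-learn's actual implementation. The incomes and labels are hypothetical toy values, and the helper names gini and split_impurity are made up for this example.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 means every label is the same; higher means more mixed."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

def split_impurity(incomes, labels, threshold):
    """Weighted Gini impurity of the two groups made by 'income > threshold'."""
    above = [lab for inc, lab in zip(incomes, labels) if inc > threshold]
    below = [lab for inc, lab in zip(incomes, labels) if inc <= threshold]
    total = len(labels)
    return (len(above) / total) * gini(above) + (len(below) / total) * gini(below)

# Hypothetical toy data: monthly income in rupees and the known decision
incomes = [15000, 22000, 28000, 35000, 42000, 60000]
labels = ["REJECT", "REJECT", "REJECT", "APPROVE", "APPROVE", "APPROVE"]

# Try a few candidate thresholds; the lowest score is the cleanest split
for threshold in (20000, 30000, 50000):
    score = split_impurity(incomes, labels, threshold)
    print(f"income > {threshold}: impurity = {score:.3f}")
```

On this toy data the threshold 30000 scores 0.0, because it puts every REJECT on one side and every APPROVE on the other. A tree-building algorithm tries many such candidate splits and keeps the best one.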

Part 3

Building It in Python (Step by Step)

Here's the complete workflow — paste this into Google Colab and run it:

Import libraries: scikit-learn (sklearn) has everything you need. It's pre-installed in Colab.
Load a dataset: We'll use the Iris flower dataset (150 rows, 4 features, 3 flower species), a classic ML starter dataset.
Split into train and test: 80% train, 20% test, done in one line with train_test_split.
Create and train the model: model.fit(X_train, y_train) is where the model learns from the training data.
Make predictions and measure accuracy: model.predict(X_test) tries the model on the unseen test data.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Step 1: Load dataset
data = load_iris()
X = data.data      # Features: sepal length, sepal width, petal length, petal width
y = data.target    # Labels: 0=setosa, 1=versicolor, 2=virginica

print(f"Dataset shape: {X.shape}")   # (150, 4)
print(f"Classes: {data.target_names}")

# Step 2: Train/Test Split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(f"Training rows: {len(X_train)}")   # 120
print(f"Testing rows: {len(X_test)}")     # 30

# Step 3: Create and train Decision Tree
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)   # ← This is "training" (learning from data)

# Step 4: Make predictions on test data
y_pred = model.predict(X_test)

# Step 5: Measure accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2%}")   # Typically 90% or higher on Iris; the exact value depends on the split

# Predict a single new flower
# Features: [sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2]
new_flower = [[5.1, 3.5, 1.4, 0.2]]
prediction = model.predict(new_flower)
print(f"Predicted class: {data.target_names[prediction[0]]}")   # setosa
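
To look at the questions a trained tree actually asks, scikit-learn provides export_text, which prints the learned rules as indented text. The sketch below retrains a small tree on the Iris training split; max_depth=2 is chosen here only to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

# A shallow tree so the printed rules stay easy to read
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(X_train, y_train)

# Each indented line is one yes/no question; "class" lines are the final answers
rules = export_text(model, feature_names=list(data.feature_names))
print(rules)
```

The printout has the same shape as the loan flowchart: a question with a threshold, then either another question or a final class.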

Try changing max_depth! With max_depth=1 the tree is very simple and might underfit; with max_depth=10 it is very complex and might overfit. We'll learn exactly what these terms mean in Lesson 5.
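
One way to run that experiment is to train several trees in a loop and compare training and test accuracy side by side. This is a sketch using the same Iris split as above; the list of depths is an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

results = {}  # depth -> (training accuracy, test accuracy)
for depth in (1, 2, 3, 5, 10):
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    results[depth] = (model.score(X_train, y_train),
                      model.score(X_test, y_test))

for depth, (train_acc, test_acc) in results.items():
    print(f"max_depth={depth:>2}  train={train_acc:.2%}  test={test_acc:.2%}")
```

A tree with max_depth=1 has only two leaves, so it can never predict all three species. That is why its accuracy is noticeably lower than the deeper trees.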

Part 4

Using Your Own Indian Dataset

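Now apply the same steps to a pass/fail problem. The code below generates a synthetic student dataset with random values so that it runs anywhere in Colab. To work with real data, load your own CSV into df instead (for example, one downloaded from data.gov.in):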

import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Create a synthetic student dataset with random values
# (to use real data, load a CSV instead, e.g. one from data.gov.in)
np.random.seed(42)
n = 200
df = pd.DataFrame({
    'maths_marks': np.random.randint(20, 100, n),
    'science_marks': np.random.randint(25, 100, n),
    'attendance_pct': np.random.randint(50, 100, n),
    'study_hours': np.random.randint(1, 10, n)
})
# Create label: pass if average marks > 50 AND attendance > 70
df['passed'] = ((df[['maths_marks','science_marks']].mean(axis=1) > 50) 
                 & (df['attendance_pct'] > 70)).astype(int)

X = df.drop('passed', axis=1)
y = df['passed']

# Split, train, predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)

print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")

# Feature importance: which feature mattered most?
importances = pd.Series(model.feature_importances_, index=X.columns)
print("\nFeature Importance:")
print(importances.sort_values(ascending=False))

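Feature importance tells you which columns the model relied on most. The values add up to 1. If "maths_marks" has an importance of 0.72, then splits on that column account for about 72% of the tree's total improvement in separating the classes (this is not the same as 72% of individual decisions). In the dataset above, study_hours plays no part in the pass/fail label, so its importance should come out low. This is useful both for understanding the model and for improving your dataset.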
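
A quick way to get a feel for these numbers is to print them for a tree and check their total. The sketch below uses the Iris dataset from Part 3 and fits on all rows, which is fine here because the goal is to inspect the tree, not to measure accuracy.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(data.data, data.target)

# Pair each feature name with its importance, largest first
importances = sorted(zip(data.feature_names, model.feature_importances_),
                     key=lambda pair: pair[1], reverse=True)

for name, value in importances:
    print(f"{name:<20} {value:.3f}")

total = sum(value for _, value in importances)
print(f"Total: {total:.3f}")  # scikit-learn scales the values to add up to 1
```

A feature with importance 0 was never used in any split of this particular tree. That does not prove the feature is useless in general.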

🧪 Check Your Understanding — Lesson 4 Quiz

1. Why do we split data into training and test sets?

a) To reduce the size of the dataset

b) To measure how well the model performs on data it has never seen before

c) Because scikit-learn requires it

d) To remove outliers from the data

2. What does model.fit(X_train, y_train) actually do?

a) Tests the model on unseen data

b) Loads the dataset from a file

c) Trains the model — adjusts its internal parameters using the training data

d) Calculates the accuracy score

3. If you train AND test on the same data, what problem occurs?

a) The model trains too slowly

b) You get artificially high accuracy (data leakage) — the model just memorised the answers

c) scikit-learn throws an error

d) The dataset becomes corrupted

4. A Decision Tree classifier predicts which flower species (setosa/versicolor/virginica) from 4 measurements. This is:

a) Binary classification

b) Regression

c) Multiclass classification

d) Clustering

5. The random_state=42 parameter in train_test_split ensures:

a) Exactly 42% of data goes to the test set

b) The split is random every time you run the code

c) The same random split is produced every time, making results reproducible

d) 42 rows are always in the test set

6. Feature importance in a Decision Tree tells you:

a) How many rows each feature appears in

b) Which features the model relied on most when making decisions

c) Whether features have missing values

d) The average value of each feature

7. Accuracy of a classifier is calculated as:

a) Number of training rows ÷ total rows

b) Number of features × number of rows

c) Correct predictions ÷ total test predictions

d) Number of wrong predictions ÷ number of features

8. A Decision Tree with max_depth=1 asks only one question. This likely:

a) Overfits the data perfectly

b) Is too simple and will underfit (miss important patterns)

c) Works better than deeper trees for complex data

d) Causes a Python error

