Rohan's sister works at a bank. She told him that the bank uses AI to decide whether to approve or reject a loan application. "It looks at income, job type, age, and past repayments, and decides in seconds," she said. Rohan asked: "But how does it decide?"
She smiled. "It uses a classifier. You've learned about data, you've cleaned data, and today you'll actually build one. Three years ago this took a PhD. Today you can do it in 20 lines of Python." Let's build it.
Classification is teaching an AI to sort examples into categories. Input: a set of features. Output: one category from a list. Examples:
- Email → Spam / Not Spam
- Loan application → Approve / Reject
- Symptom list → Disease A / Disease B / Healthy
- Student data → Pass / Fail
- Image → Cat / Dog / Bird
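Before training any model, you must split your data into two groups: training data, which the model learns from, and test data, which is held back to check how well it learned.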
Imagine a student studies from 4 textbooks (training data). The exam question paper is a brand-new paper the student has never seen (test data). If we tested on the same questions the student studied, every student would score 100%, which tells us nothing about real ability. The train/test split does the same thing: it tests the model on new, unseen data to measure real performance.
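To see why testing on the training data is misleading, here is a minimal sketch in plain Python. The "model" is a hypothetical stand-in that only memorises answers (it is not a real classifier), and the questions are just made-up numbers.

```python
import random

rng = random.Random(42)  # fixed seed so the shuffle is repeatable

# 100 distinct made-up "questions", each with a correct answer (odd or even).
questions = rng.sample(range(1000), 100)
examples = [(q, "even" if q % 2 == 0 else "odd") for q in questions]

# Shuffle, then keep 80% for training and hold back 20% for testing.
rng.shuffle(examples)
split = int(0.8 * len(examples))
train, test = examples[:split], examples[split:]

# A "model" that memorises the training answers and nothing else.
memory = dict(train)

def predict(question):
    # Unseen questions get "don't know", which never matches a real label.
    return memory.get(question, "don't know")

def accuracy(data):
    correct = sum(predict(q) == label for q, label in data)
    return correct / len(data)

train_acc = accuracy(train)
test_acc = accuracy(test)
print(f"Accuracy on training questions: {train_acc:.0%}")
print(f"Accuracy on unseen questions:   {test_acc:.0%}")
```

The memoriser scores 100% on the questions it studied and 0% on new ones. A real classifier generalises far better than this, but only the held-back test data can show how much better.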
A Decision Tree is the easiest classifier to understand. It makes predictions by asking a series of yes/no questions, exactly like a flowchart.
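Here is a minimal sketch of such a flowchart for the loan example, written as nested if/else checks. The questions and thresholds (an income of 30,000 and zero missed repayments) are invented for illustration and are not a real bank's rules; a trained tree would learn its own from data.

```python
def decide_loan(income, missed_repayments):
    """Hand-written decision tree: each `if` is one yes/no question."""
    if income > 30000:                # Question 1: is income above 30,000?
        if missed_repayments == 0:    # Question 2: were all repayments made?
            return "Approve"
        return "Reject"
    return "Reject"

print(decide_loan(income=45000, missed_repayments=0))  # Approve
print(decide_loan(income=45000, missed_repayments=2))  # Reject
print(decide_loan(income=20000, missed_repayments=0))  # Reject
```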
The model learns which questions to ask (which features to check) and what thresholds to use (for example, income > ₹30,000) by finding the splits that best separate the training data into the correct classes.
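As a rough sketch of how a split can be chosen, the snippet below tries every candidate income threshold on a tiny made-up dataset and keeps the one that leaves the two sides least mixed. It scores mixing with Gini impurity, which is scikit-learn's default criterion for decision trees. The data and the helper names (`gini`, `best_threshold`) are assumptions for illustration; a real library searches all features and does it far more efficiently.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels: 0.0 means perfectly pure."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)  # fraction of examples in class 1
    return 1 - p**2 - (1 - p)**2

def best_threshold(values, labels):
    """Return (threshold, score) for the split 'value > threshold'
    with the lowest weighted impurity."""
    best = None
    for t in sorted(set(values)):
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best is None or score < best[1]:
            best = (t, score)
    return best

# Made-up training data: monthly income and whether the loan was repaid (1) or not (0).
incomes = [15000, 22000, 28000, 35000, 42000, 60000]
repaid = [0, 0, 0, 1, 1, 1]

threshold, score = best_threshold(incomes, repaid)
print(threshold, score)  # 28000 0.0
```

A score of 0.0 means the question "income > 28000?" separates this toy data perfectly. On real data no single question does that, so the tree keeps asking further questions on each side.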
Here's the complete workflow. Paste this into Google Colab and run it:
- Import libraries: scikit-learn (sklearn) has everything you need. It's pre-installed in Colab.
- Load a dataset: We'll use the Iris flower dataset, which has 150 rows, 4 features, and 3 flower species. It is a classic ML starter dataset.
- Split into train and test: 80% train, 20% test, done in one line with train_test_split.
- Create and train the model: model.fit(X_train, y_train) is where the model learns from the training data.
- Make predictions and measure accuracy: model.predict(X_test) tries the model on the unseen test data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Step 1: Load dataset
data = load_iris()
X = data.data # Features: sepal length, sepal width, petal length, petal width
y = data.target # Labels: 0=setosa, 1=versicolor, 2=virginica
print(f"Dataset shape: {X.shape}") # (150, 4)
print(f"Classes: {data.target_names}")
# Step 2: Train/Test Split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
print(f"Training rows: {len(X_train)}") # 120
print(f"Testing rows: {len(X_test)}") # 30
# Step 3: Create and train Decision Tree
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train) # This is "training" (learning from data)
# Step 4: Make predictions on test data
y_pred = model.predict(X_test)
# Step 5: Measure accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Test accuracy: {accuracy:.2%}") # Typically 90% or higher on Iris; the exact value depends on the split
# Predict a single new flower
# Features: [sepal_length=5.1, sepal_width=3.5, petal_length=1.4, petal_width=0.2]
new_flower = [[5.1, 3.5, 1.4, 0.2]]
prediction = model.predict(new_flower)
print(f"Predicted class: {data.target_names[prediction[0]]}") # setosa
Now apply the same steps to a more realistic problem. Here's how to classify students as pass/fail in Colab, using a made-up student dataset:
import pandas as pd
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Create a synthetic student dataset (you could swap in a real one, e.g. from data.gov.in)
np.random.seed(42)
n = 200
df = pd.DataFrame({
    'maths_marks': np.random.randint(20, 100, n),
    'science_marks': np.random.randint(25, 100, n),
    'attendance_pct': np.random.randint(50, 100, n),
    'study_hours': np.random.randint(1, 10, n)
})
# Create label: pass if average marks > 50 AND attendance > 70
df['passed'] = ((df[['maths_marks','science_marks']].mean(axis=1) > 50)
                & (df['attendance_pct'] > 70)).astype(int)
X = df.drop('passed', axis=1)
y = df['passed']
# Split, train, predict
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(max_depth=4, random_state=42)
model.fit(X_train, y_train)
print(f"Accuracy: {accuracy_score(y_test, model.predict(X_test)):.2%}")
# Feature importance: which feature mattered most?
importances = pd.Series(model.feature_importances_, index=X.columns)
print("\nFeature Importance:")
print(importances.sort_values(ascending=False))
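To see the yes/no questions a trained tree actually learned, scikit-learn can print them as text with `export_text`. This is a small self-contained sketch on the Iris data rather than the student data; `max_depth=2` is an arbitrary choice to keep the printout short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()

# Trained on all 150 rows because we only want to inspect the rules here,
# not measure accuracy.
model = DecisionTreeClassifier(max_depth=2, random_state=42)
model.fit(data.data, data.target)

# Each indented line is one question (feature <= threshold) or a final class.
rules = export_text(model, feature_names=list(data.feature_names))
print(rules)
```

Reading the printout from top to bottom follows the same path the model takes when it classifies a single flower.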