Lesson 5 — Model Evaluation | Class 9

Meet Divya — Class 9, Coimbatore

Divya built a disease detection model. It predicted whether a patient had a rare disease — and got 99% accuracy! She was thrilled. But her doctor-aunt looked at the numbers carefully. "Wait — how many patients actually had the disease in your test set?" she asked. "Just 1 out of 100," said Divya. Her aunt smiled: "So if the model just always predicts 'no disease' — it's 99% accurate. But it misses every single sick person. That's not a good medical AI."

Today we'll learn why accuracy alone is often not enough — and what metrics actually tell you whether your model is useful.
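Divya's aunt's point can be checked with a few lines of plain Python. This is a minimal sketch, assuming a made-up test set of 100 patients with exactly 1 sick patient (as in the story) and a "lazy" model that always answers "no disease". The variable names are only for illustration.

```python
# Divya's test set: 100 patients, exactly 1 has the disease (1 = sick, 0 = healthy).
actual = [1] + [0] * 99

# A "lazy" model that always predicts "no disease" (assumed here for illustration).
predicted = [0] * 100

# Accuracy: fraction of predictions that match the truth.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

# How many of the truly sick patients did the model find?
sick_total = sum(actual)
sick_found = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(f"Accuracy: {accuracy:.0%}")                           # Accuracy: 99%
print(f"Sick patients found: {sick_found} of {sick_total}")  # Sick patients found: 0 of 1
```

The accuracy looks excellent, yet the only sick patient is missed. That is the gap the rest of this lesson fills.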

Part 1

The Confusion Matrix — Four Types of Results

When a binary classifier runs on test data, every prediction falls into one of four boxes:

                  | Predicted: Positive                               | Predicted: Negative
Actual: Positive  | True Positive (TP): sick patient, detected sick ✓ | False Negative (FN): sick patient, missed! ✗
Actual: Negative  | False Positive (FP): healthy, called sick ✗       | True Negative (TN): healthy, detected healthy ✓

TP (True Positive): Correctly predicted as positive — the model got it right for positives
TN (True Negative): Correctly predicted as negative — the model got it right for negatives
FP (False Positive): Predicted positive but actually negative — "false alarm"
FN (False Negative): Predicted negative but actually positive — "missed case" (often dangerous in medical AI!)
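The four boxes can be counted by comparing each prediction with the true answer. Here is a small sketch in plain Python. The helper name `confusion_counts` and the eight sample labels are made up for this illustration (1 = positive, 0 = negative).

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN by comparing each prediction with the truth."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1   # sick, detected sick
        elif a != positive and p == positive:
            fp += 1   # healthy, called sick (false alarm)
        elif a == positive and p != positive:
            fn += 1   # sick, missed
        else:
            tn += 1   # healthy, detected healthy
    return tp, fp, fn, tn


# Hypothetical labels for 8 patients.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}")   # TP=3  FP=1  FN=1  TN=3
```

Every prediction lands in exactly one box, so the four counts always add up to the number of test examples.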

Part 2

Four Evaluation Metrics

Using the confusion matrix numbers, we can calculate four metrics that each measure something different:

✅ Accuracy = (TP + TN) / (TP + TN + FP + FN)
Of all predictions, how many were correct? Good when classes are balanced. Misleading for imbalanced datasets (like Divya's disease problem).

🎯 Precision = TP / (TP + FP)
Of all positive predictions, how many were actually positive? Use when false alarms are costly.

🔍 Recall (Sensitivity) = TP / (TP + FN)
Of all actual positives, how many did we find? Use when missing a case is costly (medical, fraud).

⚖️ F1 Score = 2 × (P × R) / (P + R), where P is precision and R is recall
Harmonic mean of precision and recall. Usually a better single-number summary than accuracy for imbalanced datasets.

Worked example: suppose a model's test results are TP=45, FP=10, FN=5, TN=40:
Accuracy = (45+40)/(45+10+5+40) = 85/100 = 85%
Precision = 45/(45+10) = 45/55 = 81.8%
Recall = 45/(45+5) = 45/50 = 90%
F1 = 2 × (0.818 × 0.90) / (0.818+0.90) = 85.7%
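The same arithmetic can be sketched as a small Python function, using the counts TP=45, FP=10, FN=5, TN=40. The function name `metrics_from_counts` is made up for this illustration. Returning 0.0 when a denominator is zero is an assumption of this sketch, not part of the formulas.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumption of this sketch: report 0.0 instead of dividing by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


accuracy, precision, recall, f1 = metrics_from_counts(tp=45, fp=10, fn=5, tn=40)
print(f"Accuracy:  {accuracy:.1%}")    # Accuracy:  85.0%
print(f"Precision: {precision:.1%}")   # Precision: 81.8%
print(f"Recall:    {recall:.1%}")      # Recall:    90.0%
print(f"F1 Score:  {f1:.1%}")          # F1 Score:  85.7%
```

Try the always-"no disease" model from Divya's story: `metrics_from_counts(tp=0, fp=0, fn=1, tn=99)` gives 99% accuracy but a recall of 0.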

Part 3

Which Metric Should You Use?

Medical diagnosis (cancer detection): Use Recall — missing a sick person (FN) is far worse than a false alarm
Spam filter: Use Precision — marking a real email as spam (FP) is worse than letting one spam through
Fraud detection: Use F1 — both false alarms and missed fraud are costly
Balanced datasets (50/50 classes): Use Accuracy — it's reliable when both classes appear equally

Rule of thumb: If one class is rare (less than 20% of data), don't trust accuracy alone. Use Recall if missing rare cases is dangerous. Use Precision if false alarms are expensive. Use F1 if both matter.
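The rule of thumb can be written as a tiny decision helper. This is only a sketch: the function `suggest_metric` and its parameters are hypothetical, the 20% cut-off comes from the rule above, and falling back to F1 when a class is rare but no cost is given is an assumption of this sketch.

```python
def suggest_metric(rare_class_fraction, missing_is_costly, false_alarm_is_costly):
    """Suggest a metric to focus on, following the lesson's rule of thumb."""
    if rare_class_fraction >= 0.20:
        return "Accuracy"    # no class is rare, so accuracy is a fair summary
    if missing_is_costly and false_alarm_is_costly:
        return "F1"          # both kinds of mistake matter
    if missing_is_costly:
        return "Recall"      # missed cases (FN) are the dangerous ones
    if false_alarm_is_costly:
        return "Precision"   # false alarms (FP) are the expensive ones
    return "F1"              # assumption: a safe default when a class is rare


print(suggest_metric(0.01, missing_is_costly=True, false_alarm_is_costly=False))   # Recall
print(suggest_metric(0.10, missing_is_costly=False, false_alarm_is_costly=True))   # Precision
print(suggest_metric(0.01, missing_is_costly=True, false_alarm_is_costly=True))    # F1
print(suggest_metric(0.50, missing_is_costly=False, false_alarm_is_costly=False))  # Accuracy
```

Real projects usually report several metrics side by side. A helper like this only tells you which one to look at first.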

Part 4

Overfitting vs Underfitting

Your model can fail in two opposite ways. The key is finding the right balance:

😕 Underfitting

Train accuracy: 60%

Test accuracy: 58%

Model is too simple. Missed important patterns. Fix: deeper tree, more features, more epochs.

✅ Just Right

Train accuracy: 92%

Test accuracy: 89%

Model learned real patterns, generalises well to new data. This is the goal.

⚠️ Overfitting

Train accuracy: 99%

Test accuracy: 63%

Model memorised training data but fails on new data. Fix: simpler model, more data, regularisation.

How to detect overfitting: training accuracy is much higher than test accuracy (a gap of more than roughly 10 to 15 percentage points). This is why we always track both scores, not just the training score.
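The three cases can be told apart with a simple check on the two accuracy scores. This is a rough sketch: the function `diagnose_fit` is hypothetical, the 0.10 gap limit is taken from the lower end of the guideline above, and treating a training accuracy below 0.70 as "low" is an assumption made only for this example.

```python
def diagnose_fit(train_acc, test_acc, gap_limit=0.10, low_score=0.70):
    """Roughly label a model as overfitting, underfitting, or reasonable."""
    gap = train_acc - test_acc
    if gap > gap_limit:
        return "overfitting"       # great on training data, poor on new data
    if train_acc < low_score:
        return "underfitting"      # poor even on the data it trained on
    return "looks reasonable"


# The three examples from this lesson.
print(diagnose_fit(0.60, 0.58))   # underfitting
print(diagnose_fit(0.92, 0.89))   # looks reasonable
print(diagnose_fit(0.99, 0.63))   # overfitting
```

The right thresholds depend on the problem, so treat the labels as a first hint rather than a verdict.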

Try It in Colab

Calculate All Metrics in Python

from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                              recall_score, f1_score, confusion_matrix)
import pandas as pd

# Load a mildly imbalanced dataset (breast cancer: 212 malignant, 357 benign)
# Note: in this dataset, label 0 = malignant and label 1 = benign
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

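# All four metrics. By default scikit-learn treats label 1 (benign) as "positive",
# so pass pos_label=0 to score the malignant class, the one we must not miss.
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, pos_label=0):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred, pos_label=0):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred, pos_label=0):.3f}")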

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(pd.DataFrame(cm,
    index=['Actual: Malignant', 'Actual: Benign'],
    columns=['Predicted: Malignant', 'Predicted: Benign']))

# Overfitting check
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, y_pred)
print(f"\nTrain accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
print(f"Gap: {train_acc - test_acc:.3f}  (aim for < 0.05)")
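To see the train/test gap change, one option is to train the same kind of tree at several depths and compare the scores. The sketch below reuses the dataset and split settings from the code above. The list of depths is an arbitrary choice for illustration, and the exact numbers you get may vary with your scikit-learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)

results = []
# max_depth=None lets the tree keep splitting until its leaves are pure
# (or too small to split), which makes memorising the training data easy.
for depth in [1, 3, 5, None]:
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    results.append((depth, train_acc, test_acc))
    print(f"max_depth={depth}: train={train_acc:.3f}  "
          f"test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```

A very shallow tree may score lower on both sets, while an unlimited tree typically scores close to perfect on the training set. Watch how the gap behaves in between.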

🧪 Check Your Understanding — Lesson 5 Quiz

1. A model detects rare fraud (1% of transactions are fraud). It predicts "not fraud" for everything. Accuracy = 99%. Is this a good model?

a) Yes — 99% is excellent

b) No — it misses every single fraud case (Recall = 0). Accuracy is misleading here.

c) It depends on the dataset size

d) Yes, if the business doesn't mind missing some fraud

2. False Negatives (FN) are when the model:

a) Correctly identifies a negative case

b) Predicts positive when the actual answer is negative (false alarm)

c) Predicts negative when the actual answer is positive (missed case)

d) Makes no prediction at all

3. For a cancer screening AI, which metric should you maximise?

a) Accuracy

b) Precision (avoid false alarms)

c) Recall (catch every sick patient, even if there are false alarms)

d) Training speed

4. Your model has Train Accuracy = 98% and Test Accuracy = 64%. This is:

a) Underfitting — the model is too simple

b) Perfect — high train accuracy is the goal

c) Overfitting — the model memorised training data but fails on new data

d) Normal — test accuracy is always lower

5. F1 Score is most useful when:

a) You have perfectly balanced classes (50% / 50%)

b) You have imbalanced classes and both false positives and false negatives are costly

c) Recall doesn't matter in your problem

d) You need a simple metric students can calculate mentally

6. Precision = TP / (TP + FP) measures:

a) Of all actual positives, how many did we correctly catch?

b) Of all predictions of positive, how many were actually positive?

c) The total number of correct predictions

d) The difference between train and test accuracy

7. Underfitting occurs when:

a) Training accuracy is much higher than test accuracy

b) Both training and test accuracy are low — the model is too simple to learn real patterns

c) The confusion matrix has many True Positives

d) F1 score equals accuracy

8. For a spam filter, you prefer high Precision because:

a) Catching all spam is more important than false alarms

b) Missing spam is the most dangerous outcome

c) Incorrectly marking an important email as spam (FP) is worse than letting one spam through (FN)

d) Precision is always better than Recall for text data

