Model Evaluation: Is Your AI Good?

Class 9 | Age 13–14 | Lesson 5 of 12 | 🆓 Free

Meet Divya (Class 9, Coimbatore)

Divya built a disease detection model. It predicted whether a patient had a rare disease, and it got 99% accuracy! She was thrilled. But her doctor-aunt looked at the numbers carefully. "Wait, how many patients actually had the disease in your test set?" she asked. "Just 1 out of 100," said Divya. Her aunt smiled: "So if the model just always predicts 'no disease', it's 99% accurate. But it misses every single sick person. That's not a good medical AI."

Today we'll learn why accuracy alone is often not enough, and which metrics actually tell you whether your model is useful.
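
Divya's situation can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming a made-up test set of 100 patients with exactly 1 sick patient (as in the story) and a "lazy" model that always answers "no disease". The variable names are illustrative only.

```python
# Divya's test set: 100 patients, only 1 actually has the disease.
# Labels: 1 = sick, 0 = healthy (made-up data that matches the story).
actual = [1] + [0] * 99

# A "lazy" model that ignores its input and always predicts "no disease".
predicted = [0] * len(actual)

# Accuracy: the fraction of predictions that match the true label.
correct = sum(1 for a, p in zip(actual, predicted) if a == p)
accuracy = correct / len(actual)

# Of the patients who are really sick, how many did the model flag?
sick_total = sum(actual)
sick_found = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(f"Accuracy: {accuracy:.0%}")
print(f"Sick patients found: {sick_found} of {sick_total}")
```

The accuracy looks excellent, yet the model finds none of the sick patients, which is exactly the problem Divya's aunt pointed out.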

Part 1
The Confusion Matrix: Four Types of Results

When a binary classifier runs on test data, every prediction falls into one of four boxes:

↕ Actual         | Predicted: Positive             | Predicted: Negative
Actual: Positive | True Positive (TP): 45          | False Negative (FN): 5
                 | Sick patient, detected sick ✓   | Sick patient, missed! ✗
Actual: Negative | False Positive (FP): 10         | True Negative (TN): 40
                 | Healthy, called sick ✗          | Healthy, detected healthy ✓
Part 2
Four Evaluation Metrics

Using the confusion matrix numbers, we can calculate four metrics that each measure something different:

✅ Accuracy = (TP + TN) / (TP + TN + FP + FN)
Good when classes are balanced. Misleading for imbalanced datasets (like Divya's disease problem).
🎯 Precision = TP / (TP + FP)
Of all positive predictions, how many were actually positive? Use when false alarms are costly.
🔍 Recall (Sensitivity) = TP / (TP + FN)
Of all actual positives, how many did we find? Use when missing a case is costly (medical, fraud).
⚖️ F1 Score = 2 × (P × R) / (P + R), where P is precision and R is recall
Harmonic mean of precision and recall. A useful single number for imbalanced datasets when both false alarms and missed cases matter.
Using the example above (TP=45, FP=10, FN=5, TN=40):
Accuracy = (45+40)/(45+10+5+40) = 85/100 = 85%
Precision = 45/(45+10) = 45/55 = 81.8%
Recall = 45/(45+5) = 45/50 = 90%
F1 = 2 × (0.818 × 0.90) / (0.818 + 0.90) = 85.7%
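
The same arithmetic can be checked with a short function. This is a sketch rather than a library API: the name `evaluate` is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention chosen here to avoid a division error.

```python
def evaluate(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Assumption: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts from the worked example: TP=45, FP=10, FN=5, TN=40.
scores = evaluate(tp=45, fp=10, fn=5, tn=40)
for name, value in scores.items():
    print(f"{name:<10} {value:.1%}")

# Divya's "always predict no disease" model on 100 patients with 1 sick.
lazy = evaluate(tp=0, fp=0, fn=1, tn=99)
print(f"Lazy model: accuracy {lazy['accuracy']:.0%}, recall {lazy['recall']:.0%}")
```

The first call should reproduce the 85%, 81.8%, 90% and 85.7% worked out by hand, and the second shows high accuracy sitting next to a recall of zero.
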
Part 3
Which Metric Should You Use?
Rule of thumb: If one class is rare (less than 20% of data), don't trust accuracy alone. Use Recall if missing rare cases is dangerous. Use Precision if false alarms are expensive. Use F1 if both matter.
Part 4
Overfitting vs Underfitting

Your model can fail in two opposite ways. The key is finding the right balance:

😕 Underfitting
Train accuracy: 60%
Test accuracy: 58%
Model is too simple. Missed important patterns. Fix: deeper tree, more features, more epochs.
✅ Just Right
Train accuracy: 92%
Test accuracy: 89%
Model learned real patterns, generalises well to new data. This is the goal.
⚠️ Overfitting
Train accuracy: 99%
Test accuracy: 63%
Model memorised training data but fails on new data. Fix: simpler model, more data, regularisation.
How to detect overfitting: Training accuracy is much higher than test accuracy (a gap of more than about 10–15 percentage points). This is why we always track BOTH metrics, not just the training score.
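
The three cases above can be turned into a rough check. This is only a sketch: the function name `diagnose_fit` is hypothetical, `max_gap=0.10` takes the lower end of the 10–15 point rule, and `min_acc=0.70` is an assumed cut-off for "low accuracy" that the lesson does not specify. Sensible thresholds depend on the problem.

```python
def diagnose_fit(train_acc, test_acc, max_gap=0.10, min_acc=0.70):
    """Label a model from its train and test accuracy (values between 0 and 1)."""
    gap = train_acc - test_acc
    if gap > max_gap:
        # Much better on training data than on new data: memorised, not learned.
        return "overfitting"
    if train_acc < min_acc:
        # Low score even on the data it was trained on: model is too simple.
        return "underfitting"
    return "just right"


# The three examples from this part of the lesson.
for train_acc, test_acc in [(0.60, 0.58), (0.92, 0.89), (0.99, 0.63)]:
    label = diagnose_fit(train_acc, test_acc)
    print(f"train={train_acc:.0%}  test={test_acc:.0%}  ->  {label}")
```
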
Try It in Colab
Calculate All Metrics in Python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                              recall_score, f1_score, confusion_matrix)
import pandas as pd

# Load a mildly imbalanced dataset (breast cancer: 212 malignant, 357 benign)
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

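# All four metrics. scikit-learn labels this dataset 0 = malignant, 1 = benign,
# so pos_label=0 makes "malignant" the positive class, as in the lesson's examples.
print(f"Accuracy:  {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, pos_label=0):.3f}")
print(f"Recall:    {recall_score(y_test, y_pred, pos_label=0):.3f}")
print(f"F1 Score:  {f1_score(y_test, y_pred, pos_label=0):.3f}")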

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(pd.DataFrame(cm,
    index=['Actual: Malignant', 'Actual: Benign'],
    columns=['Predicted: Malignant', 'Predicted: Benign']))

# Overfitting check
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, y_pred)
print(f"\nTrain accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
print(f"Gap: {train_acc - test_acc:.3f}  (aim for < 0.05)")
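
As a further experiment, one possible sketch varies max_depth to see how tree depth affects the train/test gap described in Part 4. It uses the same dataset and split settings as the listing above. The three depth values and the `results` dictionary are illustrative choices, not part of the lesson, and on this dataset the differences may turn out to be small.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Maps each max_depth setting to (train accuracy, test accuracy).
results = {}
for depth in (1, 5, None):  # None lets the tree keep growing with no depth limit
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_test, model.predict(X_test))
    results[depth] = (train_acc, test_acc)
    print(f"max_depth={depth}: train={train_acc:.3f}  "
          f"test={test_acc:.3f}  gap={train_acc - test_acc:.3f}")
```

Compare the gap for each depth: a very shallow tree is the kind of model that may underfit, while an unlimited tree is the kind that may overfit.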

🧪 Check Your Understanding: Lesson 5 Quiz

1. A model detects rare fraud (1% of transactions are fraud). It predicts "not fraud" for everything. Accuracy = 99%. Is this a good model?
a) Yes, 99% is excellent
b) No, it misses every single fraud case (Recall = 0). Accuracy is misleading here.
c) It depends on the dataset size
d) Yes, if the business doesn't mind missing some fraud
2. False Negatives (FN) are when the model:
a) Correctly identifies a negative case
b) Predicts positive when the actual answer is negative (false alarm)
c) Predicts negative when the actual answer is positive (missed case)
d) Makes no prediction at all
3. For a cancer screening AI, which metric should you maximise?
a) Accuracy
b) Precision (avoid false alarms)
c) Recall (catch every sick patient, even if there are false alarms)
d) Training speed
4. Your model has Train Accuracy = 98% and Test Accuracy = 64%. This is:
a) Underfitting: the model is too simple
b) Perfect: high train accuracy is the goal
c) Overfitting: the model memorised training data but fails on new data
d) Normal: test accuracy is always lower
5. F1 Score is most useful when:
a) You have perfectly balanced classes (50% / 50%)
b) You have imbalanced classes and both false positives and false negatives are costly
c) Recall doesn't matter in your problem
d) You need a simple metric students can calculate mentally
6. Precision = TP / (TP + FP) measures:
a) Of all actual positives, how many did we correctly catch?
b) Of all predictions of positive, how many were actually positive?
c) The total number of correct predictions
d) The difference between train and test accuracy
7. Underfitting occurs when:
a) Training accuracy is much higher than test accuracy
b) Both training and test accuracy are low because the model is too simple to learn real patterns
c) The confusion matrix has many True Positives
d) F1 score equals accuracy
8. For a spam filter, you prefer high Precision because:
a) Catching all spam is more important than false alarms
b) Missing spam is the most dangerous outcome
c) Incorrectly marking an important email as spam (FP) is worse than letting one spam through (FN)
d) Precision is always better than Recall for text data