Divya built a disease detection model. It predicted whether a patient had a rare disease, and it got 99% accuracy! She was thrilled. But her doctor-aunt looked at the numbers carefully. "Wait, how many patients actually had the disease in your test set?" she asked. "Just 1 out of 100," said Divya. Her aunt smiled: "So if the model just always predicts 'no disease', it's 99% accurate. But it misses every single sick person. That's not a good medical AI."
Today we'll learn why accuracy alone is often not enough, and which metrics actually tell you whether your model is useful.
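Divya's situation can be reproduced in a few lines. The sketch below is only an illustration: it builds a made-up test set of 100 patients with exactly one positive case, as in the story, and scores a "model" that always answers 'no disease'. The variable names are assumptions for this example.

```python
# Hypothetical test set from the story: 100 patients, exactly 1 has the disease.
# 1 = disease, 0 = no disease.
y_true = [1] + [0] * 99

# A "model" that ignores its input and always predicts "no disease".
y_pred = [0] * len(y_true)

# Accuracy: fraction of predictions that match the true label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: fraction of truly sick patients that the model caught.
sick_total = sum(t == 1 for t in y_true)
sick_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = sick_caught / sick_total

print(f"Accuracy: {accuracy:.2f}")  # 0.99
print(f"Recall:   {recall:.2f}")    # 0.00
```

Accuracy looks excellent while recall is zero, which is exactly the problem the aunt pointed out.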
When a binary classifier runs on test data, every prediction falls into one of four boxes:
- TP (True Positive): Correctly predicted as positive (the model got it right for positives)
- TN (True Negative): Correctly predicted as negative (the model got it right for negatives)
- FP (False Positive): Predicted positive but actually negative, a "false alarm"
- FN (False Negative): Predicted negative but actually positive, a "missed case" (often dangerous in medical AI!)
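To make the four boxes concrete, here is a minimal sketch that sorts each prediction into one of them. The helper name `confusion_counts` and the sample labels are made up for illustration; in practice a library function such as scikit-learn's `confusion_matrix` does the same job.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, tn, fp, fn) for a binary problem."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # correctly flagged as positive
        elif predicted != positive and actual != positive:
            tn += 1  # correctly left as negative
        elif predicted == positive and actual != positive:
            fp += 1  # false alarm
        else:
            fn += 1  # missed case
    return tp, tn, fp, fn


# Made-up labels, for illustration only (1 = positive, 0 = negative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1
```

Every prediction lands in exactly one box, so the four counts always add up to the size of the test set.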
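Using the numbers from a confusion matrix, we can calculate four metrics that each measure something different. The worked example here uses TP = 45, FP = 10, FN = 5, TN = 40:
Accuracy = (TP + TN) / (TP + FP + FN + TN) = (45+40)/(45+10+5+40) = 85/100 = 85%
Precision = TP / (TP + FP) = 45/(45+10) = 45/55 = 81.8%
Recall = TP / (TP + FN) = 45/(45+5) = 45/50 = 90%
F1 = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.818 × 0.90) / (0.818+0.90) = 85.7%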
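The same arithmetic can be checked in code. This is a small sketch, not a library API: the function name `classification_metrics` is invented here, and returning 0.0 when a denominator is zero is an assumed convention chosen to avoid a division error.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from the four counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Assumed convention: report 0.0 when a denominator would be zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Counts from the worked example: TP = 45, TN = 40, FP = 10, FN = 5.
metrics = classification_metrics(tp=45, tn=40, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name:<10}{value:.1%}")
# accuracy  85.0%
# precision 81.8%
# recall    90.0%
# f1        85.7%
```

Keeping the counts as inputs makes it easy to try other confusion matrices and see how each metric reacts.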
- Medical diagnosis (cancer detection): Use Recall, because missing a sick person (FN) is far worse than a false alarm
- Spam filter: Use Precision, because marking a real email as spam (FP) is worse than letting one spam through
- Fraud detection: Use F1, because both false alarms and missed fraud are costly
- Balanced datasets (50/50 classes): Use Accuracy, which is reliable when both classes appear equally often
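The guidelines above can change which model you would pick. The sketch below compares two entirely hypothetical models, "A" and "B", on the same imagined 100-sample test set. The counts are invented for illustration and do not come from any real experiment.

```python
# Hypothetical confusion-matrix counts for two made-up models.
candidates = {
    "A": {"tp": 48, "fn": 2, "fp": 15, "tn": 35},   # catches almost everything, many false alarms
    "B": {"tp": 40, "fn": 10, "fp": 2, "tn": 48},   # few false alarms, misses more positives
}


def recall(c):
    return c["tp"] / (c["tp"] + c["fn"])


def precision(c):
    return c["tp"] / (c["tp"] + c["fp"])


def best_by(metric):
    """Name of the candidate with the highest value of the given metric."""
    return max(candidates, key=lambda name: metric(candidates[name]))


best_for_diagnosis = best_by(recall)   # missed cases are the costly error
best_for_spam = best_by(precision)     # false alarms are the costly error

print("Best by recall:   ", best_for_diagnosis)  # A
print("Best by precision:", best_for_spam)       # B
```

Neither model is better in general. The right choice depends on which kind of mistake costs more in your application.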
Your model can fail in two opposite ways. The key is finding the right balance:
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, f1_score, confusion_matrix)
import pandas as pd
# Load a mildly imbalanced dataset (breast cancer: 212 malignant, 357 benign)
# Label encoding in this dataset: 0 = malignant, 1 = benign
data = load_breast_cancer()
X, y = data.data, data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = DecisionTreeClassifier(max_depth=5, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
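# All four metrics. In this dataset 0 = malignant and 1 = benign, and
# scikit-learn treats label 1 as the positive class by default, so pass
# pos_label=0 to measure how well the model detects malignant tumors.
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Precision: {precision_score(y_test, y_pred, pos_label=0):.3f}")
print(f"Recall: {recall_score(y_test, y_pred, pos_label=0):.3f}")
print(f"F1 Score: {f1_score(y_test, y_pred, pos_label=0):.3f}")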
# Confusion matrix (rows and columns follow label order: 0 = malignant, 1 = benign)
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(pd.DataFrame(cm,
                   index=['Actual: Malignant', 'Actual: Benign'],
                   columns=['Predicted: Malignant', 'Predicted: Benign']))
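# Overfitting check: a large gap between train and test accuracy suggests the
# tree has memorized the training data (the 0.05 target is a rule of thumb)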
train_acc = accuracy_score(y_train, model.predict(X_train))
test_acc = accuracy_score(y_test, y_pred)
print(f"\nTrain accuracy: {train_acc:.3f}")
print(f"Test accuracy: {test_acc:.3f}")
print(f"Gap: {train_acc - test_acc:.3f} (aim for < 0.05)")