Lesson 6 — Production ML: Monitoring and Drift | Class 11

Story

Preethi's Silent Degradation

Preethi, 16, from Coimbatore, had built a loan approval model for a class project: a gradient-boosted classifier trained on 2022 credit data, with 91% accuracy on the test set. She "deployed" it (as a FastAPI app on a free-tier cloud service) and thought she was done.

Three months later, a friend pointed out the model was approving nearly everyone. Accuracy had quietly dropped to 61% — just above random chance. The problem? India's credit environment had shifted. New income patterns post-COVID, new UPI transaction behaviours, new borrower demographics. The 2022 data no longer reflected 2024 reality.

"This is data drift," her teacher explained. "Your model was correct when deployed. The world changed. Every production model needs a monitoring system that alerts you before users notice." Preethi spent a day setting up Evidently AI. Her next model has never gone stale.

Section 1

Why Models Degrade: Types of Drift

Drift Type	What Changes	Loan Model Example	Detection Method
Data Drift	Input feature distribution P(X) changes	Borrower income distribution shifts post-COVID — typical applicant income was ₹30k/month in 2022, now ₹45k/month	PSI, KS test, chi-squared
Concept Drift	Relationship P(Y|X) changes: same features, different correct output	A CIBIL score of 750 strongly predicted repayment in 2022, but borrowers with the same score now default more often because economic conditions changed	Prediction accuracy monitoring
Label Drift	Output label distribution P(Y) changes	Default rate in the population changed from 8% to 14%	Monitor prediction distribution
Upstream Drift	A data pipeline changes, altering feature values	Bureau changed how they calculate CIBIL score	Feature statistics monitoring
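
To make the first two rows concrete, here is a small synthetic sketch. The ₹30k and ₹45k income means come from the table; everything else (the spread of incomes, the default probabilities, the function names) is invented for illustration and is not real credit data. It shows data drift as a change in the inputs alone, and concept drift as a change in the outcome for the very same inputs.

```python
import random
import statistics

random.seed(0)

def default_prob_2022(income):
    # Hypothetical rule: lower income means higher default risk
    return 0.30 if income < 30_000 else 0.05

def default_prob_2024(income):
    # Concept drift: the same income now carries more risk (made-up numbers)
    return 0.45 if income < 30_000 else 0.15

def sample_incomes(mean, n=5_000):
    # Simulated monthly incomes, floored at 5,000 to avoid negative values
    return [max(5_000, random.gauss(mean, 8_000)) for _ in range(n)]

def default_rate(incomes, prob_fn):
    # Simulate one default/no-default outcome per applicant
    return statistics.mean(1 if random.random() < prob_fn(x) else 0 for x in incomes)

incomes_2022 = sample_incomes(30_000)
incomes_2024 = sample_incomes(45_000)   # data drift: P(X) has moved

# Data drift: the inputs changed
print("mean income 2022:", round(statistics.mean(incomes_2022)))
print("mean income 2024:", round(statistics.mean(incomes_2024)))

# Concept drift: identical inputs, but P(Y|X) changed
rate_old_rule = default_rate(incomes_2022, default_prob_2022)
rate_new_rule = default_rate(incomes_2022, default_prob_2024)
print(f"default rate under 2022 rule: {rate_old_rule:.1%}")
print(f"default rate under 2024 rule: {rate_new_rule:.1%}")
```

Note that a feature-distribution check would catch the first change but not the second, which is why the table lists accuracy monitoring as the detection method for concept drift.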

Section 2

Evidently AI: Automated Drift Reports

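# pip install evidently pandas scikit-learn
# Note: the imports below follow Evidently's Report/TestSuite API (0.4.x releases).
# Newer releases reorganized these modules, so check the docs for your installed version.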

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset, ClassificationPreset

# ── Load reference (training) data and current (production) data ──
reference = pd.read_csv("train_data_2022.csv")     # 10,000 rows
current   = pd.read_csv("production_data_2024.csv") # last 30 days

# ── 1. Data Drift Report ──────────────────────────────────────────
data_drift_report = Report(metrics=[DataDriftPreset()])
data_drift_report.run(reference_data=reference, current_data=current)
data_drift_report.save_html("drift_report.html")   # open in browser

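# ── 2. Model Performance Report (needs ground truth labels) ───────
# For loan approvals, labels arrive only after the default window (30 days or more).
# Assumption: both frames carry the ground-truth label in an 'approved' column.
from evidently import ColumnMapping

column_mapping = ColumnMapping(target='approved', prediction='prediction')

perf_report = Report(metrics=[ClassificationPreset()])
perf_report.run(
    # Placeholder: ideally use the model's stored predictions on the reference set
    reference_data=reference.assign(prediction=reference['approved']),
    current_data=current.assign(prediction=current['model_output']),
    column_mapping=column_mapping,
)
perf_report.save_html("performance_report.html")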

# ── 3. Programmatic drift detection for alerting ─────────────────
from evidently.test_suite import TestSuite
from evidently.tests import TestNumberOfDriftedColumns

suite = TestSuite(tests=[
    TestNumberOfDriftedColumns(lt=3),   # passes only if fewer than 3 columns drift
])
suite.run(reference_data=reference, current_data=current)
result = suite.as_dict()

if not result['summary']['all_passed']:
    print("ALERT: Significant data drift detected!")
    # Print the whole summary rather than relying on a specific sub-key
    print(result['summary'])

Section 3

Statistical Tests for Drift Detection

Population Stability Index (PSI) — for continuous features

PSI = Σ (P_current - P_reference) * ln(P_current / P_reference)

PSI < 0.10: No significant drift
PSI 0.10–0.25: Moderate drift, investigate
PSI > 0.25: Significant drift, retrain required
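
As a worked instance of the formula, the sketch below computes PSI by hand for three income bins. The bin shares are hypothetical numbers chosen for easy arithmetic; they are not taken from Preethi's data.

```python
import math

# Hypothetical share of applicants per income bin (each list sums to 1)
ref_share = [0.50, 0.30, 0.20]   # training data
cur_share = [0.30, 0.30, 0.40]   # production data

# One term per bin: (P_current - P_reference) * ln(P_current / P_reference)
terms = [(c - r) * math.log(c / r) for r, c in zip(ref_share, cur_share)]
psi_value = sum(terms)

for i, t in enumerate(terms):
    print(f"bin {i}: contribution {t:.4f}")
print(f"PSI = {psi_value:.4f}")   # about 0.2408
```

The middle bin contributes nothing because its share did not change. The total lands in the moderate band, just under the 0.25 retraining threshold.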

import numpy as np
from scipy import stats

def psi(reference, current, n_bins=10):
    """Population Stability Index for continuous features."""
    bins = np.percentile(reference, np.linspace(0, 100, n_bins + 1))
    # Open-ended outer bins so production values outside the training range are still counted
    bins[0] = -np.inf
    bins[-1] = np.inf

    ref_counts, _ = np.histogram(reference, bins=bins)
    cur_counts, _ = np.histogram(current, bins=bins)

    ref_pct = ref_counts / len(reference)
    cur_pct = cur_counts / len(current)

    # Avoid division by zero / log(0)
    ref_pct = np.where(ref_pct == 0, 1e-6, ref_pct)
    cur_pct = np.where(cur_pct == 0, 1e-6, cur_pct)

    return np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct))

# ── Kolmogorov-Smirnov test — for continuous features ────────────
def ks_drift_test(reference, current, significance=0.05):
    """KS test: p-value < significance → drift detected."""
    stat, p_value = stats.ks_2samp(reference, current)
    drift = p_value < significance
    return {"ks_statistic": stat, "p_value": p_value, "drift": drift}

# ── Chi-squared test — for categorical features ───────────────────
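def chi2_drift_test(reference_counts, current_counts, significance=0.05):
    """Chi-squared test for categorical column drift (counts in the same category order)."""
    # Convert to arrays so plain Python lists work too
    reference_counts = np.asarray(reference_counts, dtype=float)
    current_counts = np.asarray(current_counts, dtype=float)
    # Scale reference counts so expected and observed totals match
    expected = reference_counts * current_counts.sum() / reference_counts.sum()
    stat, p_value = stats.chisquare(current_counts, f_exp=expected)
    return {"chi2_statistic": stat, "p_value": p_value, "drift": p_value < significance}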

# Example usage:
income_ref = reference['monthly_income'].values
income_cur = current['monthly_income'].values

print(f"PSI (income): {psi(income_ref, income_cur):.4f}")
print(ks_drift_test(income_ref, income_cur))
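
The usage example above exercises only the continuous-feature tests. For a categorical column, the chi-squared statistic can also be worked by hand, as in this standard-library sketch. The column name `employment_type` and all counts are hypothetical. Instead of a p-value, it compares the statistic against 5.991, the chi-squared critical value for 2 degrees of freedom at 0.05 significance.

```python
# Hypothetical category counts for an "employment_type" column
ref_counts = {"salaried": 6000, "self_employed": 3000, "gig": 1000}   # training
cur_counts = {"salaried": 500, "self_employed": 300, "gig": 200}      # production

ref_total = sum(ref_counts.values())
cur_total = sum(cur_counts.values())

chi2 = 0.0
for category, ref_n in ref_counts.items():
    expected = ref_n * cur_total / ref_total    # scale reference to production size
    observed = cur_counts.get(category, 0)
    chi2 += (observed - expected) ** 2 / expected

# Degrees of freedom = number of categories - 1 = 2
CRITICAL_DF2_ALPHA_05 = 5.991
drift = chi2 > CRITICAL_DF2_ALPHA_05
print(f"chi2 = {chi2:.2f}, drift = {drift}")
```

Here the share of gig workers doubles from 10% to 20%, which alone pushes the statistic far past the critical value.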

Section 4

Retraining Triggers and A/B Testing

Two strategies for deciding when to retrain:

Scheduled retraining: Retrain every N days regardless of drift. Simple but wasteful — you retrain even when the model is fine.
Metric-based retraining: Retrain when PSI > 0.25 OR accuracy drops below threshold. Efficient but requires monitoring infrastructure.
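
The two strategies can be combined: metric triggers fire first, with a schedule as a fallback. The sketch below is one possible shape for such a rule, not a standard API. `RetrainPolicy` and `should_retrain` are hypothetical names; the 0.25 PSI threshold comes from Section 3, while the 30-day limit and 0.85 accuracy floor are assumed values you would tune.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RetrainPolicy:
    max_days_between_retrains: int = 30   # assumed fallback schedule
    psi_threshold: float = 0.25           # "significant drift" threshold
    min_accuracy: float = 0.85            # assumed accuracy floor

def should_retrain(days_since_retrain, max_feature_psi, accuracy, policy=RetrainPolicy()):
    """Return (decision, reason). Pass accuracy=None while labels are still delayed."""
    if max_feature_psi > policy.psi_threshold:
        return True, "psi"
    if accuracy is not None and accuracy < policy.min_accuracy:
        return True, "accuracy"
    if days_since_retrain >= policy.max_days_between_retrains:
        return True, "schedule"
    return False, "healthy"

print(should_retrain(days_since_retrain=5, max_feature_psi=0.32, accuracy=None))
print(should_retrain(days_since_retrain=5, max_feature_psi=0.05, accuracy=None))
```

Returning the reason alongside the decision makes it easier to see later which trigger actually drives your retraining.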

# ── A/B test two model versions in production ────────────────────
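import hashlib

def route_request(user_id: str, request_data: dict) -> dict:
    """Route 10% of traffic to new model v2 for comparison."""
    # Python's built-in hash() is salted per process for strings, so it is not
    # stable across restarts or servers. A SHA-256 digest always gives the same bucket.
    bucket = int(hashlib.sha256(user_id.encode("utf-8")).hexdigest(), 16) % 100
    use_v2 = bucket < 10   # 10% to v2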

    if use_v2:
        result = model_v2.predict(request_data)
        result['model_version'] = 'v2'
    else:
        result = model_v1.predict(request_data)
        result['model_version'] = 'v1'

    # Log everything for later analysis
    log_prediction(user_id, request_data, result)
    return result

# After 30 days (once ground truth labels are available):
# Compare v1 vs v2 accuracy, false positive rate, and PSI of their inputs
# If v2 wins on all metrics with statistical significance → promote to 100%

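# ── Complete monitoring pipeline (runs daily via cron/Airflow) ───
# load_training_data, load_last_7_days_production, MONITORED_FEATURES and
# send_alert are placeholders for your own data access and alerting code.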
def daily_monitoring_check():
    ref_data = load_training_data()         # stored in your data warehouse
    prod_data = load_last_7_days_production()

    # 1. Check data drift
    drift_results = {}
    for col in MONITORED_FEATURES:
        drift_results[col] = ks_drift_test(ref_data[col], prod_data[col])

    drifted_cols = [c for c,r in drift_results.items() if r['drift']]

    # 2. Check prediction distribution
    ref_approval_rate = ref_data['approved'].mean()
    cur_approval_rate = prod_data['model_prediction'].mean()

    # 3. Alert if necessary
    alerts = []
    if len(drifted_cols) > 3:
        alerts.append(f"DATA DRIFT: {drifted_cols}, consider retraining")

    if abs(cur_approval_rate - ref_approval_rate) > 0.15:
        alerts.append(f"PREDICTION SHIFT: approval rate {cur_approval_rate:.1%} vs baseline {ref_approval_rate:.1%}")

    for message in alerts:
        send_alert(message)

    # Report "alert" rather than "ok" when any check fired
    return {"status": "alert" if alerts else "ok", "drifted_columns": drifted_cols}
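
The A/B comments above ask for v2 to win "with statistical significance". One common way to check that for accuracy is a two-proportion z-test, sketched below using only the standard library. The function name and the request counts are hypothetical; with real data you would plug in the labelled outcomes logged for each model version.

```python
import math

def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for a difference between two accuracy rates."""
    p_a, p_b = correct_a / n_a, correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF expressed through the error function
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical counts: v1 got 90% of traffic, v2 got 10%
z, p_value = two_proportion_z_test(correct_a=7650, n_a=9000, correct_b=890, n_b=1000)
print(f"v1 accuracy: {7650 / 9000:.1%}, v2 accuracy: {890 / 1000:.1%}")
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

Because v2 sees only 10% of traffic, its sample is small. A modest accuracy gain may need more than 30 days of data before the test can separate it from noise.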

The label delay problem: For loan defaults, you only know if a loan was "correctly approved" 30–90 days later. During this window, you can only monitor data drift and prediction distribution — not accuracy. Design your monitoring to work with delayed labels.
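
One way to design for delayed labels is to score only the predictions that are old enough to have an outcome, and to report how many were scored. The sketch below assumes a 30-day maturity window and a simple in-memory prediction log. The data layout, the loan IDs, and the `matured_accuracy` helper are all hypothetical.

```python
from datetime import date, timedelta

LABEL_DELAY = timedelta(days=30)   # assumed maturity window

# Hypothetical prediction log: (loan_id, prediction_date, predicted_decision)
predictions = [
    ("L1", date(2024, 1, 5), 1),
    ("L2", date(2024, 1, 20), 0),
    ("L3", date(2024, 2, 25), 1),   # too recent, no label yet
]
# Hypothetical outcomes that have arrived: loan_id -> correct decision (1 = approve)
labels = {"L1": 1, "L2": 1}

def matured_accuracy(predictions, labels, today):
    """Accuracy over predictions old enough to be labelled, plus how many were scored."""
    scored = [
        (pred, labels[loan_id])
        for loan_id, made_on, pred in predictions
        if today - made_on >= LABEL_DELAY and loan_id in labels
    ]
    if not scored:
        return None, 0
    correct = sum(1 for pred, truth in scored if pred == truth)
    return correct / len(scored), len(scored)

accuracy, n_scored = matured_accuracy(predictions, labels, today=date(2024, 3, 1))
print(accuracy, n_scored)   # → 0.5 2
```

Returning `None` when nothing has matured keeps "no labels yet" distinct from "0% accuracy". During that window, the drift and approval-rate checks are the only available signals.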

📉 Lesson 6 Quiz — Production ML Monitoring

1. Preethi's model accuracy dropped from 91% to 61% after 3 months without code changes. The most likely cause is:

a) The FastAPI server ran out of memory and started returning random predictions

b) Data drift or concept drift — the real-world distribution of borrower characteristics and/or the relationship between features and default risk changed after deployment. The model still encodes the 2022 patterns perfectly; the world has moved on. This is one of the most common failures in production ML.

c) The model overfitted to the test set during training and the effect became visible over time

d) Python garbage collection was incorrectly collecting the model weights

2. PSI = 0.32 for the "monthly_income" feature means:

a) 32% of the production borrowers have monthly income above the training mean

b) Significant drift — the income distribution of current applicants differs substantially from training data. PSI > 0.25 indicates the population has shifted enough that model performance is likely degraded. Retraining should be triggered immediately.

c) The model has 32% higher false positive rate than at training time

d) Moderate drift — investigate but no immediate action needed (PSI 0.10–0.25 threshold)

3. The KS test returns p_value = 0.003 for a feature (significance threshold = 0.05). This means:

a) The feature has 0.3% importance in the model — it can be safely dropped

b) The two distributions (training vs production) are significantly different — p_value < 0.05 means we reject the null hypothesis that the distributions are the same. Drift is detected. The smaller the p-value, the stronger the evidence of drift.

c) p_value > 0.003 is our threshold, so no drift is detected

d) The feature is not important enough to cause model degradation

4. Concept drift means P(Y|X) has changed while P(X) may be the same. A real-world example is:

a) The age distribution of loan applicants shifts from mostly 25–35 to mostly 35–50

b) A CIBIL score of 750 was considered "excellent" and strongly predicted repayment in 2022. In 2024, inflation and job market changes mean a 750-score borrower now has 2x higher default probability. The features are the same; the meaning of those features for predicting default has changed.

c) The model receives more API requests than it was designed to handle

d) The model was trained on data with more class imbalance than the production data

5. A/B testing model versions in production hashes each user_id into one of 100 buckets and routes buckets below 10 (10% of users) to v2. Using a stable hash of user_id instead of random() ensures:

a) Users with high credit scores always get the better model version

b) The same user always receives the same model version across all their requests. This gives a coherent user experience and keeps the two test groups from contaminating each other: if a user randomly switched between v1 and v2 on different requests, you couldn't cleanly attribute downstream outcomes (like loan repayment) to one model version.

c) The split is exactly 10%/90% rather than an approximation from random sampling

d) Hashing is faster than random() for high-traffic production APIs

6. For loan default prediction, ground truth labels (did the borrower actually default?) are only available 30–90 days after the prediction was made. During this window, the monitoring system should:

a) Stop all monitoring until ground truth labels are available

b) Monitor data drift (feature distributions), prediction distribution (approval rate), and upstream data quality — all of which don't require ground truth. These are leading indicators of model degradation. Performance metrics can be computed retrospectively once the label delay has passed.

c) Use the model's own confidence scores as a proxy for ground truth accuracy

d) Retrain the model every 30 days regardless of drift as a safety measure

7. Metric-based retraining (trigger when PSI > 0.25) is generally preferred over scheduled retraining (every 7 days) because:

a) Scheduled retraining requires more expensive cloud infrastructure than metric-based

b) Scheduled retraining wastes compute retraining a healthy model and may still miss fast-occurring drift if it happens between schedules. Metric-based triggers retrain only when evidence of drift is present, reducing unnecessary compute cost while also catching unexpected sudden shifts.

c) Regulatory compliance in India mandates metric-based triggers for financial models

d) Scheduled retraining causes overfitting because the model sees the same data repeatedly

8. Upstream data drift (a data pipeline change affecting feature values) is the most dangerous drift type because:

a) Upstream drift always causes immediate model failure within minutes of occurring

b) It's often silent and misdiagnosed as model degradation rather than a data engineering problem. If the CIBIL bureau changes their scoring formula and your ETL pipeline doesn't update the transformation, the feature values delivered to your model are systematically wrong — but this looks identical to concept drift. Root-cause analysis is required to distinguish them.

c) Upstream drift cannot be detected by statistical tests — only manual inspection finds it

d) It causes data drift but never concept drift since the model's logic is unchanged
