Lesson 7 — ML Pipelines and Experiment Tracking | Class 10

Meet Arjun — Class 10, Indore

Arjun was building a loan default predictor for a class project. He'd tried Random Forest with 100 trees, then 200. He'd tried with StandardScaler, then without. He'd tried with max_depth=5, then 10. After a week, he had 15 Jupyter notebooks with names like "final_v3_actually_final_this_one.ipynb" and could no longer answer two basic questions: "Which run got 84%? What hyperparameters did I use?"

His mentor laughed and said: "You need experiment tracking. Professional ML teams use MLflow for exactly this. Every run you log is stored in one place: parameters, metrics, plots, model file. You can compare all runs in one view and load the best model with one call." Arjun set it up in 20 minutes and immediately found his best run.

The Problem

Why Ad-hoc Notebooks Don't Scale

No reproducibility: You can't reliably rerun a notebook from 3 weeks ago, because the data may have changed, the result depends on cell execution order, and the environment may differ.
No comparison: Mentally comparing 10 runs stored in separate notebooks is error-prone.
Leaking information: If you apply scaling before the train/test split, test data statistics leak into your scaler. An sklearn Pipeline that is fitted only on the training split prevents this.
No model registry: "Which pickle file is the one deployed in production?" becomes a dangerous question.

Two tools fix this: sklearn Pipeline for preventing data leakage, and MLflow for tracking experiments.

Tool 1

sklearn Pipeline — Preventing Data Leakage

A Pipeline chains preprocessing steps + model into one object. When you call pipeline.fit(X_train, y_train), the scaler is fitted only on training data. When you call pipeline.predict(X_test), it uses the training scaler to transform test data — never fits on test data. This is the correct way.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer

# Without Pipeline (WRONG — leaks test data into scaler):
# scaler.fit(X)         # oops, X includes test data
# X_scaled = scaler.transform(X)
# X_train, X_test = train_test_split(X_scaled, ...)

# With Pipeline (CORRECT):
# Assumes X_train, X_test, y_train, y_test come from train_test_split on the raw, unscaled data.
pipeline = Pipeline([
    ('imputer',  SimpleImputer(strategy='median')),  # fill missing values
    ('scaler',   StandardScaler()),                  # standardise features (zero mean, unit variance)
    ('model',    RandomForestClassifier(n_estimators=100, random_state=42))
])
# fit() learns the imputer medians and scaler statistics from X_train only:
pipeline.fit(X_train, y_train)
# predict() transforms X_test with those training statistics, then predicts:
y_pred = pipeline.predict(X_test)
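
One way to convince yourself that a Pipeline fits its scaler on training data only is to inspect the fitted statistics. The sketch below is a minimal check on made-up data, not part of Arjun's project: the dataset size, the LogisticRegression model, and the variable names are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (illustrative only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500)),
])
pipe.fit(X_train, y_train)

# The means stored inside the fitted scaler
fitted_mean = pipe.named_steps["scaler"].mean_

# Compare them with the training-only means and with the full-data means
train_only = np.allclose(fitted_mean, X_train.mean(axis=0))
matches_full = np.allclose(fitted_mean, X.mean(axis=0))

print("Scaler matches training data only:", train_only)
print("Scaler matches full data (would indicate leakage):", matches_full)
```

If the scaler had instead been fitted on all of X before splitting, the two checks would swap: the stored means would match the full data rather than the training split.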

Tool 2

MLflow — Experiment Tracking

# ML Pipelines + MLflow Experiment Tracking — Google Colab
!pip install mlflow scikit-learn pandas -q

import mlflow
import mlflow.sklearn
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.impute import SimpleImputer

# ── Step 1: Create a synthetic dataset (simulates loan default prediction) ──
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=8,
    n_redundant=3, random_state=42
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_df = pd.DataFrame(X, columns=feature_names)

# Add some missing values (realistic)
X_df.iloc[50:70, 2] = np.nan
X_df.iloc[200:220, 7] = np.nan

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, test_size=0.2, random_state=42
)

# ── Step 2: Set up MLflow experiment ──
mlflow.set_experiment("loan_default_predictor")

def run_experiment(model_name, model, params):
    """Train model in a Pipeline and log everything to MLflow."""
    with mlflow.start_run(run_name=model_name):
        # Log parameters
        mlflow.log_params(params)
        mlflow.log_param("model_type", model_name)
        mlflow.log_param("train_size", len(X_train))

        # Build and train pipeline
        pipeline = Pipeline([
            ('imputer', SimpleImputer(strategy='median')),
            ('scaler',  StandardScaler()),
            ('model',   model)
        ])
        pipeline.fit(X_train, y_train)

        # Evaluate
        y_pred  = pipeline.predict(X_test)
        y_proba = pipeline.predict_proba(X_test)[:, 1]

        accuracy = accuracy_score(y_test, y_pred)
        f1       = f1_score(y_test, y_pred)
        auc_roc  = roc_auc_score(y_test, y_proba)

        # Cross-validation on the training set (default scoring for classifiers is accuracy)
        cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)

        # Log metrics
        mlflow.log_metric("accuracy",    accuracy)
        mlflow.log_metric("f1_score",    f1)
        mlflow.log_metric("auc_roc",     auc_roc)
        mlflow.log_metric("cv_mean",     cv_scores.mean())
        mlflow.log_metric("cv_std",      cv_scores.std())

        # Save model
        mlflow.sklearn.log_model(pipeline, "pipeline")

        print(f"\n{model_name}")
        print(f"  Accuracy: {accuracy:.3f}  F1: {f1:.3f}  AUC: {auc_roc:.3f}")
        print(f"  CV: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")

        return {"model": model_name, "accuracy": accuracy,
                "f1": f1, "auc": auc_roc}

# ── Step 3: Run multiple experiments ──
results = []

results.append(run_experiment(
    "LogisticRegression_C1",
    LogisticRegression(C=1.0, max_iter=500),
    {"C": 1.0, "max_iter": 500}
))

results.append(run_experiment(
    "LogisticRegression_C0.1",
    LogisticRegression(C=0.1, max_iter=500),
    {"C": 0.1, "max_iter": 500}
))

results.append(run_experiment(
    "RandomForest_100",
    RandomForestClassifier(n_estimators=100, max_depth=None, random_state=42),
    {"n_estimators": 100, "max_depth": "None"}
))

results.append(run_experiment(
    "RandomForest_d5",
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
    {"n_estimators": 100, "max_depth": 5}
))

results.append(run_experiment(
    "GradientBoosting",
    GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42),
    {"n_estimators": 100, "learning_rate": 0.1}
))

# ── Step 4: Compare results ──
results_df = pd.DataFrame(results).sort_values("auc", ascending=False)
print("\n── Experiment Comparison ──")
print(results_df.to_string(index=False))

# ── Step 5: Load the best model from MLflow ──
best_run_name = results_df.iloc[0]["model"]
print(f"\nBest model: {best_run_name}")

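# Find the run ID programmatically. If this cell has run more than once, several
# runs share the name; search_runs returns the most recent first by default.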
runs = mlflow.search_runs(experiment_names=["loan_default_predictor"],
                           filter_string=f"tags.mlflow.runName = '{best_run_name}'")
run_id = runs.iloc[0].run_id
best_pipeline = mlflow.sklearn.load_model(f"runs:/{run_id}/pipeline")
print("Best model loaded successfully!")
print(f"Test accuracy: {accuracy_score(y_test, best_pipeline.predict(X_test)):.3f}")

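# ── View MLflow UI ──
# In Colab, `!mlflow ui --port 5000` keeps the cell busy until you stop it.
# One option is to start it in the background instead:
#   get_ipython().system_raw("mlflow ui --port 5000 &")
# Then expose the port with pyngrok (needs `pip install pyngrok` and an ngrok auth token):
#   from pyngrok import ngrok; print(ngrok.connect(5000))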
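
The cross-validation step above passes the whole Pipeline to cross_val_score rather than pre-scaled data. The reason is that the pipeline is refitted from scratch on each fold, so each fold's scaler sees only that fold's training portion. The sketch below checks this on made-up data. It is an illustration under assumptions: it uses cross_validate with return_estimator=True and an explicit KFold (instead of cv=5) so that the splits can be reproduced, and the dataset and model are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset (illustrative only)
X, y = make_classification(n_samples=300, n_features=6, random_state=1)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=500)),
])

# A seeded KFold yields the same splits every time split() is called
cv = KFold(n_splits=5, shuffle=True, random_state=1)
out = cross_validate(pipe, X, y, cv=cv, return_estimator=True)

# For each fold, compare the fitted scaler's means with that fold's training rows
fold_ok = []
for (train_idx, _), fitted in zip(cv.split(X), out["estimator"]):
    fold_mean = fitted.named_steps["scaler"].mean_
    fold_ok.append(bool(np.allclose(fold_mean, X[train_idx].mean(axis=0))))

print("Per-fold scaler fitted on fold training rows only:", fold_ok)
print("Mean CV accuracy:", out["test_score"].mean())
```

Scaling the data once and then cross-validating only the model would let every fold's validation rows influence the scaler, which is the same leakage the Pipeline is meant to avoid.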

Illustrative results, as they might appear in the MLflow dashboard (your exact numbers will differ):

Model	Accuracy	F1 Score	AUC-ROC	CV Mean
GradientBoosting	0.870	0.863	0.942	0.861
RandomForest_100	0.855	0.848	0.928	0.843
RandomForest_d5	0.840	0.831	0.912	0.837
LogisticRegression_C1	0.815	0.807	0.889	0.812
LogisticRegression_C0.1	0.800	0.792	0.876	0.797

Arjun's outcome: After adding MLflow, he ran 12 experiments in 30 minutes. The MLflow UI showed GradientBoosting with AUC 0.942 as the clear winner. He loaded it with one line, packaged it for the school's science fair, and showed his judges the experiment comparison table to prove his methodology was rigorous.

🧪 Check Your Understanding — Lesson 7 Quiz

1. Data leakage in ML occurs when:

a) Your model is too large to fit in memory

b) Information from the test set influences preprocessing or model training — making test accuracy unrealistically optimistic

c) Private user data is accidentally exposed on the internet

d) Your training dataset is too small

2. sklearn's Pipeline solves data leakage by:

a) Downloading fresh data from the internet before each run

b) Ensuring preprocessing steps (like scalers) are fit only on training data, then consistently applied to test/validation data without seeing test statistics

c) Encrypting the training data before model training

d) Automatically splitting data into train and test sets

3. In MLflow, `mlflow.log_param("n_estimators", 100)` logs:

a) A performance metric that changes during training

b) A fixed hyperparameter configuration for this experiment run — so you can reproduce the exact run later

c) The model's predictions on the test set

d) The number of samples in your dataset

4. Why is AUC-ROC often preferred over simple accuracy for evaluating a loan default classifier?

a) AUC-ROC is always a higher number than accuracy

b) AUC-ROC measures model performance across all classification thresholds and is more meaningful when classes are imbalanced: a model that always predicts "no default" could have high accuracy yet an AUC no better than random guessing (0.5)

c) Accuracy cannot be calculated for binary classification problems

d) AUC-ROC is faster to compute than accuracy

5. `mlflow.sklearn.log_model(pipeline, "pipeline")` saves:

a) Only the model's Python code

b) The entire trained Pipeline (preprocessor + model) as a versioned artefact linked to this run — loadable later with mlflow.sklearn.load_model()

c) A screenshot of the model's performance graphs

d) The training dataset as a CSV file

6. Cross-validation (cv=5) in the code runs training how many times?

a) Once, using 5% of the data

b) 5 times, each time using a different 80%/20% split of the training data — the mean score is a more reliable performance estimate than a single split

c) 5 times using all available data each time

d) 100 times with random sampling

7. `SimpleImputer(strategy='median')` in the Pipeline is used to:

a) Remove all rows with missing values from the dataset

b) Fill missing values with the median of that feature — computed on training data only — making the model robust to real-world missing data

c) Generate synthetic data to replace missing values using a neural network

d) Encode categorical columns as numbers

8. `mlflow.set_experiment("loan_default_predictor")` does what?

a) Automatically runs all experiments defined in the file

b) Creates (or selects if existing) an MLflow experiment group — all subsequent runs are logged under this name for easy comparison

c) Trains a logistic regression model named after the experiment

d) Sets the test size to 0.2 for all models
