Lesson 10 — AI Product Management | Class 12

Story

Devansh's First PM Role

👨‍💼 Devansh · Ahmedabad · Age 17

Devansh interned at an Ahmedabad fintech startup. They wanted to add an "AI categorise my expenses" feature. He wasn't writing the model — he was the product manager: defining what success looks like, what could go wrong, and how to ship it without breaking the existing app.

The model was 84% accurate in offline testing. Six weeks later, after a shadow deployment, an A/B test, and a gradual rollout, the feature reached 100% of users with a 2.3% increase in 7-day retention. The PM craft mattered more than the model.

PRD

The AI PRD Template

An AI Product Requirements Document differs from a regular PRD in five sections:

Problem statement (1 paragraph): Who hurts, how badly, today. "Users spend 6 minutes/week categorising 80 expenses by hand. 40% give up after a month."
Success metrics: Both offline (model accuracy ≥ 80%) and online (% of users who manually correct < 15%, +2% week-2 retention).
Failure modes: What happens when the model is wrong? A wrong category is reversible (one tap to fix); a mistake in PIN entry would not be. List the worst plausible failure and the mitigation.
Data requirements: Source, labels, refresh cadence, privacy controls, DPDPA basis (consent? legitimate use?). Will we send transaction text to a third-party API? If yes, to whom, and under what data processing agreement (DPA)?
Rollout plan: Shadow → 1% canary → 10% → 50% → 100%. Kill switch at every stage. Define what triggers rollback.
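One way to keep the success metrics honest is to write them down as data and check them mechanically. The sketch below is only an illustration: the class and function names are made up, the thresholds are the ones from the example PRD above, and the correction-rate reading is an invented number.

```python
from dataclasses import dataclass


@dataclass
class SuccessCriteria:
    """Targets copied from the example PRD; field names are illustrative."""
    min_offline_accuracy: float = 0.80        # model accuracy >= 80%
    max_manual_correction_rate: float = 0.15  # under 15% of users correct a category
    min_retention_lift: float = 2.0           # +2 on week-2 retention, in the PRD's units


def unmet_criteria(criteria, offline_accuracy, correction_rate, retention_lift):
    """Return a readable list of the criteria the feature currently misses."""
    misses = []
    if offline_accuracy < criteria.min_offline_accuracy:
        misses.append("offline accuracy below target")
    if correction_rate >= criteria.max_manual_correction_rate:
        misses.append("manual correction rate too high")
    if retention_lift < criteria.min_retention_lift:
        misses.append("retention lift below target")
    return misses


# Hypothetical readings: 0.18 correction rate is invented for illustration.
print(unmet_criteria(SuccessCriteria(), 0.84, 0.18, 2.3))
```

An empty list means every target is met; anything else is a reason to hold the rollout at its current stage.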

Metrics

Offline vs Online Metrics

Offline Metrics (Model Quality)

Accuracy, F1, AUC, MAPE on a held-out test set. Cheap to measure, but they don't predict business impact.
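To make two of these concrete, here is a minimal sketch that computes accuracy and macro-averaged F1 by hand. The labels and predictions are a tiny invented test set, and the function names are illustrative; in practice a library such as scikit-learn would usually do this.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted average of per-class F1, so rare categories count equally."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


# Invented held-out labels and model predictions.
y_true = ["food", "food", "travel", "rent", "food", "travel"]
y_pred = ["food", "travel", "travel", "rent", "food", "food"]

print(f"accuracy={accuracy(y_true, y_pred):.2f}, macro F1={macro_f1(y_true, y_pred):.2f}")
```

The two numbers differ because macro F1 gives each category equal weight, which matters when some expense categories are much rarer than others.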

Online Metrics (Business Outcome)

Retention, conversion, revenue per user, NPS. These are the metrics that actually matter, but they can only be measured in production via an A/B test.

Guardrail Metrics

Things that must not regress: latency P99, crash rate, support ticket volume. Catch bad releases that the success metric misses.
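A guardrail check can be as simple as comparing treatment against control with an allowed tolerance per metric. The sketch below assumes that a higher value is worse for every guardrail listed; the metric names, readings, and tolerances are all invented for illustration.

```python
def guardrail_regressions(control, treatment, tolerances):
    """Return guardrails where treatment is worse than control by more than
    the allowed relative tolerance. Assumes higher values are worse."""
    regressions = {}
    for name, allowed in tolerances.items():
        relative_change = (treatment[name] - control[name]) / control[name]
        if relative_change > allowed:
            regressions[name] = relative_change
    return regressions


# Invented readings from the two arms of a test.
control = {"latency_p99_ms": 420.0, "crash_rate": 0.0020, "tickets_per_1k_users": 3.1}
treatment = {"latency_p99_ms": 510.0, "crash_rate": 0.0020, "tickets_per_1k_users": 3.2}
# Allowed relative worsening per metric (hypothetical thresholds).
tolerances = {"latency_p99_ms": 0.10, "crash_rate": 0.05, "tickets_per_1k_users": 0.10}

for name, change in guardrail_regressions(control, treatment, tolerances).items():
    print(f"{name} regressed by {change:.0%}")
```

Here only P99 latency is flagged, so the release would be blocked even if the success metric improved.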

Counter Metrics

Watch for unintended harm: complaint rate, accessibility breaks, demographic accuracy disparities.
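Accuracy disparities only show up if accuracy is broken down by group. The following sketch assumes each logged prediction carries a group attribute; grouping by app language is a hypothetical choice, and the records are invented.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, true_label, predicted_label in records:
        total[group] += 1
        correct[group] += int(true_label == predicted_label)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(records):
    """Difference between the best-served and worst-served group."""
    per_group = accuracy_by_group(records)
    return max(per_group.values()) - min(per_group.values())


# Invented records, grouped by a hypothetical app-language attribute.
records = [
    ("en", "food", "food"), ("en", "rent", "rent"), ("en", "travel", "travel"),
    ("en", "food", "food"), ("en", "travel", "food"),
    ("gu", "food", "food"), ("gu", "rent", "other"), ("gu", "travel", "travel"),
    ("gu", "food", "other"), ("gu", "rent", "rent"),
]

print(accuracy_by_group(records))
print(f"largest gap: {max_accuracy_gap(records):.2f}")
```

A model can look fine on overall accuracy while one group is served noticeably worse, which is why the gap is tracked as its own number.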

The PM trap: Optimising for offline accuracy alone often regresses business metrics — a 92%-accurate model that is slow or scary can lose more revenue than the 84% model that is fast and friendly.

Rollout

Shadow → Canary → Full

Stage	What Happens	Trigger to Advance
1. Offline eval	Test model on held-out historical data	Beats baseline by ≥ 10% on F1
2. Shadow	Run model on 100% of live traffic but show old result; log both	Metrics on live traffic within 5% of offline results
3. 1% canary	1% of users see new feature	No regression in guardrails for 5 days
4. A/B test (50/50)	Random 50% see new, 50% see old; measure online metrics	p-value < 0.05 on success metric, no guardrail regression
5. 100% rollout	Everyone gets the feature	—
6. Monitor + iterate	Dashboards, weekly review, retrain cadence	—
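The shadow stage is the least familiar one, so here is a minimal sketch of it. Both categorisers are toy stand-ins (a real model would be a trained classifier), the transaction strings are invented, and agreement rate is just one diagnostic a team might log.

```python
def rule_based_category(text):
    """Stand-in for the existing (old) categoriser."""
    return "travel" if "uber" in text.lower() else "other"


def model_category(text):
    """Stand-in for the new model."""
    keywords = {"uber": "travel", "swiggy": "food"}
    for keyword, category in keywords.items():
        if keyword in text.lower():
            return category
    return "other"


def categorise_in_shadow(text, shadow_log):
    """Serve the old result; record the new model's answer without showing it."""
    shown = rule_based_category(text)
    try:
        candidate = model_category(text)
    except Exception:
        candidate = None  # a model failure must never change what the user sees
    shadow_log.append({"shown": shown, "candidate": candidate})
    return shown


def agreement_rate(shadow_log):
    """Share of requests where old and new systems gave the same category."""
    same = sum(entry["shown"] == entry["candidate"] for entry in shadow_log)
    return same / len(shadow_log)


shadow_log = []
shown_to_users = [categorise_in_shadow(text, shadow_log)
                  for text in ["UBER TRIP 123", "SWIGGY ORDER", "ATM WDL"]]
print(shown_to_users)  # users only ever see the old system's output
print(f"agreement: {agreement_rate(shadow_log):.2f}")
```

The disagreements in the log are the interesting part: they are the cases to inspect by hand before any user sees the new output.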

Always have a kill switch: a single feature flag that can disable the AI feature within 60 seconds. AI features have failure modes that traditional code does not: model decay, adversarial inputs, prompt injection. Without a kill switch you ship without insurance.
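A kill switch and a percentage rollout can share one small gate. In this sketch the flag names are hypothetical and the in-memory dictionary stands in for whatever flag service the team uses; how quickly a flag change reaches devices is not modelled here.

```python
import hashlib

# Hypothetical flags; a real app would read these from a remote config service.
FLAGS = {"ai_categorise_enabled": True, "ai_categorise_rollout_percent": 1}


def bucket_for(user_id):
    """Map a user to a stable bucket in 0..99 so they stay in the same cohort."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def sees_ai_feature(user_id, flags=FLAGS):
    """The kill switch always wins; otherwise admit users below the rollout percent."""
    if not flags["ai_categorise_enabled"]:
        return False
    return bucket_for(user_id) < flags["ai_categorise_rollout_percent"]


# Widening the rollout is a config change, not a new release.
half = {"ai_categorise_enabled": True, "ai_categorise_rollout_percent": 50}
killed = {"ai_categorise_enabled": False, "ai_categorise_rollout_percent": 50}
users = [f"user-{i}" for i in range(1000)]
print(sum(sees_ai_feature(u, half) for u in users), "of 1000 users at 50%")
print(sum(sees_ai_feature(u, killed) for u in users), "of 1000 users after kill switch")
```

Hashing the user ID, rather than picking users at random on each request, keeps each person's experience consistent as the percentage grows.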

Stats

A/B Testing — Real Statistical Significance

The most common mistake junior PMs make: declaring a winner from a tiny sample because the average looks better. Real A/B testing:

from scipy import stats

# Treatment: 4,832 users, 1,287 retained at day 7  → 26.6% retention
# Control:   4,801 users, 1,202 retained at day 7  → 25.0% retention

n_t, k_t = 4832, 1287
n_c, k_c = 4801, 1202

# Two-proportion z-test
p_pool = (k_t + k_c) / (n_t + n_c)
se = (p_pool * (1 - p_pool) * (1/n_t + 1/n_c)) ** 0.5
z = ((k_t/n_t) - (k_c/n_c)) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
print(f"Lift: {(k_t/n_t - k_c/n_c)*100:+.2f}pp, z={z:.2f}, p={p_value:.4f}")
# Lift: +1.60pp, z=1.79, p=0.0731 → not significant at α=0.05
# The average looks better, but this sample cannot rule out chance.

Sample size first: Use a power calculator before you start. The answer depends on the baseline rate: to detect a 1.5pp lift from a 25% baseline with 80% power and α=0.05 you need roughly 13,000 users per arm. If your traffic can't deliver that in 2 weeks, the test is underpowered and shouldn't run.
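If no power calculator is at hand, the standard normal-approximation formula for comparing two proportions is short enough to write out. This is a sketch under stated assumptions: a two-sided test, equal arm sizes, and unpooled variance (other calculators may differ slightly). The function name is illustrative, and the standard library's `statistics.NormalDist` is used in place of scipy.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline_rate, lift, alpha=0.05, power=0.80):
    """Approximate users needed in each arm to detect an absolute lift
    in a proportion, using the two-sided normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    p1 = baseline_rate
    p2 = baseline_rate + lift
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / lift ** 2)


# Assumed inputs: 25% baseline retention, 1.5 percentage point lift.
print(users_per_arm(0.25, 0.015))
# Halving the detectable lift roughly quadruples the users needed.
print(users_per_arm(0.25, 0.0075))
```

The required sample grows with the inverse square of the lift, so small expected effects need far more traffic than intuition suggests.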

Devansh's outcome: Feature shipped to 100% of users with +2.3% retention (p=0.003). His PRD template is now the company's standard for every AI feature. He learned that the PM craft — clear metrics, safe rollout, honest measurement — multiplies the value of any model.

📝 Check Your Understanding (8 Questions)

1. What is the difference between offline and online metrics?

a) Offline metrics are computed without internet; online metrics require internet

b) Offline metrics measure model quality on held-out historical data (accuracy, F1, AUC) and are cheap; online metrics measure business outcomes (retention, revenue) only available from production A/B tests — and are the metrics that actually matter

c) Offline metrics are for training, online metrics are for inference

d) There is no meaningful difference

2. Why does Devansh include a 'failure modes' section in his PRD?

a) It is required by Indian fintech regulation

b) AI features fail in ways traditional code does not — wrong predictions, drift, prompt injection. Listing the worst plausible failure and its mitigation forces the team to design for graceful degradation rather than discovering the failure in production

c) It makes the PRD longer and more impressive

d) Failure modes have no real impact on shipping

3. What is the purpose of shadow deployment?

a) To save GPU cost during development

b) Run the new model on 100% of real traffic but do not show its output to users — log predictions side-by-side with the existing system. This validates online metrics, surfaces edge cases, and stress-tests infrastructure with zero user risk before any rollout

c) To test the model on synthetic data

d) To hide the model from competitors

4. What are guardrail metrics and why does Devansh track them in every A/B test?

a) They are metrics displayed on a literal guardrail

b) Things that must NOT regress (P99 latency, crash rate, support ticket volume) — even when the success metric improves, regressions in guardrails can mean the feature is causing hidden harm; without them, a 'winning' test can ship a net-negative experience

c) They measure how many guardrails are deployed

d) They are a synonym for offline metrics

5. Why does Devansh do a power calculation before running the A/B test?

a) It is required by the company's CFO

b) To know how many users per arm are needed to detect the expected lift with adequate statistical power (e.g., 80%) at α=0.05 — running an underpowered test wastes time and produces ambiguous results that cannot reject either hypothesis

c) Power calculations make the test run faster

d) It is the only way to compute p-values

6. Why is a kill switch (single feature flag with 60-second propagation) essential for AI features?

a) It is required by Kubernetes for any deployment

b) AI features have failure modes traditional code lacks — model drift, adversarial inputs, prompt injection, sudden regression after a third-party API change. Without an instant kill switch, every release ships without insurance against catastrophic failures discovered post-launch

c) It speeds up CI/CD pipelines

d) It is required by the AI safety regulator

7. What DPDPA-related question must an AI PRD answer for any feature processing user data?

a) Whether the model is open-source or proprietary

b) What is the lawful basis (consent or legitimate use), where is the data processed (in-country or cross-border), who is the data fiduciary, what data minimisation is applied, and is there a third-party processor agreement (DPA) — DPDPA 2023 requires this clarity before launch

c) What programming language the model is written in

d) Whether the model is bigger than 7 billion parameters

8. What is the central insight of Devansh's lesson about AI product management?

a) The most important PM skill is writing perfect specs

b) The PM craft — clear success metrics, honest measurement, staged rollout with kill switches — multiplies (or destroys) the value of any AI model; an 84%-accurate model shipped well outperforms a 92%-accurate model shipped badly

c) PMs should write the model code themselves

d) AI features should always be shipped to 100% of users on day 1
