Devansh interned at an Ahmedabad fintech startup. They wanted to add an "AI categorise my expenses" feature. He wasn't writing the model — he was the product manager: defining what success looks like, what could go wrong, and how to ship it without breaking the existing app.
The model was 84% accurate in offline testing. Six weeks later, after a shadow deployment, an A/B test, and a gradual rollout, the feature reached 100% of users with a 2.3% increase in 7-day retention. The PM craft mattered more than the model.
An AI Product Requirements Document differs from a regular PRD in five sections:
- Problem statement (1 paragraph): Who hurts, how badly, today. "Users spend 6 minutes/week categorising 80 expenses by hand. 40% give up after a month."
- Success metrics: Both offline (model accuracy ≥ 80%) and online (share of users who manually correct a category < 15%, +2% week-2 retention).
- Failure modes: What happens when the model is wrong? Wrong category is reversible (1 tap to fix); wrong PIN-entry would not be. List the worst plausible failure and the mitigation.
- Data requirements: Source, labels, refresh cadence, privacy controls, DPDPA basis (consent? legitimate use?). Will we send transaction text to a third-party API? If yes, to whom, and under what data processing agreement (DPA)?
- Rollout plan: Shadow → 1% canary → 10% → 50% → 100%. Kill switch at every stage. Define what triggers rollback.
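The percentage stages and kill switch in the rollout plan can be sketched as deterministic hash bucketing. This is a minimal illustration, not the startup's actual implementation: the function names, the salt string, and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib

# Hypothetical stage percentages mirroring the rollout plan; 0 = shadow (nobody sees it).
ROLLOUT_STAGES = [0, 1, 10, 50, 100]


def bucket(user_id: str, salt: str = "expense-categoriser") -> int:
    """Map a user to a stable bucket in [0, 100) by hashing salt + user id."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def sees_feature(user_id: str, rollout_percent: int, kill_switch: bool = False) -> bool:
    """Return True if this user should get the new feature at the current stage."""
    if kill_switch:
        return False  # rollback: everyone falls back to the old behaviour
    return bucket(user_id) < rollout_percent
```

Because the bucket depends only on the user id and the salt, a user who sees the feature at 10% keeps seeing it at 50% and 100%, and flipping the kill switch removes it for everyone at once.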
Offline Metrics (Model Quality)
Accuracy, F1, AUC, MAPE on a held-out test set. Cheap to measure, but they don't reliably predict business impact.
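For a multi-class categoriser, accuracy and macro-averaged F1 can be computed directly from label lists. The sketch below uses plain Python; the category names and the four sample predictions are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Illustrative held-out labels (hypothetical data).
y_true = ["food", "food", "travel", "rent"]
y_pred = ["food", "travel", "travel", "rent"]
print(f"accuracy={accuracy(y_true, y_pred):.2f}, macro-F1={macro_f1(y_true, y_pred):.2f}")
```

Macro-F1 weights every category equally, so a rare category that the model handles badly pulls the score down even when overall accuracy looks healthy.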
Online Metrics (Business Outcome)
Retention, conversion, revenue per user, NPS. These are the metrics that actually matter, but they can only be measured in production via an A/B test.
Guardrail Metrics
Things that must not regress: latency P99, crash rate, support ticket volume. Catch bad releases that the success metric misses.
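A guardrail check can be as simple as comparing treatment against control with a tolerance. The sketch below assumes every guardrail is "lower is better" and uses a hypothetical 5% relative tolerance; the metric names and threshold are illustrative, not values from a real dashboard.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile, e.g. pct=99 for P99 latency."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def guardrail_breaches(control, treatment, max_regression=0.05):
    """Names of lower-is-better metrics where treatment exceeds control by more than the tolerance."""
    return [
        name
        for name, baseline in control.items()
        if treatment[name] > baseline * (1 + max_regression)
    ]


# Hypothetical guardrail readings for the two arms.
control = {"latency_p99_ms": 400, "crash_rate": 0.002}
treatment = {"latency_p99_ms": 450, "crash_rate": 0.002}
print(guardrail_breaches(control, treatment))
```

Here the treatment arm's P99 latency is 12.5% worse than control, so it is flagged, while the unchanged crash rate passes.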
Counter Metrics
Watch for unintended harm: complaint rate, accessibility breaks, demographic accuracy disparities.
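One way to watch for accuracy disparities is to break accuracy down by user segment and track the gap between the best and worst group. The sketch below assumes each logged prediction carries a segment label; the group names and records are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


def max_accuracy_gap(records):
    """Difference between the best-served and worst-served group."""
    per_group = accuracy_by_group(records)
    return max(per_group.values()) - min(per_group.values())


# Hypothetical labelled predictions for two segments.
records = [
    ("group_a", "food", "food"),
    ("group_a", "rent", "rent"),
    ("group_b", "food", "travel"),
    ("group_b", "rent", "rent"),
]
print(accuracy_by_group(records), max_accuracy_gap(records))
```

A large gap is a prompt to investigate the training data for the worse-served group, even when the overall accuracy clears the success threshold.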
| Stage | What Happens | Trigger to Advance |
|---|---|---|
| 1. Offline eval | Test model on held-out historical data | Beats baseline by ≥ 10% on F1 |
| 2. Shadow | Run model on 100% of live traffic but show old result; log both | Model metrics on live traffic within 5% of offline values |
| 3. 1% canary | 1% of users see new feature | No regression in guardrails for 5 days |
| 4. A/B test (50/50) | Random 50% see new, 50% see old; measure online metrics | p-value < 0.05 on success metric, no guardrail regression |
| 5. 100% rollout | Everyone gets the feature | — |
| 6. Monitor + iterate | Dashboards, weekly review, retrain cadence | — |
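The shadow stage in the table can be sketched as a request handler that always serves the old result while logging what the model would have said. This is an assumed structure for illustration only: `handle_request`, `agreement_rate`, and the two categoriser callables are hypothetical names, and a real system would log to durable storage rather than a list.

```python
def handle_request(txn_text, legacy_categorise, model_categorise, shadow_log):
    """Serve the legacy category; run the model silently and log both results."""
    served = legacy_categorise(txn_text)
    try:
        shadow = model_categorise(txn_text)
    except Exception:
        shadow = None  # a model failure must never affect the user in shadow mode
    shadow_log.append({"served": served, "shadow": shadow})
    return served


def agreement_rate(shadow_log):
    """Share of requests where the model matched what was actually served."""
    if not shadow_log:
        return 0.0
    matches = sum(
        entry["shadow"] is not None and entry["shadow"] == entry["served"]
        for entry in shadow_log
    )
    return matches / len(shadow_log)
```

The log deliberately omits the raw transaction text here; whether that text may be stored at all is a question for the data requirements section of the PRD.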
The most common mistake junior PMs make: declaring a winner from a tiny sample because the average looks better. A real A/B test checks whether the lift is statistically significant:
from scipy import stats

# Treatment: 4,832 users, 1,287 retained at day 7 → 26.6% retention
# Control:   4,801 users, 1,202 retained at day 7 → 25.0% retention
n_t, k_t = 4832, 1287
n_c, k_c = 4801, 1202

# Two-proportion z-test with a pooled standard error (H0: both arms share one rate)
p_pool = (k_t + k_c) / (n_t + n_c)
se = (p_pool * (1 - p_pool) * (1/n_t + 1/n_c)) ** 0.5
z = ((k_t/n_t) - (k_c/n_c)) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))  # two-sided
print(f"Lift: {(k_t/n_t - k_c/n_c)*100:+.2f}pp, z={z:.2f}, p={p_value:.4f}")
# Lift: +1.60pp, z=1.79, p=0.0731 → not significant at α=0.05, even though the average looks better
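Sample size first: Use a power calculator before you start. The required sample depends on the baseline rate: to detect a 1.5pp lift with 80% power and α=0.05 at a 25% baseline (the control rate in the example above), you need roughly 13,000 users per arm; ~5,000 per arm is enough only when the baseline is near 7%. If your traffic can't deliver that in 2 weeks, the test is underpowered and shouldn't run.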
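A rough power calculation fits in a few lines using the standard normal-approximation formula for comparing two proportions. The sketch below uses only the Python standard library; `users_per_arm` is a hypothetical helper, and dedicated power calculators may apply slightly different approximations.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(p_control, lift, alpha=0.05, power=0.80):
    """Approximate users per arm to detect an absolute lift (two-sided test)."""
    p_treat = p_control + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    variance = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return ceil((z_alpha + z_beta) ** 2 * variance / lift ** 2)


# 1.5pp absolute lift at two different baseline retention rates.
for baseline in (0.25, 0.07):
    print(baseline, users_per_arm(baseline, 0.015))
```

The required sample grows with the baseline's variance: under this approximation, a 1.5pp lift needs roughly 13,300 users per arm at a 25% baseline, and about 5,000 per arm only when the baseline is near 7%.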