Computed Metrics Grading¶
The single source of truth for "is this campaign good, bad, or neither?" across MCP, reporting, insights, and recommendations. There is no parallel grading path — every consumer bottoms out here.
MCP feed. The MCP serves ComputedMetrics.to_dict() as the canonical campaign payload (get_full_campaign_stats, get_org_campaign_metrics, and per-campaign on get_full_org_stats), not the raw cached StatsResponse. There is no separate "MCP ComputedMetrics"; the MCP is one consumer of this model. See app/mcp/schema_docs.py for the agent-facing contract and reporting.md for the cache it reads.

Reading order: skim TL;DR → scan the section you need → jump to the Appendix for the gnarly bits (windowing, fallback semantics, promoted volume/rate). Each gotcha is anchored so other sections can link straight to it.
TL;DR¶
- Derive a ComputedMetrics from a StatsResponse, per-classifier (RAW / EXPERIMENT / PROMOTED).
- Grade each of its two splits (first_purchase, lifetime) with grade_metric → GOOD | NEUTRAL | BAD | UNKNOWN.
- Decide with decide_campaign(metrics) → PAUSE | SCALE | MONITOR | NO_ACTION. Pure function of grade × classifier.
- Route the decision per-consumer using maturity (JustLaunched / Early / Mature). Maturity does NOT change the decision, only who sees it. See Appendix → Maturity routes, it doesn't decide.
Anchor files¶
| Concept | File |
|---|---|
| ComputedMetrics, MetricGrade, CampaignClassifier, CampaignMaturity, thresholds, grade_metric | app/core/models/insights/computed_metrics.py |
| decide_campaign, per-grader should_notify, headline/criteria/priority strings | app/core/models/insights/grader.py |
| Stats → ComputedMetrics derivation | app/methods/insights/metric_derivation/ |
| Promoted-parent wiring on top of derivation | app/methods/computed_metrics.py |
The shape¶
ComputedMetrics
├── classifier: RAW | EXPERIMENT | PROMOTED
├── spend, weeks_live, mailers_sent
├── first_purchase: MetricSplit + first_purchase_grade: MetricGrade
├── lifetime: MetricSplit + lifetime_grade: MetricGrade
├── maturity: (derived) JustLaunched | Early | Mature ← routing metadata, not a grade input
└── basis: provenance string for MCP/reporting consumers
Each MetricSplit holds revenue, roas, cac, orders, optional uplift, optional win_probability. The two splits are graded independently. basis is provenance only — see Appendix → basis is not a grading input.
Classifier¶
Three classes, controlling which derivation runs and which decisions are eligible:
| Classifier | Trigger | Headline source |
|---|---|---|
| RAW | No holdout, no prior experiment | observed stats |
| EXPERIMENT | campaign.settings.holdout.enabled | current holdout's experiment_results |
| PROMOTED | A HoldoutToFull association exists in campaign_associations | parent's lifetime experiment_results |
Promoted classification requires a campaign_associations lookup: call through app/methods/computed_metrics.py, not compute_metrics directly. See Appendix → Use orchestration helpers for promoted campaigns.
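The trigger table above can be read as a small precedence rule. The following is a minimal sketch, not the real classifier: it assumes PROMOTED is checked before EXPERIMENT (this page does not state the precedence), and the two boolean arguments are hypothetical stand-ins for the campaign_associations lookup and campaign.settings.holdout.enabled.

```python
from enum import Enum


class Classifier(Enum):
    RAW = "RAW"
    EXPERIMENT = "EXPERIMENT"
    PROMOTED = "PROMOTED"


def classify_sketch(has_holdout_to_full_association: bool, holdout_enabled: bool) -> Classifier:
    """Illustrative only; the argument names are hypothetical, not the real API."""
    # Assumption: a HoldoutToFull association takes precedence over a live holdout.
    if has_holdout_to_full_association:
        return Classifier.PROMOTED
    if holdout_enabled:
        return Classifier.EXPERIMENT
    return Classifier.RAW
```

Skipping the association lookup is equivalent to always passing False as the first argument, which is how a promoted campaign would end up classified as RAW or EXPERIMENT.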
Derivation¶
ComputedMetrics is derived on every read — not stored. Same StatsResponse, different fields per classifier.
graph LR
AGG[(campaign_order_aggregates_hourly)] -->|fetch_hourly_aggregates<br/>± date window| HA[hourly rows]
RCP[(campaign_recipients)] -->|fetch_recipient_totals_by_campaign<br/>± date window| RT[recipient totals]
HA --> SUM[sum_aggregates_to_stats] --> SDICT[stats]
RT -->|mailers_sent, total_cost| DERIV[_compute_derived_metrics]
SDICT --> DERIV --> SDICT2[stats + roas/cpa]
HA -->|holdout rows| EXPR[_compute_experiment_results]
RT -->|Holdout count| EXPR --> ER[experiment_results]
SDICT2 --> CLASSIFY[classify]
ER --> CLASSIFY
CLASSIFY -->|RAW| DR[derive_raw]
CLASSIFY -->|EXPERIMENT| DE[derive_experiment]
CLASSIFY -->|PROMOTED| DP[derive_promoted]
DR & DE & DP --> CM[ComputedMetrics]
Field → source mapping¶
For PROMOTED, fields split into volume (from the child's stats) and rate (from the parent's lifetime experiment snapshot). This split is load-bearing — see Appendix → Promoted: volume vs rate.
| Field | RAW | EXPERIMENT | PROMOTED |
|---|---|---|---|
| spend, weeks_live, mailers_sent | stats.* | stats.* | child stats.* |
| first_purchase.revenue | stats.first_purchase_revenue | experiment_results[first_order].incremental_revenue (fallback: stats.first_purchase_revenue) | child_spend × parent_first_purchase_roas |
| first_purchase.orders | stats.first_purchase_orders | experiment_results[first_order].orders (fallback: stats.first_purchase_orders) | child stats.first_purchase_orders |
| first_purchase.roas | stats.first_purchase_roas | experiment_results[first_order].incremental_roas (fallback: 0.0) | parent experiment_results[first_order].incremental_roas |
| first_purchase.cac | spend / orders (or None) | experiment_results[first_order].incremental_customer_acquisition_cost (fallback: 0.0) | parent value, gated on parent's incremental orders |
| lifetime.revenue | stats.revenue ‖ stats.campaign_revenue | experiment_results[all_orders].incremental_revenue (fallback: same as RAW) | child_spend × parent_lifetime_roas |
| lifetime.orders | stats.campaign_orders ‖ stats.orders | experiment_results[all_orders].orders (fallback: same as RAW) | child stats.campaign_orders |
| lifetime.roas | stats.all_time_roas ‖ stats.roas | experiment_results[all_orders].incremental_roas (fallback: 0.0) | parent experiment_results[all_orders].incremental_roas |
| lifetime.cac | None | None | None |
| uplift, win_probability | unused | experiment_results[*].metric_uplift / .win_probability (no fallback) | parent values |
| basis | "raw" | "incremental(holdout)" | "modeled(prior_experiment)" |
The fallback rows for EXPERIMENT and the windowing-induced zeros have surprising downstream effects — see Appendix → Experiment fallback and Appendix → Date windowing.
Maturity (routing, not grading)¶
JustLaunched : weeks_live <= 4 OR mailers_sent < 500
Early : weeks_live <= 10 OR mailers_sent < 5000
Mature : weeks_live > 10 AND mailers_sent >= 5000
A long-running campaign with few mailers is not mature — volume matters. Tunable in CampaignMaturity.derive.
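The three-way rule above can be sketched as a first-match cascade. This follows the rule exactly as written in this section; the enum and function names are illustrative, and the current CampaignMaturity.derive may differ (see Known limitations & open work).

```python
from enum import Enum


class MaturitySketch(Enum):
    JUST_LAUNCHED = "JustLaunched"
    EARLY = "Early"
    MATURE = "Mature"


def derive_maturity_sketch(weeks_live: float, mailers_sent: int) -> MaturitySketch:
    """First matching band wins, so either a short run or low volume holds a campaign back."""
    if weeks_live <= 4 or mailers_sent < 500:
        return MaturitySketch.JUST_LAUNCHED
    if weeks_live <= 10 or mailers_sent < 5000:
        return MaturitySketch.EARLY
    # Reaching here implies weeks_live > 10 AND mailers_sent >= 5000.
    return MaturitySketch.MATURE
```

Under this reading, a campaign live for 30 weeks with only 1,200 mailers is Early, not Mature.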
| Consumer | Maturity gate |
|---|---|
| decide_campaign | none (grade × classifier only) |
| CampaignGrader.should_notify (Slack, CSM insights) | Early+ |
| _evaluate_campaign (writes Recommendation row) | Mature only |
| CampaignGrader.priority | Mature → HIGH; Early + PAUSE → MEDIUM; else LOW |
Maturity is not a decision input — see Appendix → Maturity routes, it doesn't decide.
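To make the gates in the table concrete, here is a hedged sketch that uses plain strings for decisions and maturity levels. The function names are illustrative rather than the real methods, and the sketch ignores any extra volume gate a consumer may apply.

```python
def should_notify_sketch(decision: str, maturity: str) -> bool:
    # Slack / CSM insights: Early or Mature, and something to act on.
    return maturity in ("Early", "Mature") and decision != "NO_ACTION"


def writes_recommendation_sketch(decision: str, maturity: str) -> bool:
    # Recommendation rows exist only for SCALE and PAUSE, and only when Mature.
    return maturity == "Mature" and decision in ("SCALE", "PAUSE")


def priority_sketch(decision: str, maturity: str) -> str:
    if maturity == "Mature":
        return "HIGH"
    if maturity == "Early" and decision == "PAUSE":
        return "MEDIUM"
    return "LOW"
```

Note that none of these functions alters the decision itself; each one only answers whether, and how loudly, a given consumer surfaces it.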
Grading a split¶
grade_metric(roas, cac, weeks_live, has_orders) → MetricGrade is the only grading function. Runs once per split.
UNKNOWN weeks_live < 1 OR not has_orders OR roas is None
GOOD roas >= 2.0 AND (cac is None or cac <= 40)
BAD roas < 1.0 OR cac > 180
NEUTRAL everything else
| Threshold | Value |
|---|---|
| GOOD_ROAS_THRESHOLD | 2.0 |
| GOOD_CAC_THRESHOLD | 40.0 |
| BAD_ROAS_THRESHOLD | 1.0 |
| BAD_CAC_THRESHOLD | 180.0 |
All four live in computed_metrics.py. Do not redefine elsewhere.
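For readers who want to trace a grade by hand, the rules above can be sketched as follows. This is an illustration of the documented rule order (UNKNOWN, then GOOD, then BAD, then NEUTRAL), not the real grade_metric; the thresholds are copied locally only so the snippet runs on its own, and production code should import them from computed_metrics.py.

```python
from enum import Enum
from typing import Optional

# Local copies for illustration only; the real constants live in computed_metrics.py.
GOOD_ROAS_THRESHOLD = 2.0
GOOD_CAC_THRESHOLD = 40.0
BAD_ROAS_THRESHOLD = 1.0
BAD_CAC_THRESHOLD = 180.0


class GradeSketch(Enum):
    GOOD = "GOOD"
    NEUTRAL = "NEUTRAL"
    BAD = "BAD"
    UNKNOWN = "UNKNOWN"


def grade_metric_sketch(
    roas: Optional[float], cac: Optional[float], weeks_live: float, has_orders: bool
) -> GradeSketch:
    if weeks_live < 1 or not has_orders or roas is None:
        return GradeSketch.UNKNOWN
    if roas >= GOOD_ROAS_THRESHOLD and (cac is None or cac <= GOOD_CAC_THRESHOLD):
        return GradeSketch.GOOD
    if roas < BAD_ROAS_THRESHOLD or (cac is not None and cac > BAD_CAC_THRESHOLD):
        return GradeSketch.BAD
    return GradeSketch.NEUTRAL
```

One consequence worth noticing: a split with strong ROAS but CAC above 180 grades BAD, and one with strong ROAS and CAC between 40 and 180 grades NEUTRAL, because the GOOD branch requires both conditions.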
Decision¶
decide_campaign(metrics) → Decision is pure grade × classifier:
flowchart TD
start([ComputedMetrics]) --> grade{lifetime_grade}
grade -->|BAD| pause[PAUSE]
grade -->|GOOD| cls{classifier}
grade -->|NEUTRAL / UNKNOWN| noop[NO_ACTION]
cls -->|EXPERIMENT| scale[SCALE]
cls -->|RAW or PROMOTED| monitor[MONITOR]
Three things that catch people:
- PAUSE fires the moment lifetime_grade == BAD, regardless of maturity. Maturity only gates surfacing.
- SCALE is reserved for live experiments graded GOOD. A doing-well promoted campaign is MONITOR: the scale decision already happened. See Appendix → SCALE is for experiments only; promoted GOOD is MONITOR.
- NEUTRAL (ROAS in [1.0, 2.0)) is intentionally NO_ACTION. See Appendix → NEUTRAL is intentionally NO_ACTION.
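The flowchart above reduces to a few branches. A minimal sketch, using strings in place of the real MetricGrade, CampaignClassifier, and Decision types (the function name and signature are illustrative, not the real decide_campaign):

```python
def decide_sketch(lifetime_grade: str, classifier: str) -> str:
    """Pure function of grade x classifier; maturity is deliberately not an input."""
    if lifetime_grade == "BAD":
        return "PAUSE"
    if lifetime_grade == "GOOD":
        # Only a live experiment can still be scaled; RAW and PROMOTED are monitored.
        return "SCALE" if classifier == "EXPERIMENT" else "MONITOR"
    # NEUTRAL and UNKNOWN both map to no action.
    return "NO_ACTION"
```

Because maturity is absent from the signature, a JustLaunched campaign graded BAD still yields PAUSE; whether anyone sees that decision is decided by the consumer.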
Where the decision is used¶
| Caller | Behavior |
|---|---|
| campaign_grader.py | Wraps in InsightData; should_notify requires Early+ AND non-NO_ACTION. |
| recommendations.py::_evaluate_campaign | SCALE → ScaleExperiment rec, PAUSE → PauseCampaign rec. Requires Mature. |
| CampaignGrader (in grader.py) | Headline/criteria/priority strings (CSM-facing). |
Worked examples¶
| Scenario | weeks_live | mailers_sent | classifier | lifetime ROAS | decide | should_notify | Rec written? |
|---|---|---|---|---|---|---|---|
| Brand-new send, no orders | 2 | 800 | RAW | n/a (UNKNOWN) | NO_ACTION | No | No |
| JustLaunched, BAD | 2 | 800 | RAW | 0.4 | PAUSE | No (maturity) | No |
| Live experiment crushing it (Early) | 6 | 8000 | EXPERIMENT | 3.1 | SCALE | Yes | No (needs Mature) |
| Same experiment, mature | 12 | 20000 | EXPERIMENT | 3.1 | SCALE | Yes | Yes — ScaleExperiment |
| Mature automation, mid-band | 14 | 18000 | RAW | 1.4 | NO_ACTION | No | No |
| Mature automation, underwater | 14 | 18000 | RAW | 0.7 | PAUSE | Yes | Yes — PauseCampaign |
| Promoted send, doing well | 12 | 20000 | PROMOTED | 2.5 | MONITOR | Yes | No |
Tuning¶
Knobs:
- Four numeric thresholds, the maturity rule, the decision tree — computed_metrics.py / grader.py.
- Per-consumer routing gates — next to each consumer (should_notify, _evaluate_campaign's Mature check).
Tune in place. Do not introduce a parallel grading path. If per-org tuning becomes necessary, extend grade_metric / decide_campaign to take an OrgConfig-like argument rather than branching at call sites.
Appendix: Gotchas & things to know¶
Stuff that looks right but isn't, plus the design choices that aren't obvious from the field tables. Each entry has a stable anchor so the sections above can link straight in.
Promoted: volume vs rate¶
A promoted campaign carries forward the rate (ROAS, CAC, uplift, win_probability) that its parent's holdout experiment measured, and multiplies it against the volume (spend, mailers) the child has actually shipped.
Two roles, two sources:
- Volume → child stats. Windowable. Narrows with start_date / end_date.
- Rate → parent's lifetime experiment_results. Never windowed.
Mixing these is a silent data-quality bug — a date-windowed rate projected onto live volume is meaningless. The orchestration layer enforces this at runtime by fetching parent stats through fetch_parent_stats_for_promoted, and derive_promoted only accepts a PriorExperimentSnapshot — a windowed StatsResponse won't type-check as a rate source. Persistence of the snapshot at promotion time is still open; see Known limitations & open work.
Stationarity assumption. Using a lifetime parent ROAS assumes that rate is a roughly stable property of the audience × creative. If a parent experiment is old and the audience has shifted, the modeled revenue will drift from reality. There's no age-out today; consider it if this becomes a complaint.
Example. Child has shipped $10k this month; parent's lifetime incremental ROAS is 2.4. Modeled revenue = $24k — even though the child has no holdout of its own.
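The arithmetic in the example can be sketched as below. The class and function names are hypothetical and do not mirror PriorExperimentSnapshot or derive_promoted; the sketch only shows that the rate stays fixed while the volume is free to narrow. It also leaves out the CAC gating on the parent's incremental orders.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ParentRatesSketch:
    """Hypothetical stand-in for the parent's lifetime rates (never windowed)."""
    first_purchase_roas: float
    lifetime_roas: float


def modeled_revenue_sketch(child_spend: float, parent_roas: float) -> float:
    # Volume comes from the child, rate from the parent.
    return child_spend * parent_roas


parent = ParentRatesSketch(first_purchase_roas=1.8, lifetime_roas=2.4)  # 1.8 is made up
full_month = modeled_revenue_sketch(10_000.0, parent.lifetime_roas)
one_week = modeled_revenue_sketch(2_500.0, parent.lifetime_roas)
```

Narrowing the window from $10k to $2.5k of child spend scales modeled revenue from roughly $24k to roughly $6k, while the parent's ROAS of 2.4 is untouched.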
Date windowing has cascading effects¶
start_date / end_date apply to two underlying queries:
| Source | Date filter? | Affects |
|---|---|---|
| fetch_hourly_aggregates | yes, on hour_bucket | order-derived columns (campaign_revenue, first_purchase_revenue, …) |
| fetch_recipient_totals_by_campaign | yes, on campaign_recipients.created_at | mailers_sent, total_cost, holdout count |
Campaigns send in bursts; orders trickle in over months. A typical "last N days" window excludes the recipient rows (dated at send time) while still capturing orders. Knock-on effects:
- mailers_sent == 0 and total_cost == 0 → roas / first_purchase_roas / cpa coerced to 0.0 (divide-by-zero guard). Note: this is a coerced zero, not None, and it grades as BAD if has_orders is true.
- Holdout recipient_count == 0 → experiment_results returns [], which triggers the experiment fallback.
Per-classifier effect under a date window:
| Classifier | Behavior |
|---|---|
| RAW | Partial degradation: revenue / orders retain real values; roas lands at 0.0. |
| EXPERIMENT | Falls back to observed stats when incremental can't compute. |
| PROMOTED | Volume narrows; rate stays lifetime; modeled_revenue narrows linearly with spend. |
Example. Campaign sent 50,000 pieces in March; orders are still rolling in. Querying "last 30 days" in May returns revenue from late-arriving orders, but mailers_sent = 0, so ROAS reports 0.0.
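The March/May example can be reproduced with a small divide-by-zero guard. This is a sketch of the behavior described above, not the body of _compute_derived_metrics; the helper name and the dollar figures are made up for illustration.

```python
def ratio_or_zero(numerator: float, denominator: float) -> float:
    """Return 0.0 (not None) when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


# Hypothetical "last 30 days" totals queried in May for a March send:
windowed_revenue = 12_500.0   # late-arriving orders fall inside the window
windowed_total_cost = 0.0     # recipient rows are dated at send time, outside the window

windowed_roas = ratio_or_zero(windowed_revenue, windowed_total_cost)
```

Here windowed_roas is 0.0 even though revenue is positive, so any consumer that grades on it without checking mailers_sent would see a BAD campaign.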
Experiment fallback returns roas = 0.0, not None¶
When experiment_results can't compute incremental (e.g. zero holdout in window), derive_experiment falls back to observed values from stats:
incremental_revenue is None AND experiment_orders is None
→ revenue ← stats.first_purchase_revenue (or stats.revenue ‖ campaign_revenue for lifetime)
→ orders ← stats.first_purchase_orders (or stats.campaign_orders ‖ orders for lifetime)
→ roas ← 0.0 (stable-shape signal: "incremental not computable here")
→ first_purchase.cac ← 0.0 (lifetime.cac stays None as always)
→ uplift / win_probability stay None
→ has_orders OR's in has_positive_count(fallback orders) so grading sees them
When incremental is computable, nothing changes.
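A hedged Python rendering of the first_purchase half of the fallback above. The dataclass, the function signature, and the use of a plain dict for stats are assumptions made for the sketch; only the field names and fallback values come from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SplitSketch:
    revenue: Optional[float]
    orders: Optional[float]
    roas: Optional[float]
    cac: Optional[float]


def first_purchase_with_fallback(
    incremental_revenue: Optional[float],
    experiment_orders: Optional[float],
    incremental_roas: Optional[float],
    incremental_cac: Optional[float],
    stats: dict,
) -> SplitSketch:
    # Fallback engages only when neither incremental value could be computed.
    if incremental_revenue is None and experiment_orders is None:
        return SplitSketch(
            revenue=stats.get("first_purchase_revenue"),
            orders=stats.get("first_purchase_orders"),
            roas=0.0,  # sentinel: incremental not computable here
            cac=0.0,   # sentinel for first_purchase only
        )
    return SplitSketch(incremental_revenue, experiment_orders, incremental_roas, incremental_cac)
```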
Trap: the roas = 0.0 is a sentinel meaning "we have no holdout signal," not "the campaign earned zero." Consumers that need to distinguish the two should check mailers_sent == 0 or look for populated uplift / win_probability. A naive "is ROAS bad?" check will mis-grade these as BAD.
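One way a consumer might apply the check described in the trap, for EXPERIMENT-classified metrics only. The helper below is hypothetical and does not exist in the codebase; it is not appropriate for RAW campaigns, where uplift and win_probability are always unused.

```python
from typing import Optional


def looks_like_fallback_roas(
    roas: Optional[float],
    mailers_sent: int,
    uplift: Optional[float],
    win_probability: Optional[float],
) -> bool:
    """True when a 0.0 ROAS is probably the sentinel rather than a measured zero."""
    if roas != 0.0:
        return False
    no_volume_in_window = mailers_sent == 0
    no_holdout_signal = uplift is None and win_probability is None
    return no_volume_in_window or no_holdout_signal
```

A consumer could use this to downgrade its reading of the split from BAD to "not measurable in this window" before surfacing anything.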
is_empty() / empty_dict() are temporary¶
ComputedMetrics.is_empty() returns True when both splits have revenue, orders, roas, cac all None. The bulk route uses this to emit empty_dict() (same-shape zeros) so the web-vs-API divergence check can compare stable shapes.
Both methods are temporary — revert to None once divergence work is done. With the experiment fallback in place, is_empty() now only fires when there is genuinely no data.
Maturity routes, it doesn't decide¶
A JustLaunched BAD campaign still produces PAUSE. Maturity only gates who sees the decision (Slack/CSM vs. recommendation row vs. nothing). Don't add maturity branches inside decide_campaign — the routing belongs at the consumer.
Why: decisions are facts about the metric; whether to act on a noisy fact is a separate concern that depends on the consumer's tolerance for false positives.
SCALE is for experiments only; promoted GOOD is MONITOR¶
SCALE means "promote this experiment to full audience." A campaign that's already PROMOTED has, by definition, already been scaled — there's nothing left to scale. So a GOOD PROMOTED maps to MONITOR, not SCALE.
If you ever see a SCALE recommendation against a PROMOTED campaign, something is wrong upstream.
NEUTRAL is intentionally NO_ACTION¶
ROAS in [1.0, 2.0) is the "fine, but not exciting" band. We deliberately do not surface these — surfacing them would be noise. If product wants to surface a "watch" state, add it as a new Decision value rather than re-mapping NEUTRAL.
basis is provenance, not a grading input¶
basis ("raw" / "incremental(holdout)" / "modeled(prior_experiment)") tells MCP / reporting where the headline numbers came from. It is purely descriptive. Do not branch grading on it — the classifier already encodes the same information in a way that's safe to switch on.
Use orchestration helpers for promoted campaigns¶
Promoted classification requires a campaign_associations lookup to find the parent. Calling compute_metrics directly will miss this and silently classify a promoted campaign as RAW or EXPERIMENT.
Always go through app/methods/computed_metrics.py for any code path that might see promoted campaigns (i.e. anything that reads metrics for an arbitrary campaign).
Known limitations & open work¶
Things a reader needs to know about current code that aren't obvious from the field tables.
Parent snapshot is re-aggregated on every read¶
For PROMOTED, "parent lifetime experiment_results" is computed on each request rather than snapshotted at promotion time. A late-arriving order attributed to a parent silently changes every active promoted child's modeled revenue and grade. Combined with the bright-line thresholds (roas >= 2.0 → GOOD, < 1.0 → BAD), small drifts can flip grades overnight with no audit trail.
The runtime contract (rate source must be a PriorExperimentSnapshot, not a windowed StatsResponse) is in place; the persistence of that snapshot at promotion time is not. PriorExperimentSnapshot.snapshot_taken_at is reserved for this but currently unused.
CAC sentinel: 0.0 means "denominator too small to trust"¶
incremental_customer_acquisition_cost = spend / incremental_orders blows up when incremental_orders < MIN_INCREMENTAL_ORDERS_FOR_CAC (currently 1.0). safe_cac (app/methods/insights/metric_derivation/helpers.py) returns 0.0 in that case as a stable-shape sentinel — matching the experiment-fallback pattern — and grade_metric requires cac > 0 to engage the GOOD or BAD branch on CAC.
This is a parseability shim until the web client tolerates None in CAC. Touch points are tagged with TODO comments. Same caveat as experiment fallback: a naive "is CAC zero?" check will mis-read these.
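The sentinel behavior can be sketched as follows. The signature is an assumption and may not match the real safe_cac in helpers.py; the second helper is hypothetical and only illustrates the "cac > 0" condition that grade_metric is described as applying.

```python
from typing import Optional

MIN_INCREMENTAL_ORDERS_FOR_CAC = 1.0  # value quoted above; local copy for illustration


def safe_cac_sketch(spend: float, incremental_orders: float) -> float:
    # Below the minimum, spend / orders is too unstable to report.
    if incremental_orders < MIN_INCREMENTAL_ORDERS_FOR_CAC:
        return 0.0  # sentinel, not a real cost of zero
    return spend / incremental_orders


def cac_engages_grading(cac: Optional[float]) -> bool:
    # Hypothetical helper: a None or sentinel 0.0 CAC is ignored by the CAC branches.
    return cac is not None and cac > 0
```

With $1,000 of spend, 0.2 incremental orders yields the 0.0 sentinel rather than a CAC of $5,000, while 25 incremental orders yields a CAC of $40.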
Maturity is time-only; volume gate lives elsewhere¶
CampaignMaturity.derive is purely a function of weeks_live, even though the Maturity section above describes it as (weeks_live, mailers_sent)-aware. A separate MIN_MAILERS_FOR_RECOMMENDATION = 500 lives in app/methods/recommendations.py and gates whether _evaluate_campaign writes a Recommendation row, but it is invisible to decide_campaign and to should_notify. If you tune maturity, tune both, or fold the volume gate into CampaignMaturity.derive and delete the orphan.
empty_dict() / is_empty() shim¶
See Appendix → is_empty. The bulk reporting route emits empty_dict() (same-shape zeros) instead of null so a web-vs-API divergence checker can compare like shapes. Revert to None once that work completes — with the experiment fallback in place, is_empty() now only fires when there is genuinely no data.