Computed Metrics Grading¶

The single source of truth for "is this campaign good, bad, or neither?" across MCP, reporting, insights, and recommendations. There is no parallel grading path — every consumer bottoms out here.

MCP feed. The MCP serves ComputedMetrics.to_dict() as the canonical campaign payload (get_full_campaign_stats, get_org_campaign_metrics, and per-campaign on get_full_org_stats) — not the raw cached StatsResponse. There is no separate "MCP ComputedMetrics"; the MCP is one consumer of this model. See app/mcp/schema_docs.py for the agent-facing contract and reporting.md for the cache it reads.

Reading order: skim TL;DR → scan the section you need → jump to the Appendix for the gnarly bits (windowing, fallback semantics, promoted volume/rate). Each gotcha is anchored so other sections can link straight to it.

TL;DR¶

Derive a ComputedMetrics from a StatsResponse, per-classifier (RAW / EXPERIMENT / PROMOTED).
Grade each of its two splits (first_purchase, lifetime) with grade_metric → GOOD | NEUTRAL | BAD | UNKNOWN.
Decide with decide_campaign(metrics) → PAUSE | SCALE | MONITOR | NO_ACTION. Pure function of grade × classifier.
Route the decision per-consumer using maturity (JustLaunched / Early / Mature). Maturity does NOT change the decision — only who sees it. (why)

Anchor files¶

Concept	File
`ComputedMetrics`, `MetricGrade`, `CampaignClassifier`, `CampaignMaturity`, thresholds, `grade_metric`	`app/core/models/insights/computed_metrics.py`
`decide_campaign`, per-grader `should_notify`, headline/criteria/priority strings	`app/core/models/insights/grader.py`
Stats → `ComputedMetrics` derivation	`app/methods/insights/metric_derivation/`
Promoted-parent wiring on top of derivation	`app/methods/computed_metrics.py`

The shape¶

ComputedMetrics
├── classifier:        RAW | EXPERIMENT | PROMOTED
├── spend, weeks_live, mailers_sent
├── first_purchase:    MetricSplit + first_purchase_grade: MetricGrade
├── lifetime:          MetricSplit + lifetime_grade: MetricGrade
├── maturity:          (derived) JustLaunched | Early | Mature   ← routing metadata, not a grade input
└── basis:             provenance string for MCP/reporting consumers

Each MetricSplit holds revenue, roas, cac, orders, optional uplift, optional win_probability. The two splits are graded independently. basis is provenance only — see Appendix → basis is not a grading input.
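As a rough picture of the tree above, the shape can be sketched as two dataclasses. The field types and defaults here are assumptions for illustration; the real definitions live in app/core/models/insights/computed_metrics.py and may differ. The derived maturity field is left out of the sketch.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class MetricGrade(Enum):
    GOOD = "GOOD"
    NEUTRAL = "NEUTRAL"
    BAD = "BAD"
    UNKNOWN = "UNKNOWN"


class CampaignClassifier(Enum):
    RAW = "RAW"
    EXPERIMENT = "EXPERIMENT"
    PROMOTED = "PROMOTED"


@dataclass
class MetricSplit:
    # Types are assumed; uplift and win_probability are the optional fields.
    revenue: Optional[float] = None
    roas: Optional[float] = None
    cac: Optional[float] = None
    orders: Optional[float] = None
    uplift: Optional[float] = None
    win_probability: Optional[float] = None


@dataclass
class ComputedMetrics:
    classifier: CampaignClassifier
    spend: float
    weeks_live: float
    mailers_sent: int
    basis: str  # provenance only, never a grading input
    first_purchase: MetricSplit = field(default_factory=MetricSplit)
    lifetime: MetricSplit = field(default_factory=MetricSplit)
    # The two splits are graded independently of each other.
    first_purchase_grade: MetricGrade = MetricGrade.UNKNOWN
    lifetime_grade: MetricGrade = MetricGrade.UNKNOWN


# Illustrative values only.
example = ComputedMetrics(
    classifier=CampaignClassifier.RAW,
    spend=1000.0,
    weeks_live=6,
    mailers_sent=8000,
    basis="raw",
    lifetime=MetricSplit(revenue=2500.0, roas=2.5, orders=40),
    lifetime_grade=MetricGrade.GOOD,
)
```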

Classifier¶

Three classes, controlling which derivation runs and which decisions are eligible:

Classifier	Trigger	Headline source
`RAW`	No holdout, no prior experiment	observed `stats`
`EXPERIMENT`	`campaign.settings.holdout.enabled`	current holdout's `experiment_results`
`PROMOTED`	A `HoldoutToFull` association exists in `campaign_associations`	parent's lifetime `experiment_results`

Promoted classification requires a campaign_associations lookup — call through app/methods/computed_metrics.py, not compute_metrics directly. (why)
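The trigger table can be read as a small decision function. This is a sketch only: classify_campaign and its two boolean parameters are hypothetical names, and the precedence (association checked before the holdout flag) is an assumption inferred from the warning that a missed lookup yields RAW or EXPERIMENT.

```python
from enum import Enum


class CampaignClassifier(Enum):
    RAW = "RAW"
    EXPERIMENT = "EXPERIMENT"
    PROMOTED = "PROMOTED"


def classify_campaign(
    holdout_enabled: bool,
    has_holdout_to_full_association: bool,
) -> CampaignClassifier:
    """Hypothetical helper mirroring the trigger table."""
    # Assumption: a HoldoutToFull association wins over the holdout flag.
    if has_holdout_to_full_association:
        return CampaignClassifier.PROMOTED
    if holdout_enabled:
        return CampaignClassifier.EXPERIMENT
    return CampaignClassifier.RAW


# Skipping the association lookup is equivalent to passing False here,
# which is how a promoted campaign ends up misclassified.
misclassified = classify_campaign(
    holdout_enabled=False, has_holdout_to_full_association=False
)
```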

Derivation¶

ComputedMetrics is derived on every read — not stored. Same StatsResponse, different fields per classifier.

graph LR
  AGG[(campaign_order_aggregates_hourly)] -->|fetch_hourly_aggregates<br/>± date window| HA[hourly rows]
  RCP[(campaign_recipients)] -->|fetch_recipient_totals_by_campaign<br/>± date window| RT[recipient totals]
  HA --> SUM[sum_aggregates_to_stats] --> SDICT[stats]
  RT -->|mailers_sent, total_cost| DERIV[_compute_derived_metrics]
  SDICT --> DERIV --> SDICT2[stats + roas/cpa]
  HA -->|holdout rows| EXPR[_compute_experiment_results]
  RT -->|Holdout count| EXPR --> ER[experiment_results]
  SDICT2 --> CLASSIFY[classify]
  ER --> CLASSIFY
  CLASSIFY -->|RAW| DR[derive_raw]
  CLASSIFY -->|EXPERIMENT| DE[derive_experiment]
  CLASSIFY -->|PROMOTED| DP[derive_promoted]
  DR & DE & DP --> CM[ComputedMetrics]

Field → source mapping¶

For PROMOTED, fields split into volume (from the child's stats) and rate (from the parent's lifetime experiment snapshot). This split is load-bearing — see Appendix → Promoted: volume vs rate.

Field	RAW	EXPERIMENT	PROMOTED
`spend`, `weeks_live`, `mailers_sent`	`stats.*`	`stats.*`	child `stats.*`
`first_purchase.revenue`	`stats.first_purchase_revenue`	`experiment_results[first_order].incremental_revenue` (fallback: `stats.first_purchase_revenue`)	`child_spend × parent_first_purchase_roas`
`first_purchase.orders`	`stats.first_purchase_orders`	`experiment_results[first_order].orders` (fallback: `stats.first_purchase_orders`)	child `stats.first_purchase_orders`
`first_purchase.roas`	`stats.first_purchase_roas`	`experiment_results[first_order].incremental_roas` (fallback: `0.0`)	parent `experiment_results[first_order].incremental_roas`
`first_purchase.cac`	`spend / orders` (or `None`)	`experiment_results[first_order].incremental_customer_acquisition_cost` (fallback: `0.0`)	parent value, gated on parent's incremental orders
`lifetime.revenue`	`stats.revenue ‖ stats.campaign_revenue`	`experiment_results[all_orders].incremental_revenue` (fallback: same as RAW)	`child_spend × parent_lifetime_roas`
`lifetime.orders`	`stats.campaign_orders ‖ stats.orders`	`experiment_results[all_orders].orders` (fallback: same as RAW)	child `stats.campaign_orders`
`lifetime.roas`	`stats.all_time_roas ‖ stats.roas`	`experiment_results[all_orders].incremental_roas` (fallback: `0.0`)	parent `experiment_results[all_orders].incremental_roas`
`lifetime.cac`	`None`	`None`	`None`
`uplift`, `win_probability`	unused	`experiment_results[*].metric_uplift` / `.win_probability` (no fallback)	parent values
`basis`	`"raw"`	`"incremental(holdout)"`	`"modeled(prior_experiment)"`

The fallback rows for EXPERIMENT and the windowing-induced zeros have surprising downstream effects — see Appendix → Experiment fallback and Appendix → Date windowing.

Maturity (routing, not grading)¶

JustLaunched : weeks_live <= 4  OR  mailers_sent < 500
Early        : weeks_live <= 10 OR  mailers_sent < 5000
Mature       : weeks_live > 10  AND mailers_sent >= 5000

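A long-running campaign with few mailers is not mature: volume matters. The rule is tunable in CampaignMaturity.derive, but the current implementation is time-only and does not yet apply the mailers_sent half; see Known limitations → Maturity is time-only.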
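A minimal sketch of the rule as written above, evaluated top to bottom. The function name derive_maturity and the enum member names are assumptions; this follows the documented rule, not necessarily what CampaignMaturity.derive does today.

```python
from enum import Enum


class CampaignMaturity(Enum):
    JUST_LAUNCHED = "JustLaunched"
    EARLY = "Early"
    MATURE = "Mature"


def derive_maturity(weeks_live: float, mailers_sent: int) -> CampaignMaturity:
    """Hypothetical stand-in for the documented maturity rule."""
    # Either too young or too little volume keeps a campaign in the lower tier.
    if weeks_live <= 4 or mailers_sent < 500:
        return CampaignMaturity.JUST_LAUNCHED
    if weeks_live <= 10 or mailers_sent < 5000:
        return CampaignMaturity.EARLY
    # Only reached when weeks_live > 10 AND mailers_sent >= 5000.
    return CampaignMaturity.MATURE
```

With this ordering, a campaign at 14 weeks and 3,000 mailers lands in Early rather than Mature, which is the "volume matters" case.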

Consumer	Maturity gate
`decide_campaign`	none (grade × classifier only)
`CampaignGrader.should_notify` (Slack, CSM insights)	`Early`+
`_evaluate_campaign` (writes `Recommendation` row)	`Mature` only
`CampaignGrader.priority`	`Mature` → `HIGH`; `Early` + `PAUSE` → `MEDIUM`; else `LOW`

Maturity is not a decision input — see Appendix → Maturity routes, it doesn't decide.

Grading a split¶

grade_metric(roas, cac, weeks_live, has_orders) → MetricGrade is the only grading function. Runs once per split.

UNKNOWN   weeks_live < 1  OR  not has_orders  OR  roas is None
GOOD      roas >= 2.0    AND  (cac is None or cac <= 40)
BAD       roas < 1.0     OR   cac > 180
NEUTRAL   everything else

Threshold	Value
`GOOD_ROAS_THRESHOLD`	`2.0`
`GOOD_CAC_THRESHOLD`	`40.0`
`BAD_ROAS_THRESHOLD`	`1.0`
`BAD_CAC_THRESHOLD`	`180.0`

All four live in computed_metrics.py. Do not redefine elsewhere.
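The rule block and threshold table above translate fairly directly into code. The following is a sketch of those rules under the stated thresholds, checked in the order listed; it is not a copy of the implementation in computed_metrics.py.

```python
from enum import Enum
from typing import Optional

GOOD_ROAS_THRESHOLD = 2.0
GOOD_CAC_THRESHOLD = 40.0
BAD_ROAS_THRESHOLD = 1.0
BAD_CAC_THRESHOLD = 180.0


class MetricGrade(Enum):
    GOOD = "GOOD"
    NEUTRAL = "NEUTRAL"
    BAD = "BAD"
    UNKNOWN = "UNKNOWN"


def grade_metric(
    roas: Optional[float],
    cac: Optional[float],
    weeks_live: float,
    has_orders: bool,
) -> MetricGrade:
    # UNKNOWN: too young, no orders, or no ROAS to grade.
    if weeks_live < 1 or not has_orders or roas is None:
        return MetricGrade.UNKNOWN
    # GOOD: strong ROAS, and CAC either absent or within the good bound.
    if roas >= GOOD_ROAS_THRESHOLD and (cac is None or cac <= GOOD_CAC_THRESHOLD):
        return MetricGrade.GOOD
    # BAD: ROAS under water, or CAC beyond the bad bound.
    if roas < BAD_ROAS_THRESHOLD or (cac is not None and cac > BAD_CAC_THRESHOLD):
        return MetricGrade.BAD
    return MetricGrade.NEUTRAL
```

Note that a strong ROAS does not rescue a very high CAC: grade_metric(2.5, 200.0, 5, True) falls through the GOOD check and grades BAD.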

Decision¶

decide_campaign(metrics) → Decision is pure grade × classifier:

flowchart TD
  start([ComputedMetrics]) --> grade{lifetime_grade}
  grade -->|BAD| pause[PAUSE]
  grade -->|GOOD| cls{classifier}
  grade -->|NEUTRAL / UNKNOWN| noop[NO_ACTION]
  cls -->|EXPERIMENT| scale[SCALE]
  cls -->|RAW or PROMOTED| monitor[MONITOR]

Three things that catch people:

PAUSE fires the moment lifetime_grade == BAD, regardless of maturity. Maturity only gates surfacing.
SCALE is reserved for live experiments graded GOOD. A doing-well promoted campaign is MONITOR — the scale decision already happened. (why)
NEUTRAL (ROAS in [1.0, 2.0)) is intentionally NO_ACTION. (why)
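The flowchart reduces to a few lines. In this sketch, decide is a hypothetical simplification that takes the lifetime grade and classifier directly, whereas decide_campaign takes the whole ComputedMetrics; the enum definitions are repeated so the snippet runs on its own.

```python
from enum import Enum


class MetricGrade(Enum):
    GOOD = "GOOD"
    NEUTRAL = "NEUTRAL"
    BAD = "BAD"
    UNKNOWN = "UNKNOWN"


class CampaignClassifier(Enum):
    RAW = "RAW"
    EXPERIMENT = "EXPERIMENT"
    PROMOTED = "PROMOTED"


class Decision(Enum):
    PAUSE = "PAUSE"
    SCALE = "SCALE"
    MONITOR = "MONITOR"
    NO_ACTION = "NO_ACTION"


def decide(lifetime_grade: MetricGrade, classifier: CampaignClassifier) -> Decision:
    """Pure function of grade x classifier; maturity is deliberately absent."""
    if lifetime_grade is MetricGrade.BAD:
        return Decision.PAUSE
    if lifetime_grade is MetricGrade.GOOD:
        # Only a live experiment can still be scaled.
        if classifier is CampaignClassifier.EXPERIMENT:
            return Decision.SCALE
        return Decision.MONITOR
    # NEUTRAL and UNKNOWN are intentionally not surfaced.
    return Decision.NO_ACTION
```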

Where the decision is used¶

Caller	Behavior
`campaign_grader.py`	Wraps in `InsightData`; `should_notify` requires `Early`+ AND non-`NO_ACTION`.
`recommendations.py::_evaluate_campaign`	`SCALE` → `ScaleExperiment` rec, `PAUSE` → `PauseCampaign` rec. Requires `Mature`.
`CampaignGrader` (in `grader.py`)	Headline/criteria/priority strings (CSM-facing).
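The per-consumer gates can be sketched as three small functions layered on top of the decision. The function names below (other than should_notify) and their signatures are assumptions for illustration; the real gates live next to each consumer, and this sketch ignores any additional gate such as a minimum mailer count.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    PAUSE = "PAUSE"
    SCALE = "SCALE"
    MONITOR = "MONITOR"
    NO_ACTION = "NO_ACTION"


class CampaignMaturity(Enum):
    JUST_LAUNCHED = "JustLaunched"
    EARLY = "Early"
    MATURE = "Mature"


def should_notify(decision: Decision, maturity: CampaignMaturity) -> bool:
    # Slack / CSM insights: Early or later, and something to say.
    return (
        maturity is not CampaignMaturity.JUST_LAUNCHED
        and decision is not Decision.NO_ACTION
    )


def recommendation_for(decision: Decision, maturity: CampaignMaturity) -> Optional[str]:
    # Recommendation rows are written for Mature campaigns only.
    if maturity is not CampaignMaturity.MATURE:
        return None
    return {
        Decision.SCALE: "ScaleExperiment",
        Decision.PAUSE: "PauseCampaign",
    }.get(decision)


def priority(decision: Decision, maturity: CampaignMaturity) -> str:
    if maturity is CampaignMaturity.MATURE:
        return "HIGH"
    if maturity is CampaignMaturity.EARLY and decision is Decision.PAUSE:
        return "MEDIUM"
    return "LOW"
```

The decision itself is never altered here; each function only answers whether a given consumer should see it.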

Worked examples¶

Scenario	weeks_live	mailers_sent	classifier	lifetime ROAS	decide	should_notify	Rec written?
Brand-new send, no orders	2	800	RAW	n/a (UNKNOWN)	NO_ACTION	No	No
JustLaunched, BAD	2	800	RAW	0.4	PAUSE	No (maturity)	No
Live experiment crushing it (Early)	6	8000	EXPERIMENT	3.1	SCALE	Yes	No (needs Mature)
Same experiment, mature	12	20000	EXPERIMENT	3.1	SCALE	Yes	Yes — `ScaleExperiment`
Mature automation, mid-band	14	18000	RAW	1.4	NO_ACTION	No	No
Mature automation, underwater	14	18000	RAW	0.7	PAUSE	Yes	Yes — `PauseCampaign`
Promoted send, doing well	12	20000	PROMOTED	2.5	MONITOR	Yes	No

Tuning¶

Knobs:

Four numeric thresholds, the maturity rule, the decision tree: computed_metrics.py / grader.py.
Per-consumer routing gates: next to each consumer (should_notify, _evaluate_campaign's Mature check).

Tune in place. Do not introduce a parallel grading path. If per-org tuning becomes necessary, extend grade_metric / decide_campaign to take an OrgConfig-like argument rather than branching at call sites.

Appendix: Gotchas & things to know¶

Stuff that looks right but isn't, plus the design choices that aren't obvious from the field tables. Each entry has a stable anchor so the sections above can link straight in.

Promoted: volume vs rate¶

A promoted campaign carries forward the rate (ROAS, CAC, uplift, win_probability) that its parent's holdout experiment measured, and multiplies it against the volume (spend, mailers) the child has actually shipped.

modeled_revenue = child_windowed_spend × parent_lifetime_roas

Two roles, two sources:

Volume → child stats. Windowable. Narrows with start_date / end_date.
Rate → parent's lifetime experiment_results. Never windowed.

Mixing these is a silent data-quality bug: a date-windowed rate projected onto live volume is meaningless. The orchestration layer enforces the split in two ways. It fetches parent stats through fetch_parent_stats_for_promoted, and derive_promoted accepts only a PriorExperimentSnapshot, so a windowed StatsResponse won't type-check as a rate source. Persisting the snapshot at promotion time is still open; see Known limitations & open work.

Stationarity assumption. Using a lifetime parent ROAS assumes that rate is a roughly stable property of the audience × creative. If a parent experiment is old and the audience has shifted, the modeled revenue will drift from reality. There is no age-out for parent rates today; consider adding one if this drift becomes a recurring complaint.

Example. Child has spent $10k this month; parent's lifetime incremental ROAS is 2.4. Modeled revenue = $10k × 2.4 = $24k, even though the child has no holdout of its own.
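The volume/rate split can be illustrated with the numbers from the example. In this sketch the two fields on PriorExperimentSnapshot and the helper modeled_revenue are hypothetical, and the first-purchase ROAS of 1.5 is made up for illustration; only the formula itself comes from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PriorExperimentSnapshot:
    # Hypothetical fields: lifetime rates measured by the parent's holdout.
    first_purchase_roas: float
    lifetime_roas: float


def modeled_revenue(child_windowed_spend: float, parent_roas: float) -> float:
    # Volume comes from the child (windowable); rate comes from the parent (never windowed).
    return child_windowed_spend * parent_roas


parent = PriorExperimentSnapshot(first_purchase_roas=1.5, lifetime_roas=2.4)

# Full month of child spend.
full_month = modeled_revenue(10_000.0, parent.lifetime_roas)

# Narrowing the date window halves the spend; the rate is unchanged,
# so modeled revenue narrows linearly.
half_window = modeled_revenue(5_000.0, parent.lifetime_roas)
```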

Date windowing has cascading effects¶

start_date / end_date apply to two underlying queries:

Source	Date filter?	Affects
`fetch_hourly_aggregates`	yes, on `hour_bucket`	order-derived columns (`campaign_revenue`, `first_purchase_revenue`, …)
`fetch_recipient_totals_by_campaign`	yes, on `campaign_recipients.created_at`	`mailers_sent`, `total_cost`, holdout count

Campaigns send in bursts; orders trickle in over months. A typical "last N days" window excludes the recipient rows (dated at send time) while still capturing orders. Knock-on effects:

mailers_sent == 0 and total_cost == 0 → roas / first_purchase_roas / cpa coerced to 0.0 (divide-by-zero guard). Note that this is a coerced zero, not None, so it grades as BAD whenever has_orders is true and weeks_live >= 1.
Holdout recipient_count == 0 → experiment_results returns [], which triggers the experiment fallback.

Per-classifier effect under a date window:

Classifier	Behavior
RAW	Partial degradation: `revenue` / `orders` retain real values; `roas` lands at `0.0`.
EXPERIMENT	Falls back to observed `stats` when incremental can't compute.
PROMOTED	Volume narrows; rate stays lifetime; `modeled_revenue` narrows linearly with `spend`.

Example. Campaign sent 50,000 pieces in March; orders are still rolling in. Querying "last 30 days" in May returns revenue from late-arriving orders, but mailers_sent = 0, so ROAS reports 0.0.
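The coerced zero in this example comes from a divide-by-zero guard of roughly the following shape. The helper name ratio_or_zero and the dollar amounts are assumptions for illustration, not the actual code in _compute_derived_metrics.

```python
def ratio_or_zero(numerator: float, denominator: float) -> float:
    """Divide-by-zero guard: returns 0.0, not None, when the denominator is 0."""
    return numerator / denominator if denominator else 0.0


# "Last 30 days" queried in May for a campaign mailed in March:
# late-arriving orders fall inside the window, recipient rows do not.
windowed_revenue = 4_200.0    # illustrative
windowed_total_cost = 0.0     # no recipient rows in the window

windowed_roas = ratio_or_zero(windowed_revenue, windowed_total_cost)

# Over the campaign's lifetime the cost is present, so the same guard
# returns an ordinary ratio (figures again illustrative).
lifetime_roas = ratio_or_zero(60_000.0, 30_000.0)
```

Because windowed_roas is 0.0 rather than None, a grader that sees orders in the window has no way to tell this apart from a campaign that genuinely earned nothing.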

Experiment fallback returns `roas = 0.0`, not `None`¶

When experiment_results can't compute incremental (e.g. zero holdout in window), derive_experiment falls back to observed values from stats:

incremental_revenue is None AND experiment_orders is None
  → revenue ← stats.first_purchase_revenue (or stats.revenue ‖ campaign_revenue for lifetime)
  → orders  ← stats.first_purchase_orders  (or stats.campaign_orders ‖ orders for lifetime)
  → roas    ← 0.0   (stable-shape signal: "incremental not computable here")
  → first_purchase.cac ← 0.0   (lifetime.cac stays None as always)
  → uplift / win_probability stay None
  → has_orders is OR-ed with has_positive_count(fallback orders) so grading sees them

When incremental is computable, nothing changes.

Trap: the roas = 0.0 is a sentinel meaning "we have no holdout signal," not "the campaign earned zero." Consumers that need to distinguish the two should check mailers_sent == 0 or look for populated uplift / win_probability. A naive "is ROAS bad?" check will mis-grade these as BAD.
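A consumer that needs to tell the sentinel from a real zero could apply the two checks suggested above. This is a hedged sketch: is_fallback_sentinel is a hypothetical helper, not something the codebase is known to provide, and it treats either signal as sufficient.

```python
from typing import Optional


def is_fallback_sentinel(
    roas: Optional[float],
    uplift: Optional[float],
    win_probability: Optional[float],
    mailers_sent: int,
) -> bool:
    """Hypothetical check: does roas == 0.0 mean 'no holdout signal'?"""
    if roas != 0.0:
        return False
    # Fallback leaves uplift / win_probability unpopulated.
    no_experiment_signal = uplift is None and win_probability is None
    # A window with no recipient rows also produces the coerced zero.
    return mailers_sent == 0 or no_experiment_signal


# Fallback case: observed orders exist, but the holdout was empty in the window.
fallback = is_fallback_sentinel(0.0, None, None, mailers_sent=0)

# A computed experiment result that really did earn nothing incremental.
real_zero = is_fallback_sentinel(0.0, -0.02, 0.31, mailers_sent=8000)
```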

`is_empty()` / `empty_dict()` are temporary¶

ComputedMetrics.is_empty() returns True when both splits have revenue, orders, roas, cac all None. The bulk route uses this to emit empty_dict() (same-shape zeros) so the web-vs-API divergence check can compare stable shapes.

Both methods are temporary: revert to emitting None once the divergence work is done. With the experiment fallback in place, is_empty() now only fires when there is genuinely no data.

Maturity routes, it doesn't decide¶

A JustLaunched BAD campaign still produces PAUSE. Maturity only gates who sees the decision (Slack/CSM vs. recommendation row vs. nothing). Don't add maturity branches inside decide_campaign — the routing belongs at the consumer.

Why: decisions are facts about the metric; whether to act on a noisy fact is a separate concern that depends on the consumer's tolerance for false positives.

SCALE is for experiments only; promoted GOOD is MONITOR¶

SCALE means "promote this experiment to full audience." A campaign that's already PROMOTED has, by definition, already been scaled — there's nothing left to scale. So a GOOD PROMOTED maps to MONITOR, not SCALE.

If you ever see a SCALE recommendation against a PROMOTED campaign, something is wrong upstream.

NEUTRAL is intentionally `NO_ACTION`¶

ROAS in [1.0, 2.0) is the "fine, but not exciting" band. We deliberately do not surface these — surfacing them would be noise. If product wants to surface a "watch" state, add it as a new Decision value rather than re-mapping NEUTRAL.

`basis` is provenance, not a grading input¶

basis ("raw" / "incremental(holdout)" / "modeled(prior_experiment)") tells MCP / reporting where the headline numbers came from. It is purely descriptive. Do not branch grading on it — the classifier already encodes the same information in a way that's safe to switch on.

Use orchestration helpers for promoted campaigns¶

Promoted classification requires a campaign_associations lookup to find the parent. Calling compute_metrics directly will miss this and silently classify a promoted campaign as RAW or EXPERIMENT.

Always go through app/methods/computed_metrics.py for any code path that might see promoted campaigns (i.e. anything that reads metrics for an arbitrary campaign).

Known limitations & open work¶

Things a reader needs to know about current code that aren't obvious from the field tables.

Parent snapshot is re-aggregated on every read¶

For PROMOTED, "parent lifetime experiment_results" is computed on each request rather than snapshotted at promotion time. A late-arriving order attributed to a parent silently changes every active promoted child's modeled revenue and grade. Combined with the bright-line thresholds (roas >= 2.0 → GOOD, < 1.0 → BAD), small drifts can flip grades overnight with no audit trail.

The runtime contract (rate source must be a PriorExperimentSnapshot, not a windowed StatsResponse) is in place; the persistence of that snapshot at promotion time is not. PriorExperimentSnapshot.snapshot_taken_at is reserved for this but currently unused.

CAC sentinel: `0.0` means "denominator too small to trust"¶

incremental_customer_acquisition_cost = spend / incremental_orders blows up when incremental_orders < MIN_INCREMENTAL_ORDERS_FOR_CAC (currently 1.0). safe_cac (app/methods/insights/metric_derivation/helpers.py) returns 0.0 in that case as a stable-shape sentinel — matching the experiment-fallback pattern — and grade_metric requires cac > 0 to engage the GOOD or BAD branch on CAC.

This is a parseability shim until the web client tolerates None in CAC. Touch points are tagged with TODO comments. Same caveat as experiment fallback: a naive "is CAC zero?" check will mis-read these.
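The sentinel behaviour can be sketched as follows. The signature and None handling are assumptions; the real safe_cac in app/methods/insights/metric_derivation/helpers.py may differ.

```python
from typing import Optional

MIN_INCREMENTAL_ORDERS_FOR_CAC = 1.0


def safe_cac(spend: float, incremental_orders: Optional[float]) -> float:
    """Assumed shape: return the 0.0 sentinel when the denominator is too small."""
    if incremental_orders is None or incremental_orders < MIN_INCREMENTAL_ORDERS_FOR_CAC:
        # Stable-shape sentinel, not a real cost of zero.
        return 0.0
    return spend / incremental_orders


# 0.3 incremental orders would imply a CAC above 3,000, so the sentinel is returned.
too_few_orders = safe_cac(1_000.0, 0.3)

# With enough incremental orders the ordinary ratio applies.
trusted = safe_cac(1_000.0, 25.0)
```

A reader of the payload should therefore treat cac == 0.0 as "not computable", never as "free customers".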

Maturity is time-only; volume gate lives elsewhere¶

CampaignMaturity.derive is purely a function of weeks_live, even though the rule under Maturity (routing, not grading) describes it as (weeks_live, mailers_sent)-aware. A separate MIN_MAILERS_FOR_RECOMMENDATION = 500 lives in app/methods/recommendations.py and gates whether _evaluate_campaign writes a Recommendation row, but it is invisible to decide_campaign and to should_notify. If you tune maturity, tune both, or fold the volume gate into CampaignMaturity.derive and delete the orphan.

`empty_dict()` / `is_empty()` shim¶

See Appendix → is_empty. The bulk reporting route emits empty_dict() (same-shape zeros) instead of null so a web-vs-API divergence checker can compare like shapes. Revert to emitting None once that work completes. With the experiment fallback in place, is_empty() now only fires when there is genuinely no data.