Skip to main content

Experiment Statistics

Flagsmith's statistics engine answers three questions about an experiment: Am I winning? (is the variant better than control), by how much? (the lift), and can I trust it? (is the difference real, and was traffic split fairly). This page explains the terms you'll see, in plain language — no statistics background needed.

Coming soon — Enterprise beta

Experiment statistics aren't generally available yet. They are launching as a beta on Enterprise plans — to join, get in touch. Everything described on this page is part of that upcoming release; it previews what the feature will do.

Terms you'll see

Experiment — A controlled comparison: show different versions of a feature to different people, measure which performs better.

Variant — One version being compared. Each person sees exactly one.

Control — The "leave things as they are" variant — your current experience, the baseline everything is measured against (its key is the reserved value control).

Treatment — Any variant that isn't the control — the change you're testing.

Identity — One user, device, or account — the individual whose behaviour you measure.

Exposure — The moment an identity is shown a variant. The exposure count is how many people entered the experiment.

Conversion rate — The percentage of people who did the thing you care about (4.21% ≈ 4 in 100).

Metric — What you measure to judge success — checkout rate, revenue, page views.

Goal metric — A metric you want to improve (the reason you're running the experiment).

Guardrail metric — A metric you want to keep an eye on so the change doesn't make it worse.

Lift — How much better or worse a variant did than control, as a percentage. "+11%" means 11% better.

Credible interval — Our confidence range — the band we're 95% sure the true lift falls inside. Narrow = precise; wide = needs more data.

Chance to beat control (chance to win) — The probability a variant is genuinely better than control. "97%" reads exactly as it sounds.

Winning / losing / inconclusive — The verdict. Winning = over 95% chance to win. Losing = under 5%. Inconclusive = can't tell yet, keep collecting.

Sample ratio mismatch (SRM) — A health check that flags when traffic wasn't split the way you set it up. A broken split means the results can't be trusted.

Quarantined / excluded identity — Someone recorded in more than one variant. Set aside so they don't distort the counts, and shown as a separate total.

Collecting data — Not enough data yet to report a result, so numbers are withheld rather than shown when they'd be meaningless.

Last computed (as-of) — Results are computed periodically, not on every page load. This timestamp tells you how fresh the figures are.

Exposures

An identity is counted once, in the variant it saw first. Because exposures are deduplicated by identity and keep only the earliest timestamp, duplicate event delivery can't inflate your counts, and each identity lands in the time bucket of its first exposure.

If an identity is recorded against more than one variant (a "flicker", or bucketing that changed mid-flight), it's quarantined — excluded from every variant's count and surfaced as a single excluded-identities figure. A small number is normal; a growing one means users are slipping between variants.

The panel shows a headline total, a cumulative chart (one line per variant), a variant table (key, Control badge, identities, share %), the excluded note, and a "last computed" time.

Experiment results

For each metric, the scorecard reports how each variant did against control, using a Bayesian engine. Three numbers per variant:

Lift — relative change vs control. Control at 4.21%, variant at 4.68% → (4.68 − 4.21) / 4.21 ≈ +11%.

95% credible interval — the range we're 95% sure the true lift sits in, drawn as a bar centred on zero:

  • Clear of zero (e.g. +2% to +20%) → confident the variant is genuinely better (or, on the negative side, worse).
  • Crosses zero (e.g. −3% to +14%) → inconclusive; the effect could be anything. Collect more data.

Chance to beat control — the same belief as one number, e.g. 97%. Over 95%winning; under 5%losing; in between → inconclusive.

Why Bayesian?

"97% chance to beat control" means what it sounds like — unlike a p-value, which is routinely misread. It's also safe to check whenever you like: peeking doesn't inflate your error rate the way repeated p-value checks do.

Sample ratio mismatch (SRM)

Before trusting a result, you need traffic to have split the way you configured. If you set 50/50 but see 9,120 in control and 6,400 in the variant, something is broken — and if assignment is broken, every other number is suspect, because the groups aren't comparable.

Flagsmith compares the observed split against the configured one and raises a warning when the imbalance is a one-in-a-thousand event or rarer (it only checks once there are at least 100 identities). When it fires, don't act on the results — investigate one-variant crashes, redirects that bypass bucketing, flicker, or dropped events first.

Collecting data

Statistics on a handful of people are meaningless, so a metric shows collecting data until every arm has at least 50 identities (and, for conversion metrics, at least 5 conversions). This stops you reading a "+300% lift" off three conversions.

Metric types

AggregationWhat it measuresExample
OccurrenceDid the event happen at least once? (0 or 1)Did the user check out?
CountHow many times the event happened per identityNumber of page views
SumThe total of a numeric value across eventsTotal revenue
MeanThe average numeric value per identityAverage order value

A metric's expected direction (up, down, not-increase, not-decrease) tells Flagsmith which way is "good" and sorts metrics into goals and guardrails.

Summary

Nothing here is generally available yet — the table shows what the upcoming Enterprise beta will include.

CapabilityStatus
Exposures panel (counts, chart, share)In the first beta
First-exposure attribution & duplicate immunityIn the first beta
Quarantined (multi-variant) identitiesIn the first beta
Metric definitionsIn the first beta
Results scorecard: lift, credible intervalPlanned
Chance to beat control & winning/losing flagsPlanned
Sample ratio mismatch (SRM) checkPlanned
Collecting-data floorPlanned
Risk / decision banner / trend chartNot currently planned
Frequentist engineDeferred

For experiment setup — multivariate flags, bucketing, identities — see Experimentation (A/B Testing) and managing identities.