
What is production regression detection?

Production regression detection is the process of identifying degradations in real user experience after a deploy, fast enough to act before the impact compounds. The hardest regressions to catch are partial: they affect one endpoint, one customer segment, one region, or one feature flag arm without moving global metrics. Reliable detection requires comparing against the right baseline, watching for the right signals at the right granularity, and ruling out non-deploy causes before issuing a verdict.

Production regression detection is the discipline of noticing, in production, that the system is now doing worse at something it used to do well. The "worse" can be many things — slower, less reliable, less accurate, less complete — and the "used to" implies a comparison: against the system's recent past, against the same time last week, against a defined service-level objective. The detection only matters if it happens fast enough for the team to act before the impact compounds.

This is a narrower discipline than "monitoring" in general. A monitoring system that alerts on outages but does not catch a 2% degradation in checkout success is not doing regression detection. A system that catches the degradation hours after it began, when it has already affected thousands of users, is doing regression detection slowly. The combination of speed and granularity is what makes regression detection useful as a category distinct from generic monitoring.

What counts as a regression

A regression is a measurable degradation against a baseline. The two words that do the most work in that definition are measurable and baseline.

Measurable means the degradation has to surface in a signal the system actually collects. A regression that no telemetry can see is invisible. This is one reason why expanding observability coverage is a precondition for regression detection — without signal at the granularity where regressions occur, the detection problem is unsolvable.

Baseline means the comparison is relative, not absolute. A service running at a steady 0.5% error rate is healthy. The same service stepping to 1.5% error rate within four minutes of a deploy is exhibiting a regression — but a static alerting threshold set at 2% would never fire on it. The regression is the change relative to baseline, not the absolute level. Detection systems that rely on static thresholds catch the worst regressions and miss most of the rest.
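
As a rough illustration, that kind of baseline comparison can be sketched in a few lines. The 2x ratio and 0.5-point minimum delta below are illustrative assumptions, not recommended defaults:

```python
# Minimal sketch: flag a regression by comparing a post-deploy window
# against a pre-deploy baseline instead of a fixed threshold.
# The 2x ratio and 0.5-point minimum delta are illustrative, not defaults.

def error_rate(errors: int, requests: int) -> float:
    return errors / requests if requests else 0.0

def is_regression(baseline_errors: int, baseline_requests: int,
                  current_errors: int, current_requests: int,
                  min_ratio: float = 2.0, min_delta: float = 0.005) -> bool:
    baseline = error_rate(baseline_errors, baseline_requests)
    current = error_rate(current_errors, current_requests)
    # The check is relative to the baseline, so 0.5% -> 1.5% fires
    # even though a static 2% threshold never would.
    return current >= baseline * min_ratio and current - baseline >= min_delta

# 0.5% before the deploy, 1.5% in the minutes after it.
assert is_regression(50, 10_000, 150, 10_000)
```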

Two pragmatic categories of regression are worth distinguishing:

Behavioral regressions. The system is doing what it used to do, but less reliably. Some requests now return errors that previously succeeded. Some users now hit a code path that breaks. These manifest as changes in success rate, error rate, response code distribution, or business-event throughput.

Performance regressions. The system is producing the same outcomes, but more slowly or expensively. Latency increases. Resource consumption climbs. Database query times grow. These manifest as changes in latency percentiles, throughput, CPU and memory usage, and database performance metrics.

Both kinds matter. Both are visible in different signals, which is why regression detection cannot rely on a single metric or a single dashboard.

Why averages and global error rates miss most regressions

The default monitoring instinct is to watch a single global number: error rate, average latency, request count. The instinct is wrong, and the reason it is wrong is geometry.

A service that handles 100 different endpoints, 50 customer segments, three regions, and two feature-flag arms has hundreds of independent slices of behavior. A regression in one slice — say, a single endpoint affecting a single customer segment in a single region — represents a small fraction of total traffic. Even a severe regression in that slice (say, 50% error rate) might only move the global error rate from 0.5% to 0.6%. That movement is well within normal background variation. The alert never fires. The dashboard looks fine. The slice is broken.
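
The dilution arithmetic behind that example is easy to make explicit. The traffic share and error rates below are illustrative:

```python
# Sketch of the dilution arithmetic: a severe regression in a small slice
# barely moves the global error rate. Numbers are illustrative.

slice_share  = 0.002   # the broken slice carries 0.2% of total traffic
slice_before = 0.005   # 0.5% baseline error rate everywhere
slice_after  = 0.50    # the slice is now failing half its requests

global_before = slice_before
global_after  = slice_share * slice_after + (1 - slice_share) * slice_before

print(f"global error rate: {global_before:.3%} -> {global_after:.3%}")
# global error rate: 0.500% -> 0.599%
# A 100x degradation inside the slice shows up as a ~0.1 point move
# globally, well inside normal variation for most services.
```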

This is the structural problem with global signals. They aggregate too much. The information about which slice is degraded gets averaged out by all the other slices that are healthy.

Latency averages have the same problem in a different shape. A change that adds two seconds to the response time of 1% of requests increases average latency by about 20 milliseconds — well below the noise floor of normal traffic variation. The same change moves p99 latency by two seconds, which is dramatic. Averages mask tail regressions. Percentiles surface them.
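
A deterministic toy distribution makes the contrast concrete; real traffic is noisier, but the arithmetic is the same:

```python
# Sketch: the same tail regression seen through the average and through p99.
# Toy distribution with constant latencies; values are illustrative.
import statistics

baseline = [0.100] * 10_000                       # 10,000 requests at 100 ms
regressed = [0.100] * 9_900 + [2.100] * 100       # 1% now take an extra 2 s

def p99(xs):
    xs = sorted(xs)
    return xs[int(len(xs) * 0.99)]

print(f"mean: {statistics.mean(baseline)*1000:.0f} ms -> "
      f"{statistics.mean(regressed)*1000:.0f} ms")
# mean: 100 ms -> 120 ms  (a 20 ms shift, inside normal noise)
print(f"p99:  {p99(baseline)*1000:.0f} ms -> {p99(regressed)*1000:.0f} ms")
# p99:  100 ms -> 2100 ms (the tail regression is unmistakable)
```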

The general rule: the granularity at which regressions occur is finer than the granularity at which global signals are reported. Effective detection has to watch at the granularity where the regressions actually happen.

Partial regressions: endpoint, segment, region, flag

The most operationally important class of regression is the partial regression. It is the most common kind of bad deploy in modern microservice architectures, and the kind most likely to slip past static monitoring.

Per-endpoint regressions. A code change typically affects specific request handlers. The regression shows up on those endpoints first and may never propagate to the service-wide aggregate. Detection at the endpoint level — success rate per endpoint, latency per endpoint, error rate per endpoint — catches this class.

Per-segment regressions. Some changes interact poorly with specific customer data shapes. A pricing change might work for customers in the default tier but break for enterprise customers. A query change might be fine for the median tenant but cause full table scans for the largest. Per-customer or per-tenant slicing is what surfaces this. See What is per-customer observability?.

Per-region regressions. A deploy that rolls out unevenly across regions, or a change that interacts with regional data residency or local infrastructure, can produce a regression in one region only. The global aggregate looks fine; the affected region is on fire.

Per-flag regressions. A change behind a feature flag is, by design, only seeing a fraction of traffic. The regression is concentrated in the cohort exposed to the new code. Without slicing by flag arm, the regression hides in the rollout percentage.

Reliable detection means producing signals at each of these slice levels — endpoint, segment, region, flag — and comparing each slice against its own pre-deploy baseline. This is more telemetry, more granular indices, and more dimensions to slice on than most teams maintain by default. The investment in instrumentation is a precondition for the investment in detection.
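
As an illustration, slice-aware comparison is a per-slice version of the baseline check sketched earlier. The slice dimensions, data structures, and thresholds here are assumptions for the sketch, not a particular tool's schema:

```python
# Sketch: each (endpoint, segment, region, flag arm) slice is compared
# against its own pre-deploy baseline rather than a global aggregate.
from collections import namedtuple

Slice = namedtuple("Slice", "endpoint segment region flag_arm")

def find_regressed_slices(baseline: dict, current: dict,
                          min_ratio: float = 2.0, min_delta: float = 0.005):
    """baseline/current map Slice -> error rate for the comparison windows."""
    regressed = []
    for s, before in baseline.items():
        after = current.get(s, before)
        if after >= before * min_ratio and after - before >= min_delta:
            regressed.append((s, before, after))
    return regressed

baseline = {
    Slice("/checkout", "enterprise", "eu-west-1", "new_pricing"): 0.004,
    Slice("/checkout", "self_serve", "eu-west-1", "control"):     0.005,
}
current = {
    Slice("/checkout", "enterprise", "eu-west-1", "new_pricing"): 0.031,
    Slice("/checkout", "self_serve", "eu-west-1", "control"):     0.005,
}

for s, before, after in find_regressed_slices(baseline, current):
    print(f"{s}: {before:.1%} -> {after:.1%}")
# Only the enterprise / new_pricing arm is flagged; an aggregate across
# both slices would barely move.
```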

For example, Firetiger's Change Monitor evaluates the deploy across the slices the changed code is likely to touch — derived from the diff, the touched code paths, and the service's traffic patterns — rather than relying on global aggregates. When a regression is detected, the verdict identifies the affected slice explicitly: which endpoint, which segment, which region, which flag arm. That specificity is what makes the verdict actionable.

Distinguishing deploy-caused from environmental causes

Not every regression that appears after a deploy was caused by the deploy. Treating temporal proximity as proof of causation is the fastest way to lose credibility with the engineers consuming detection output.

The usual sources of false attribution:

Upstream provider issues. A third-party API the system depends on starts returning errors or slowing down. The application-layer symptoms look like a regression, but the cause is upstream. The deploy is innocent.

Regional network problems. A cloud region experiences degraded inter-zone networking. Services in that region see latency increases that look deploy-correlated only because the timing happens to line up.

Traffic-driven hot spots. A sudden traffic spike — a marketing campaign, a viral event, a competitor outage — pushes the system past a capacity threshold. The regression is real but the cause is load, not code.

Scheduled batch jobs. A nightly job runs at 02:00 and degrades shared database performance. If a deploy happened at 01:55, the dashboards will appear to blame the deploy.

Robust detection systems address this by triangulating: the change should plausibly explain the observed regression. A deploy that changed only authentication code is unlikely to be the cause of a database performance regression in a different service. A deploy that touched a query path that matches the slow query in production is much more likely to be the cause. The reasoning is not certain, but it is much better than pure temporal correlation.

A practical heuristic: when issuing a regression verdict, the detection system should be able to point at the specific code path or behavior in the change that could produce the observed symptom. If it cannot, the verdict is weaker, and should be marked as "regression observed, deploy correlation uncertain" rather than as a confident attribution. This honesty pays for itself the first time the alternative would have produced a wrong rollback.
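
One way to encode that heuristic is to downgrade the verdict whenever the degraded code path does not overlap with the paths the change touched. The field names and the overlap rule below are illustrative, not how any particular tool implements attribution:

```python
# Sketch of the attribution heuristic: a verdict is only marked as
# deploy-caused when some part of the change plausibly touches the
# degraded signal. Field names and the matching rule are illustrative.
from dataclasses import dataclass

@dataclass
class Verdict:
    regression_observed: bool
    deploy_correlated: bool
    summary: str

def attribute(degraded_paths: set[str], changed_paths: set[str]) -> Verdict:
    if not degraded_paths:
        return Verdict(False, False, "no regression observed")
    overlap = degraded_paths & changed_paths
    if overlap:
        return Verdict(True, True,
                       f"regression in {sorted(overlap)} matches the changed code paths")
    return Verdict(True, False,
                   "regression observed, deploy correlation uncertain")

# A deploy that only touched auth code is not blamed for a slow orders query.
print(attribute({"orders.query"}, {"auth.middleware"}).summary)
# -> regression observed, deploy correlation uncertain
```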

How detection connects to the rest of the workflow

Detection is the start of a workflow, not the end. A regression that the system detects but nobody acts on does no good. The detection must reach a human or an automated handler in time to matter, with enough context to act.

Three handoffs matter:

To the on-call engineer. The first signal usually goes to the engineer responsible for the affected service. The signal needs to include: what is degraded, how badly, against what baseline, when it started, what deploy is the most likely cause, and where to look first. A bare "anomaly detected" notification fails this test.

To the incident management surface. If the regression rises to incident severity, it should land in the incident management tool (PagerDuty, incident.io, Rootly) with the same context. The on-call engineer should not be reconstructing the picture by hand.

To the change author. The PR author often has the most context for the change and the fastest path to a fix or rollback. Posting the verdict back to the PR — with the affected signals, the suspected code path, and a recommendation — closes the loop in the most useful place.
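
A concrete way to think about "enough context to act" is the minimum payload each handoff should carry. The field names and values below are illustrative, not a specific tool's schema:

```python
# Sketch of the minimum context a detection signal should carry into each
# handoff: what degraded, how badly, against what baseline, when it started,
# the suspected deploy, and where to look first. All values are illustrative.
verdict = {
    "what":       "success rate on POST /checkout",
    "degraded":   {"baseline": "99.5%", "current": "98.5%"},
    "slice":      {"segment": "enterprise", "region": "eu-west-1",
                   "flag_arm": "new_pricing"},
    "started_at": "2024-05-14T09:42:00Z",
    "suspect":    {"deploy": "checkout-api 2024-05-14T09:40Z",
                   "code_path": "pricing/discounts.py"},
    "next_step":  "disable the new_pricing arm or revert; start with the discount query path",
}
```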

Where to start

  • Inventory the slices that matter. For each top service, list the endpoint groups, customer segments, regions, and feature flag arms where regressions are likely to concentrate. This list becomes the dimensional axes for slice-aware detection (a minimal sketch of such an inventory follows this list).
  • Switch from static thresholds to baseline comparisons. For the same services, replace fixed alerting thresholds with comparisons against a pre-deploy baseline window. This is the single most impactful change for catching subtle regressions.
  • Watch percentiles, not averages. Especially for latency, drop average latency in favor of p50, p95, and p99. Each percentile reveals a different class of regression.
  • Pilot a change-aware detection layer. A tool like Firetiger that evaluates each deploy against slice-level signals derived from the PR diff, and that distinguishes deploy-correlated regressions from environment-driven ones, can demonstrate the workflow end-to-end. See also What is bad deploy detection? and What is release verification?.
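
A slice inventory from the first step above can start as a simple checked-in mapping from service to dimensions. The service name and dimension values here are illustrative:

```python
# Sketch of a slice inventory for one service; each combination of these
# dimensions becomes a slice with its own pre-deploy baseline in the
# detection layer. All names and values are illustrative.
SLICE_INVENTORY = {
    "checkout-api": {
        "endpoint_groups": ["POST /checkout", "GET /cart", "POST /payment"],
        "segments":        ["self_serve", "enterprise"],
        "regions":         ["us-east-1", "eu-west-1"],
        "flag_arms":       ["new_pricing", "control"],
    },
}
```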

Firetiger uses AI agents to monitor production, investigate incidents, and optimize infrastructure — autonomously. Learn more about Firetiger, get started free, or install the Firetiger plugin for Claude or Cursor.