What is bad deploy detection?

Bad deploy detection is the discipline of identifying, after a release, that a specific production change caused a regression, and producing enough evidence to fix or roll back. It is distinct from observability (which shows symptoms), incident management (which coordinates response), and feature flags (which limit blast radius). The hard part is not seeing that something is wrong; the hard part is attributing the wrongness to the right change.

Done well, bad deploy detection answers four questions at once, with high confidence and within minutes of a release: something is wrong, the wrongness started after a deploy, this specific change is the most likely cause, and here is the evidence the team needs to fix or roll back. Done poorly, or not at all, it leaves teams chasing symptoms hours after the original change went out, often relying on customer complaints to surface the problem in the first place.

The reason this is its own discipline, rather than a property of observability or incident response, is that the question "which change caused this?" requires a different model of the world than the question "what is happening right now?" Observability platforms describe production state. Incident tools coordinate response to that state. Neither is built to take a code diff, generate expectations for what should and should not change after deploy, watch production for deviations against those expectations, and connect a deviation back to a specific pull request, owner, and commit.

Why generic monitoring misses bad deploys

Most teams have monitoring. Few teams have bad deploy detection. The gap is not a tooling oversight; it is a structural property of how generic monitoring is configured.

Static thresholds were not designed per change. A standard error-rate alert fires when the global error rate exceeds, say, 1% for five minutes. That works for catastrophic regressions. It does not work for a deploy that introduces a 422 response on a single endpoint affecting 3% of one customer segment. The global error rate may not move at all. A team running this alert configuration will never see the regression on a dashboard; the first signal will be a customer report, hours or days later. The deploy was bad. The monitoring was not configured to detect that particular kind of badness.
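
To see the arithmetic, consider an invented five-minute window in which the regressing endpoint carries 2% of traffic and now fails 30% of the time; every number below is made up for illustration, but the shape of the problem is general: the global rate stays under the static threshold while the per-endpoint view makes the regression obvious.

```python
# Hypothetical traffic for one five-minute window; nothing here is real data.
requests = {
    # endpoint: (request_count, error_count)
    "POST /checkout/confirm": (2_000, 600),   # regression: 30% now return 422
    "GET /products":          (60_000, 120),
    "GET /search":            (30_000, 150),
    "POST /login":            (8_000, 30),
}

GLOBAL_THRESHOLD = 0.01  # the classic "alert if global error rate > 1%" rule

total_requests = sum(count for count, _ in requests.values())
total_errors = sum(errors for _, errors in requests.values())
global_rate = total_errors / total_requests

print(f"global error rate: {global_rate:.2%}")            # 0.90% -> no alert
print(f"static alert fires: {global_rate > GLOBAL_THRESHOLD}")

# The same window viewed per endpoint makes the regression unmissable.
for endpoint, (count, errors) in requests.items():
    print(f"{endpoint}: {errors / count:.2%}")            # checkout: 30.00%
```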

Most monitoring is not change-aware. A dashboard does not know that a deploy happened, what code changed, what the change was intended to do, or which signals would distinguish "the change worked" from "the change introduced a regression." The dashboard shows latency and error rate; it cannot tell you that the deploy was supposed to reduce p99 latency by 15% and instead increased it by 8%. That kind of intent-based check has to be authored for each change, and hand-authoring bespoke checks does not scale for a team deploying several times per day.
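
For a sense of what such an intent-based check involves, here is a rough sketch that reuses the 15%-improvement example above; the field names, thresholds, and numbers are illustrative rather than a fixed schema.

```python
# Illustrative per-change expectation; the shape is made up for this sketch.
expectation = {
    "pr": "<pull request url>",
    "metric": "p99_latency_ms",
    "scope": "POST /checkout/confirm",
    "expected": "decrease",
    "by_at_least": 0.15,        # the change claims a 15% p99 improvement
    "evaluate_after": "30m",    # how long to wait before judging
}

def verdict(baseline: float, observed: float, expectation: dict) -> str:
    """Judge the post-deploy value against the change's stated intent,
    relative to the pre-deploy baseline."""
    relative_change = (observed - baseline) / baseline
    if (expectation["expected"] == "decrease"
            and relative_change <= -expectation["by_at_least"]):
        return "verified"
    return "regression against intent"

# p99 was supposed to drop by 15% but rose from 480ms to 518ms (~8% worse).
print(verdict(baseline=480.0, observed=518.0, expectation=expectation))
```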

Symptoms surface before causes, but tools rarely connect them. When a customer-facing metric degrades, the typical response is to page someone, look at dashboards, and form a hypothesis about what changed. The hypothesis-formation step is where most of the wall-clock time of an incident is spent. Engineers scroll through the deploy timeline, ask in Slack whether anyone shipped recently, and try to reason about which of the last five deploys is the most likely culprit. The answer is usually obvious in hindsight, but the diagnosis time is the cost.

Background noise drowns out real signal. A system with a steady 0.5% error rate looks healthy on a dashboard. The same system stepping from 0.5% to 1.5% within four minutes of a deploy is exhibiting a clear regression — but only if you know the deploy happened and you are comparing against the pre-deploy baseline rather than against a static alerting threshold. Many incidents that get classified as "unknown cause" or "transient" turn out, on postmortem inspection, to be deploys that drifted within the alerting band.

The throughline is that generic monitoring is built around system state, not around change events. Bad deploy detection inverts that: it starts from the change event and asks, "given what changed, what should we be watching, and did production behave as expected?"

What signals matter for bad deploy detection

The signals that actually catch bad deploys are not the signals most monitoring tools default to. The defaults — CPU, memory, global error rate, average latency — are mostly availability signals. They catch catastrophic failures and miss subtle regressions. Bad deploy detection requires a different signal mix.

Customer-facing success rates. Can users sign up, sign in, complete the action the system exists to support? These are the strongest leading indicators of a bad deploy because they capture the entire chain of dependencies between the user and the system. A deploy that subtly breaks payment confirmation will not necessarily move CPU or memory at all, but it will move the payment-success rate within minutes.

Per-endpoint and per-route error rates. Global error rate hides regressions concentrated on a single endpoint. A change to one route's request handler can drop that endpoint's success rate from 99.5% to 96% while the application-wide error rate barely moves. Bad deploy detection needs to monitor at the granularity where changes actually happen.

Latency percentiles, not averages. A regression that adds two seconds to p99 latency affects 1% of users seriously while moving the average latency only slightly. Averages are a poor signal for the kind of regression that matters in practice. p50, p95, and p99 each tell different parts of the story.
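
A small simulation illustrates the gap; the latency distribution is invented and the exact figures depend on the seed, but the pattern holds: shifting 1% of requests by two seconds barely moves the mean while p99 jumps.

```python
import random
import statistics

random.seed(0)

# Invented latency samples in seconds; after the deploy, exactly 1% of
# requests pick up an extra ~2s of latency and everything else is unchanged.
before = [max(0.001, random.gauss(0.20, 0.05)) for _ in range(10_000)]
after = [x + 2.0 if i % 100 == 0 else x for i, x in enumerate(before)]

def summarize(label, samples):
    cuts = statistics.quantiles(samples, n=100)  # cut points p1 .. p99
    print(
        f"{label}: mean={statistics.mean(samples):.3f}s  "
        f"p50={cuts[49]:.3f}s  p95={cuts[94]:.3f}s  p99={cuts[98]:.3f}s"
    )

summarize("before deploy", before)
summarize("after deploy ", after)
# The mean moves by ~0.02s and p50/p95 barely change, while p99 jumps from
# roughly a third of a second to about two seconds.
```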

Database-level signals. Many of the worst post-deploy regressions are database-driven: a query that worked on a small test dataset performs full table scans in production, an index becomes unused after a schema change, connection pool utilization climbs because a code change opens more connections than the previous version. These rarely surface in application-level metrics until the secondary effects (latency, error rate, timeout cascades) propagate up.

Absence of expected signals. Sometimes the most informative check is not "did something bad happen?" but "did the expected good thing happen?" A deploy that was intended to enable a new feature should produce traffic on the new endpoint; if the endpoint sees zero requests an hour after deploy, the feature is silently broken. A deploy intended to improve performance should show a measurable improvement; the absence of improvement is itself a regression against intent.
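
A minimal sketch of the "did the expected good thing happen?" check, assuming you can count requests per endpoint over a time range from whatever telemetry store you use; the query function below is only a stand-in for that lookup.

```python
from datetime import datetime, timedelta, timezone

def count_requests(endpoint: str, start: datetime, end: datetime) -> int:
    """Stand-in for a query against your telemetry store (a metrics or log
    query scoped to the endpoint and time range)."""
    raise NotImplementedError

def expected_signal_appeared(endpoint: str, deployed_at: datetime,
                             grace: timedelta = timedelta(hours=1)) -> bool:
    """Positive check: the deploy was supposed to light up a new endpoint.
    Zero traffic after the grace window is itself a regression against intent."""
    now = datetime.now(timezone.utc)
    if now - deployed_at < grace:
        return True  # too early to judge; keep waiting
    return count_requests(endpoint, deployed_at, now) > 0
```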

Time-windowed deviation, not point-in-time threshold. A deploy might look healthy in the first three minutes and degrade as caches warm or background jobs trigger. Bad deploy detection monitors continuously across a configurable window — typically 30 minutes to several hours, depending on the service and the risk tolerance — and compares each window against the pre-deploy baseline rather than against a fixed threshold.
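
As a sketch of the mechanism, the check below compares successive post-deploy windows against a pre-deploy baseline rather than a fixed threshold, reusing the 0.5% to 1.5% step from earlier as invented input; the window size and tolerance are placeholders you would tune per service.

```python
from statistics import mean

def baseline_relative_check(pre_deploy_rates, post_deploy_rates,
                            window_size=5, tolerance=2.0):
    """Evaluate successive post-deploy windows against the pre-deploy baseline.

    pre_deploy_rates / post_deploy_rates: per-minute error rates as fractions.
    window_size: minutes per evaluation window.
    tolerance: multiple of the baseline a window may reach before flagging.
    Returns the index of the first deviating window, or None if all pass.
    """
    baseline = mean(pre_deploy_rates)
    for i in range(0, len(post_deploy_rates) - window_size + 1, window_size):
        if mean(post_deploy_rates[i:i + window_size]) > baseline * tolerance:
            return i // window_size
    return None

# Invented input: a steady 0.5% baseline, then a step to ~1.5% a few minutes
# after the deploy. A static 2% alert never fires; the baseline-relative
# check flags the second five-minute window.
pre = [0.005] * 30
post = [0.005, 0.006, 0.005, 0.014, 0.015, 0.016, 0.015, 0.015, 0.014, 0.015]
print(baseline_relative_check(pre, post))  # -> 1
```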

The common property of these signals is that they are change-relative. They are interpreted not against an absolute number but against the system's behavior before the deploy in question. That requires the monitoring system to know when the deploy happened, which services it touched, and what the comparable baseline window looks like.

Why correlation to a specific change is the hard part

The hardest part of bad deploy detection is not detecting deviation. Anomaly detection has been a solved problem in different forms for decades. The hard part is correlating an anomaly to the specific change that caused it, with high enough confidence to act.

Multiple changes overlap. A team that ships ten times per day has overlapping deploy windows. When the payment-success rate dips at 14:07, the team needs to know which of the four deploys between 13:30 and 14:07 is the suspect. Temporal proximity is a weak signal — the most recent deploy is not always the cause. A robust attribution model looks at which services were touched, which code paths the changes hit, and what the changes were intended to do.
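
A deliberately simplified ranking sketch shows why service overlap should dominate recency; the PR numbers, timestamps, and service names are all invented, and a real attribution model would also weigh the code paths hit and the change's stated intent.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Deploy:
    pr: str
    deployed_at: datetime
    services_touched: set

def rank_suspects(deploys, anomaly_time, affected_services):
    """Rank candidate deploys for an anomaly: overlap with the affected
    services dominates, and recency only breaks ties."""
    candidates = [d for d in deploys if d.deployed_at <= anomaly_time]
    def score(d):
        overlap = len(d.services_touched & affected_services)
        minutes_ago = (anomaly_time - d.deployed_at).total_seconds() / 60
        return (overlap, 1 / (1 + minutes_ago))
    return sorted(candidates, key=score, reverse=True)

# Invented example: payment success dips at 14:07 and four deploys went out
# after 13:30. The most recent deploy touched only search, so the 13:42
# payments deploy outranks it despite being older.
deploys = [
    Deploy("PR-1841", datetime(2024, 5, 7, 13, 31), {"catalog"}),
    Deploy("PR-1844", datetime(2024, 5, 7, 13, 42), {"payments", "billing"}),
    Deploy("PR-1846", datetime(2024, 5, 7, 13, 55), {"notifications"}),
    Deploy("PR-1847", datetime(2024, 5, 7, 14, 4),  {"search"}),
]
suspects = rank_suspects(deploys, datetime(2024, 5, 7, 14, 7), {"payments"})
print([d.pr for d in suspects])  # ['PR-1844', 'PR-1847', 'PR-1846', 'PR-1841']
```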

Some failures are delayed. A race condition introduced at 09:00 may not manifest until the traffic peak at 11:30. A query regression may not bite until the table grows past a size threshold three days later. Pure temporal correlation breaks down in these cases. Detection systems need to be able to identify regressions that emerge well after the deploy event and still attribute them correctly.

Not every anomaly is deploy-caused. Upstream provider outages, regional network blips, scheduled batch jobs, traffic-driven hot spots, and external attacks all produce anomalies that look like deploy regressions but are not. A system that confidently blames every error spike on the most recent deploy will quickly lose credibility. Distinguishing deploy-caused regressions from environment-driven ones requires investigating the actual evidence — which code paths the anomaly touches, whether the change set could plausibly explain the observed behavior — rather than relying on temporal proximity alone.

Microservice deploys multiply the attribution problem. A "deploy" at 14:00 might be a coordinated release of three services. When the regression appears at 14:15, the question is not "which deploy?" but "which service in the deploy?" Granular service-level attribution is required.

This is why bad deploy detection is fundamentally a workflow problem, not just a metrics problem. It requires linking three things that usually live in different systems: the change (in source control), the deploy event (in CI/CD), and the production behavior (in telemetry). Without that linkage, the team is left to reconstruct the connection by hand during every incident.
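
The linkage can be pictured as three records joined around the deploy event; the field names below are illustrative, not a standard schema, but they show the minimum each system has to contribute for attribution to work without manual reconstruction.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Change:                  # from source control
    pr: str
    commit: str
    owner: str
    services_touched: list

@dataclass
class DeployEvent:             # from CI/CD
    change: Change
    environment: str           # staging / canary / production
    started_at: datetime
    finished_at: datetime

@dataclass
class BehaviorWindow:          # from telemetry
    deploy: DeployEvent
    baseline_start: datetime   # pre-deploy comparison window
    baseline_end: datetime
    observation_end: datetime  # how long after the deploy to keep watching
```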

For example, Firetiger's approach to bad deploy detection reads the PR diff and description, generates a deployment-specific monitoring plan describing what behavior the change is expected to produce, watches the deploy roll out across staging, canary, and production, detects deviations against the plan, and posts a per-deploy verdict — verified or regression detected — back to the PR, with the evidence the team needs to fix or roll back. Because the verdict is anchored to the specific PR, not to a generic error rate threshold, the attribution problem is resolved by construction rather than reconstructed after the fact.

How bad deploy detection differs from adjacent tools

Bad deploy detection sits in a category that is easy to confuse with the categories around it. The distinctions matter for tool selection.

Observability platforms (Datadog, New Relic, Honeycomb, Grafana) describe production state through telemetry. They are excellent at "what is happening right now?" They are not built around the change event and do not, by default, generate per-change monitoring plans or produce per-deploy verdicts. Most observability platforms can be wired up to flag deploys on a timeline, but the actual diagnosis — "this change caused this regression" — is still a human task. See the comparison: Firetiger vs Datadog.

Incident management platforms (PagerDuty, incident.io, Rootly) coordinate the response to a problem once it has been identified. They route alerts to the right humans, manage on-call rotations, run the incident timeline, and structure postmortems. They do not detect bad deploys; they assume the detection already happened and the alert already fired.

Feature flag platforms (LaunchDarkly, Statsig) limit the blast radius of changes by gating them behind flags. They reduce the impact of a bad change but do not, themselves, tell you that a change is bad. A flagged rollout that is causing a regression in the 10% of users seeing the new code still needs a detection mechanism to surface that regression.

Engineering intelligence platforms (LinearB, Swarmia, Jellyfish) report on team velocity and DORA-style trend metrics over weeks and months. They tell you the change failure rate trend; they do not detect change failures in the release loop.

The simplest mental model: observability shows symptoms, incident management coordinates response, feature flags limit blast radius, DORA dashboards report trends. Bad deploy detection is the layer that connects the change event to the symptom, fast enough and with enough evidence that the team can act in the release loop rather than during the postmortem.

Where to start

  • Instrument a deploy event stream. Whatever ships your code to production should publish an event — service, version, timestamp, commit, PR — that monitoring can subscribe to (a minimal example payload is sketched after this list). Without a clean deploy event source, all downstream correlation work becomes ad hoc.
  • Define per-service "what should change after a deploy" expectations. For your top three services, write down the signals that should move when a deploy succeeds (or stay flat) and the signals that should not move. This is the seed of a per-change monitoring plan.
  • Replace global thresholds with change-relative comparisons. For the same top services, set up monitoring that compares post-deploy windows to a pre-deploy baseline rather than to a fixed threshold. This is what catches subtle, deploy-caused regressions that static alerts miss.
  • Pilot a PR-aware system. A tool like Firetiger that reads each PR's diff, generates change-specific monitoring, and posts a per-deploy verdict to the PR can demonstrate the workflow end-to-end without rebuilding all of your monitoring. See also What is PR-based monitoring? and How to evaluate deploy verification tools.
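
As a starting point for the first item, a deploy event can be as small as a JSON document published by whatever job ships the release; the field names and placeholder values below are illustrative rather than a standard.

```python
import json
from datetime import datetime, timezone

# Illustrative deploy event, emitted by the job that ships the release. The
# point is that every deploy leaves a machine-readable record that monitoring
# can subscribe to; where you publish it (webhook, queue, table) is up to you.
deploy_event = {
    "service": "payments-api",
    "version": "2024.05.07-3",
    "commit": "<commit sha>",
    "pr": "<pull request url>",
    "environment": "production",
    "deployed_at": datetime.now(timezone.utc).isoformat(),
}

print(json.dumps(deploy_event, indent=2))
```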

Firetiger uses AI agents to monitor production, investigate incidents, and optimize infrastructure — autonomously. Learn more about Firetiger, get started free, or install the Firetiger plugin for Claude or Cursor.