Examples of bad deploys static monitoring misses
Static dashboards and threshold-based alerting catch catastrophic regressions and miss most of the rest. The structural reason is that they aggregate too much, watch the wrong signals, and don't compare against a change-relative baseline. The six scenarios below cover common shapes of regression (partial endpoint failure, customer-segment failure, delayed memory leak, feature flag interaction, AI-generated semantic bug, degraded business metric). In every one, the deploy was bad, the dashboard was clean, and the team found out late: from a customer, from a colleague outside engineering, or from a page that said nothing about the deploy.
Why it matters
These shapes of regression aren't rare edge cases; they're the majority of post-deploy incidents. Across teams Firetiger has audited, more than two-thirds of detected regressions concentrate in a single slice (one endpoint, one customer tier, one region, one flag arm) without moving global metrics enough to trigger a threshold-based alert. The scenarios that follow make the structural argument concrete: generic monitoring is built around system state, not around changes, and the most common production regressions hide in the slices and time windows where state-based monitoring doesn't look. See "What is bad deploy detection?" and "What is production regression detection?" for the conceptual framing.
This article is a collection of scenarios. Each one starts from a small, plausible production change, describes what the standard monitoring stack saw at the time, then explains what was actually happening underneath. Names and numbers are illustrative, not from any specific customer.
1. Partial endpoint regression
The change. A backend team modifies the request handler for /checkout/confirm to use a new internal client library for payment authorization. The library is well-tested in isolation and behaves correctly in staging. The PR ships at 14:03.
What static monitoring saw. The service exposes a global error-rate alert that fires when application-wide errors exceed 1% for five minutes. The pre-deploy error rate was 0.4%. The post-deploy error rate is 0.6%. The alert never fires. The service dashboard shows green across the standard tiles. The team moves on.
What was actually happening. The new authorization library has a subtle bug in how it handles one of the upstream payment provider's response shapes. For the 6% of requests that exercise that response shape, the handler returns 502. For the other 94%, behavior is unchanged. The endpoint's success rate has dropped from 99.5% to 93.5% — a meaningful regression for that endpoint, harmful to a real subset of users, and entirely hidden by the global error-rate aggregation.
How change-aware verification catches it. A monitoring plan derived from the PR diff knows the change touched /checkout/confirm and the authorization client. It watches that endpoint's success rate specifically, against the pre-deploy baseline window for the same endpoint. The 99.5% → 93.5% step is visible in minutes and lands as a regression-detected verdict on the PR.
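The dilution in this scenario can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the `Window` type, the function names, the traffic mix (the endpoint carrying roughly 3% of requests), and the one-point drop tolerance are all assumptions chosen to match the numbers above, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Window:
    """Request counts for one endpoint over one observation window."""
    requests: int
    errors: int

    @property
    def success_rate(self) -> float:
        return 1.0 - self.errors / self.requests


def global_error_rate(windows):
    """Error rate aggregated over every endpoint, as a global alert sees it."""
    total = sum(w.requests for w in windows.values())
    return sum(w.errors for w in windows.values()) / total


def regressed_endpoints(before, after, watched, max_drop=0.01):
    """Return watched endpoints whose success rate fell by more than max_drop."""
    flagged = {}
    for endpoint in watched:
        drop = before[endpoint].success_rate - after[endpoint].success_rate
        if drop > max_drop:
            flagged[endpoint] = drop
    return flagged


# Hypothetical traffic mix chosen to reproduce the scenario's numbers.
before = {
    "/checkout/confirm": Window(requests=10_000, errors=50),    # 99.5% success
    "everything-else": Window(requests=290_000, errors=1_150),
}
after = {
    "/checkout/confirm": Window(requests=10_000, errors=650),   # 93.5% success
    "everything-else": Window(requests=290_000, errors=1_150),
}

print(f"global error rate: {global_error_rate(before):.1%} -> {global_error_rate(after):.1%}")
print(regressed_endpoints(before, after, watched=["/checkout/confirm"]))
```

The global rate stays under the 1% threshold, while the same data sliced to the one endpoint the diff touched shows a six-point drop against that endpoint's own baseline.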
2. Customer-segment failure
The change. A platform team updates a query that powers tenant-scoped analytics. The change rewrites the query to use a new join order that performed better in benchmarks against the median tenant's data. The PR ships during a low-traffic window.
What static monitoring saw. Query latency dashboards show p50 latency improving slightly. p99 latency moves up a bit, but stays within the noise band the team is used to. Database CPU is unchanged. Application error rates are flat. There is no obvious signal that anything is wrong.
What was actually happening. The benchmark tenant had roughly 50,000 rows in the central fact table. The team's largest enterprise customers have 50 million. For those tenants, the new join order causes the database to choose a plan that materializes a much larger intermediate result, which then takes 18 seconds to return instead of 200ms. For the small tenants that dominate the traffic mix, latency improved. For the few large tenants that dominate the revenue, the product is functionally broken.
The first signal is a support ticket from one of the enterprise tenants twelve hours after deploy. Investigation takes another six hours because the dashboards say everything is fine.
How change-aware verification catches it. A plan derived from the diff knows the change touches a tenant-scoped query path. It evaluates latency percentiles per tenant tier (or at least for the top N tenants by volume), and compares each slice against its own pre-deploy baseline. The 200ms → 18,000ms step on the largest tenants is immediately visible against the per-tenant baseline. The plan reports a regression in the enterprise segment specifically, with the affected query and tenants named.
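A minimal sketch of the per-slice comparison follows. It assumes latency samples are already grouped by tenant tier; the tier names, sample counts, nearest-rank percentile, and the 2x ratio threshold are hypothetical choices for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def latency_regressions(baseline, current, pct=99, max_ratio=2.0):
    """Compare each slice's percentile latency to that slice's own baseline.

    baseline and current map a slice name to a list of latencies in ms.
    Returns {slice: (before, after)} for slices that grew past max_ratio.
    """
    flagged = {}
    for slice_name, samples in current.items():
        before = percentile(baseline[slice_name], pct)
        after = percentile(samples, pct)
        if after > before * max_ratio:
            flagged[slice_name] = (before, after)
    return flagged


# Hypothetical samples: many small-tenant requests, few enterprise requests.
baseline = {
    "small": [200.0] * 4900 + [800.0] * 100,
    "enterprise": [200.0] * 20,
}
current = {
    "small": [150.0] * 4900 + [800.0] * 100,   # slightly faster
    "enterprise": [18000.0] * 20,              # 200ms -> 18s
}

pooled_before = [x for samples in baseline.values() for x in samples]
pooled_after = [x for samples in current.values() for x in samples]

print("pooled p50:", percentile(pooled_before, 50), "->", percentile(pooled_after, 50))
print("pooled p99:", percentile(pooled_before, 99), "->", percentile(pooled_after, 99))
print("per-slice:", latency_regressions(baseline, current))
```

With this mix the enterprise requests are under 1% of traffic, so the pooled p99 does not move at all and the pooled p50 improves, while the per-tier view flags the enterprise slice immediately.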
3. Delayed memory leak
The change. An engineer adds a small caching layer in front of a frequently-called external API. The cache uses an in-process map, keyed by request parameters, and was sized "generously" during development. The PR ships at 09:30.
What static monitoring saw. For the first three hours, latency improves on the wrapped endpoints. Memory consumption ticks up a few percent and plateaus. Standard metrics suggest the change is a clear win. The team uses the time for code review on other PRs.
What was actually happening. The cache has no eviction policy. The key space is much larger than the developer assumed: the parameters include user-specific identifiers that produce effectively unbounded distinct keys over time. Memory consumption looks plateau-shaped only because daily traffic is in a trough; as evening traffic rises, the map gains entries faster and, with nothing ever evicted, never gives the memory back. By 22:30 the process is at 92% of its memory limit. At 22:47 one pod OOMs, triggers a cascade of restarts, and pages the on-call engineer with an alert that says nothing about the morning deploy.
How change-aware verification catches it. A monitoring plan that knows the change adds a new in-process cache watches memory growth specifically for the deployed pods over an extended window — not just an acute post-deploy 30-minute check. The plan tracks RSS slope and compares it to the pre-deploy slope for the same service. Within an hour or two of the deploy, the post-deploy slope is materially higher; by mid-afternoon, the verdict has already updated to flag a memory regression long before the OOM cascade. The handoff names the change as the source and links the affected pods.
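The slope comparison described above can be sketched with an ordinary least-squares fit. Everything here is an assumption made for illustration: the sample format (hours since window start, RSS in MB), the function names, and the 20 MB/hour allowance.

```python
def slope(points):
    """Least-squares slope of y over x for a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var


def memory_growth_regressed(pre, post, max_extra_mb_per_hour=20.0):
    """True when post-deploy RSS grows faster than pre-deploy by more than the allowance."""
    return slope(post) - slope(pre) > max_extra_mb_per_hour


# Hypothetical hourly RSS samples for the same service: (hour, MB).
pre_deploy = [(h, 1000.0 + 2.0 * h) for h in range(6)]    # about 2 MB/hour
post_deploy = [(h, 1000.0 + 60.0 * h) for h in range(6)]  # about 60 MB/hour

print("pre slope:", slope(pre_deploy), "MB/h")
print("post slope:", slope(post_deploy), "MB/h")
print("regressed:", memory_growth_regressed(pre_deploy, post_deploy))
```

Comparing slopes rather than absolute usage is the point of the design: the process is nowhere near its limit when the check runs, but the rate of growth has already changed relative to the pre-deploy baseline.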
4. Feature flag arm interaction
The change. A team ships a new recommendation algorithm behind a 10% feature flag. The flag is enabled for users matching a low-risk segment. The team plans to ramp the flag to 25% the next morning if metrics look healthy.
What static monitoring saw. Global recommendation engagement metrics are flat. Global error rate is flat. Global latency is flat. The on-call dashboard is green. The team approves the next ramp.
What was actually happening. The new algorithm performs materially worse than the old one for the 10% cohort exposed to it: click-through is down 18% inside the flag arm. Inside the control cohort, click-through is unchanged. Averaged across all users, the cohort-level regression appears as a 1.8% drop (an 18% drop on 10% of traffic), which is within the noise floor of normal daily variation. The flagged rollout expands to 25%, then 50%, then 100% over the next week. By the end of the rollout, the team has shipped a worse algorithm to all users and the engagement metric has stepped down by 18%. The cause is identified only after a marketing analyst asks why engagement is off in the weekly report.
How change-aware verification catches it. A plan that knows the change is gated by a specific feature flag slices engagement metrics by flag arm and compares treatment to control. At the 10% rollout phase, the per-arm view shows treatment engagement is materially below control engagement, with statistical significance, within the first day. The verdict surfaces this as a regression in the treatment cohort and explicitly recommends holding the ramp.
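One way to make the treatment-versus-control comparison concrete is a pooled two-proportion z-test. This is a sketch under stated assumptions, not a description of how any specific product computes significance: the cohort sizes, the 10% control click-through rate, and the `hold_ramp` helper are hypothetical.

```python
import math


def two_proportion_z(success_a, total_a, success_b, total_b):
    """z statistic and two-sided p-value for rate_a - rate_b (pooled variance)."""
    rate_a = success_a / total_a
    rate_b = success_b / total_b
    pooled = (success_a + success_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_a - rate_b) / std_err
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def hold_ramp(treatment, control, alpha=0.01):
    """Recommend holding the ramp when treatment is significantly below control."""
    z, p_value = two_proportion_z(*treatment, *control)
    return z < 0 and p_value < alpha


# Hypothetical day of traffic: (clicks, users) per flag arm.
treatment = (820, 10_000)    # 8.2% click-through, 18% below control
control = (9_000, 90_000)    # 10.0% click-through

blended = (treatment[0] + control[0]) / (treatment[1] + control[1])
print(f"blended click-through: {blended:.2%} (vs 10.00% before)")
print("z, p:", two_proportion_z(*treatment, *control))
print("hold ramp:", hold_ramp(treatment, control))
```

The blended rate moves from 10.00% to 9.82%, the 1.8% relative drop that disappears into daily noise, while the per-arm test on the same data is far past any reasonable significance threshold.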
5. AI-generated semantic bug
The change. A developer asks an AI coding agent to add validation for a new optional field on a public API. The agent produces a clean-looking implementation: it adds the field to the request schema, validates it when present, and writes a unit test that passes. The diff is small. Review is fast.
What static monitoring saw. Application error rate is unchanged. Latency is unchanged. New issues in error tracking show nothing related. The deploy looks clean.
What was actually happening. The agent's implementation has a semantic bug specific to the production data shape: when the new field is absent (as it is for most legacy clients), the validator coerces it to an empty string and then the downstream consumer treats the empty string as a valid value, producing a different code path than the previous "field not present" case. For 11% of inbound requests — all legacy clients — the response now contains a quietly wrong status code that the clients silently accept. No exception is thrown. The bug is invisible to error tracking. It is also invisible to the unit tests, which exercise the "field present" case. It becomes visible four days later when a partner team notices an unrelated metric is off.
How change-aware verification catches it. A plan that reads the diff knows the change touches request validation on a public API. It watches the response-code distribution per endpoint, compared to the pre-deploy baseline, and the per-client-version distribution where available. The shift in response codes for the legacy-client cohort is visible as a deviation against the per-cohort baseline within the post-deploy watching window. The plan flags an unexpected change in output behavior, which is the right level of abstraction to catch a semantic bug that doesn't raise an exception. See also "How does AI-assisted development change deployment risk?"
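A response-code distribution shift can be measured without knowing in advance which code is wrong. The sketch below uses total variation distance between the pre-deploy and post-deploy distributions for each client cohort. The cohort names, the specific status codes, and the 0.1 threshold are assumptions for illustration; the scenario does not say which code the legacy clients received.

```python
from collections import Counter


def distribution(codes):
    """Map each status code to its share of responses."""
    counts = Counter(codes)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()}


def total_variation(p, q):
    """Total variation distance: 0.0 for identical distributions, 1.0 for disjoint ones."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def shifted_cohorts(baseline, current, threshold=0.1):
    """Return cohorts whose response-code mix moved by more than threshold."""
    shifts = {}
    for cohort, codes in current.items():
        distance = total_variation(distribution(baseline[cohort]), distribution(codes))
        if distance > threshold:
            shifts[cohort] = distance
    return shifts


# Hypothetical status codes observed per client cohort.
baseline = {
    "legacy-clients": [200] * 980 + [400] * 20,
    "current-clients": [200] * 990 + [400] * 10,
}
current = {
    "legacy-clients": [202] * 900 + [200] * 80 + [400] * 20,  # quiet code change
    "current-clients": [200] * 990 + [400] * 10,              # unchanged
}

print(shifted_cohorts(baseline, current))
```

No error-class code increases in this example, so an error-rate alert has nothing to fire on; the signal is purely that the mix of responses changed for one cohort.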
6. Degraded business metric with no global alert
The change. A web team ships a redesign of the checkout flow's mobile layout. The change is purely presentational on the surface — no backend logic touched, no schema changes, no API behavior changes. The PR is treated as low risk.
What static monitoring saw. Application error rates are unchanged. p99 latency is unchanged. Database performance is unchanged. The standard operational dashboards are green. The deploy is celebrated as smooth.
What was actually happening. The redesign introduced a subtle change to the layout of the payment form. On certain mobile browsers, the "submit" button is now partially obscured by a fixed-position element. Users who can see the button complete checkout normally; users on the affected browsers either don't notice the button or tap a different element that does nothing. The conversion rate on mobile drops from 4.1% to 3.4%, a 17% relative drop on a business-critical funnel. The change is invisible to anything the engineering team monitors. It is detected when someone reviews the revenue dashboard on Monday morning, three days after the Friday deploy.
How change-aware verification catches it. A plan that includes business-event metrics — checkout success rate, conversion rate, sign-up completion rate — watches them per deploy with the same change-relative baseline comparison as the operational signals. A redesign that ships on Friday and immediately steps the mobile conversion rate down 17% is plainly visible against the Friday-pre-deploy baseline. The plan flags the business metric regression as a deploy-correlated event and points at the redesign PR as the cause.
For example, Firetiger's monitoring plans are not constrained to error rates and latency. The plan includes whichever business signals the team has identified as material — checkout conversion, sign-up success rate, retention proxies, anything captured in telemetry — and evaluates them against pre-deploy baselines just like any other signal. When the redesign PR ships, the verdict that lands on the PR identifies the conversion-rate regression directly.
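The conversion-rate comparison can be worked through directly. This is a rough sketch that treats the pre-deploy rate as known and uses a binomial standard error as the noise band; the session counts, the three-standard-error cutoff, and the function names are assumptions, and a production check would likely also account for day-of-week effects.

```python
import math


def relative_drop(baseline_rate, current_rate):
    """Fractional drop of current_rate relative to baseline_rate."""
    return (baseline_rate - current_rate) / baseline_rate


def outside_noise_band(baseline_rate, current_rate, sessions, z=3.0):
    """True when current_rate is more than z binomial standard errors below baseline."""
    std_err = math.sqrt(baseline_rate * (1 - baseline_rate) / sessions)
    return baseline_rate - current_rate > z * std_err


baseline_rate = 0.041   # mobile conversion before the redesign
current_rate = 0.034    # mobile conversion after the redesign

print(f"relative drop: {relative_drop(baseline_rate, current_rate):.1%}")
print("20,000 sessions:", outside_noise_band(baseline_rate, current_rate, sessions=20_000))
print("2,000 sessions:", outside_noise_band(baseline_rate, current_rate, sessions=2_000))
```

The second call shows why the watching window matters: at low session counts the same 17% relative drop is still inside the noise band, so the check needs enough post-deploy traffic before it can commit to a verdict.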
The common thread
Six scenarios, six different shapes of regression. Each one is a realistic deploy that does real damage. In every case, the standard monitoring stack (global error rate, p99 latency, CPU/memory, application error tracking) saw nothing actionable or saw only the symptom much later. The reason is not that the monitoring was misconfigured. The reason is that it was watching at the wrong granularity, against the wrong baseline, and without knowledge of what the change was supposed to do.
The structural fix is the same in every case: produce a monitoring plan derived from what the PR actually changes, watch the slices and signals that change should affect, and compare each slice to its own pre-deploy baseline. This is the discipline of change-aware deploy verification. It is what PR-based monitoring and bad deploy detection are about as concepts, and it is the mechanism behind the rest of the Change Management category.
Where to start
- Audit your last ten postmortems against this list. For each, ask: which of the six scenario shapes was this? If most of them rhyme with the patterns above, the monitoring isn't broken — the monitoring isn't change-aware. That's the gap to close.
- Add per-slice baselines for your top services. For each high-frequency service, identify the slices (endpoint, customer tier, region, flag arm) where regressions are likely to concentrate. Start producing per-slice baselines so the comparison is change-relative rather than threshold-relative.
- Include business-event metrics in your deploy watching. Checkout conversion, sign-up success, retention proxies, any business signal that captures real user outcomes. Most operational dashboards don't include these; deploy verification should.
- Pilot a per-PR monitoring approach. A tool like Firetiger that reads each PR's diff, generates a change-specific plan, and produces a verdict on each deploy demonstrates the workflow end-to-end. See also How to evaluate deploy verification tools.