What are DORA metrics?

DORA metrics are a small set of indicators — deployment frequency, lead time for changes, change failure rate, mean time to recovery, and a newer reliability metric — that Google's DevOps Research and Assessment program identified as predictive of software delivery performance. They work best as a diagnostic surface that points teams toward real bottlenecks, not as targets to optimize directly.

Why it matters

DORA's State of DevOps research is the closest thing the engineering industry has to a unified theory of high-performing teams. Across more than a decade of survey data, the same four metrics keep separating elite organizations from low performers regardless of company size, industry, or stack. The gaps are dramatic: elite teams deploy on demand, multiple times per day, while low performers deploy less than once a month; elite teams recover from failure in under an hour, while low performers can take a week. DORA's central insight is that velocity and stability are not in tension: elite organizations ship more often, recover faster, and break things less often. The metrics are most useful when read together. A team reporting high deployment frequency and a 0% change failure rate is almost certainly under-counting failures; across teams Firetiger has audited, the actual change failure rate (CFR) is typically 30-40% higher than the reported number when measurement comes from tickets alone.

The original four metrics break down into two pairs. Deployment frequency and lead time for changes capture velocity: how often you ship and how long a change takes to reach production. Change failure rate and mean time to recovery capture stability: how often deployments cause problems and how quickly you recover. A more recent fifth metric, reliability, captures the consequence users actually experience: roughly, the share of recent deploys that were unplanned fixes for user-facing bugs. (DORA's 2024 report calls this particular measure rework rate.)

The five metrics, briefly

Deployment frequency — how often the team ships a production change. Elite performers deploy on-demand (multiple times per day); low performers deploy less than once per month.
Lead time for changes — the time from a code change being authored to it running in production. Different teams measure from different start points (first commit, PR opened, PR merged), and the choice changes what the metric tells you.
Change failure rate — the percentage of deployments that cause a degradation requiring remediation. Elite performers sit between 0% and 15%; low performers can be ten times higher.
Mean time to recovery — the average time from the onset of a production incident to the restoration of service. Elite performers recover in under an hour.
Reliability (the fifth DORA metric) — added in more recent State of DevOps reports, this metric asks roughly what percentage of recent deployments were unplanned fixes for user-facing bugs. It ties the other four to user experience.
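
To make the definitions above concrete, here is a minimal sketch of how the four original metrics could be computed from a list of deploy records. The `Deploy` record shape and the `dora_summary` helper are hypothetical names introduced for illustration, and the sketch assumes lead time starts at the commit timestamp and that a failure begins at deploy time. Real pipelines will make different choices on both.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    """One production deployment (hypothetical record shape)."""
    commit_at: datetime                       # when the change was authored
    deployed_at: datetime                     # when it reached production
    failed: bool = False                      # caused a degradation needing remediation
    recovered_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_summary(deploys: list[Deploy], window_days: int) -> dict:
    """Summarise the four original metrics over a window of `window_days` days."""
    failures = [d for d in deploys if d.failed]
    # Assumption: the incident starts at deploy time. Unrecovered failures are skipped.
    recoveries = [d.recovered_at - d.deployed_at for d in failures if d.recovered_at]
    lead_times = [d.deployed_at - d.commit_at for d in deploys]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time": median(lead_times) if lead_times else None,
        "change_failure_rate": len(failures) / len(deploys) if deploys else None,
        "mean_time_to_recovery": (
            sum(recoveries, timedelta()) / len(recoveries) if recoveries else None
        ),
    }
```

The sketch uses a median for lead time because a few long-lived branches can dominate a mean. That is a design choice, not something the DORA definitions require.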

DORA as a diagnostic, not a destination

The most common way teams misuse DORA is to treat the metrics as targets. Goodhart's law applies forcefully here: when a metric becomes a target, it stops being a useful measure. Teams that are evaluated on deployment frequency learn to split a single change across many trivial PRs. Teams evaluated on change failure rate learn to avoid filing incidents. Teams evaluated on MTTR learn to close tickets fast.

A more durable framing, one we hear consistently from engineering leaders, is to use DORA as a diagnostic surface that helps a team find where the real bottlenecks are. If lead time is dominated by build-and-test duration, the work is in CI infrastructure. If change failure rate is high during specific deploy windows, the work is in deploy verification. If MTTR is long because investigation takes hours, the work is in observability. The metrics themselves are not the goal; they point to the work.

Engineering leaders also commonly observe that DORA is a proxy for what they actually care about, not the thing itself. The deeper goal is usually framed as enabling a large, fast-moving team to ship safely without coordination overhead. DORA is the most widely accepted approximation of that goal, but it is still an approximation.

Why the way you measure DORA matters

There are many vendors offering DORA dashboards: LinearB, Jellyfish, Swarmia, Sleuth, DX, Datadog DORA, and others. Most of them compute the metrics in similar ways, and most of those ways depend on tickets, surveys, or pattern-matching against git history. That is the industry default — but it is not the only option, and it has well-known limitations.

Change failure rate is typically computed in one of three ways: pattern-matching git history for "revert"/"hotfix" commits, correlating deployments with incidents opened in PagerDuty within a time window, or counting tickets labeled "production incident" in Jira. Each of these depends on human discipline (filing the right tickets, applying the right labels) and each misses categories of real failures — most notably the silent regressions that never trigger a page.
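
As a rough illustration of the first two approaches, the sketch below flags remediation commits by message and attributes incidents to deploys by time window. The function names, the regular expression, and the one-hour default window are all assumptions made for this example, not any vendor's implementation.

```python
import re
from datetime import datetime, timedelta

# Assumed convention: remediation commits mention "revert" or "hotfix".
REMEDIATION = re.compile(r"\b(revert|hotfix)\b", re.IGNORECASE)


def looks_like_remediation(commit_message: str) -> bool:
    """Pattern-match a commit message for revert/hotfix wording."""
    return REMEDIATION.search(commit_message) is not None


def cfr_from_incidents(
    deploy_times: list[datetime],
    incident_times: list[datetime],
    window: timedelta = timedelta(hours=1),
) -> float:
    """Share of deploys followed by at least one incident within `window`."""
    if not deploy_times:
        return 0.0
    failed = sum(
        any(d <= i <= d + window for i in incident_times)
        for d in deploy_times
    )
    return failed / len(deploy_times)
```

Both functions show the limitation described above. A regression that nobody pages on, and that nobody fixes with a commit labeled "revert" or "hotfix", counts as a successful deploy. The window approach can also blame one incident on several deploys that landed close together.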

Mean time to recovery is almost universally measured as the lifecycle of an incident ticket: open-to-close. This rewards teams who update Jira quickly and penalizes teams who fix things faster than they file paperwork.

Deployment frequency and lead time depend on instrumenting the deploy pipeline and joining deploy events with commit metadata. The work is straightforward but the data is widely reported as painful to collect — GitHub API rate limits, varied CI/CD systems, monorepo-to-service mapping, and definitions of "production" that differ per service all slow teams down. It is common to find motivated engineering teams running homebrew ETL pipelines into a warehouse just to get clean DORA inputs flowing.
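
The join itself can be sketched in a few lines. The example below assumes change metadata has already been collected into a dictionary keyed by commit SHA, with hypothetical `first_commit`, `pr_opened`, and `pr_merged` timestamps. The `lead_times` helper and its field names are illustrative only.

```python
from datetime import datetime, timedelta


def lead_times(
    changes: dict[str, dict[str, datetime]],
    deploys: list[tuple[str, datetime]],
    start: str = "first_commit",
) -> list[timedelta]:
    """Join deploy events to change metadata by commit SHA.

    `start` selects where the clock begins: "first_commit", "pr_opened",
    or "pr_merged". Deploys with no matching metadata are skipped, not guessed.
    """
    results = []
    for sha, deployed_at in deploys:
        meta = changes.get(sha)
        if meta is None or start not in meta:
            continue
        results.append(deployed_at - meta[start])
    return results
```

Running the same data with `start="first_commit"` and `start="pr_merged"` usually gives very different numbers. The first includes development and review time, while the second isolates the deploy pipeline.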

A telemetry-grounded approach measures the same metrics from production behavior instead. Deployment frequency comes from deploy webhooks. Lead time comes from joining commit metadata to deploy events. Change failure rate comes from correlating deploys with anomalies observed in production telemetry on the services owning the changed code. MTTR comes from the time between a metric leaving its baseline and the same metric returning to baseline. This approach does not depend on ticket hygiene, label discipline, or extensive per-service configuration.
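
The recovery-window idea can be sketched as follows. This is a simplified illustration that assumes a fixed baseline value with a symmetric tolerance band and time-ordered samples. The `recovery_windows` name and its parameters are hypothetical, and a production system would likely need a learned baseline and some debouncing of noisy samples.

```python
from datetime import datetime
from typing import Optional


def recovery_windows(
    samples: list[tuple[datetime, float]],
    baseline: float,
    tolerance: float,
) -> list[tuple[datetime, Optional[datetime]]]:
    """Return (left_baseline_at, returned_at) pairs for a metric series.

    A sample is "at baseline" when abs(value - baseline) <= tolerance.
    `samples` must be sorted by timestamp.
    """
    windows: list[tuple[datetime, Optional[datetime]]] = []
    started: Optional[datetime] = None
    for ts, value in samples:
        in_band = abs(value - baseline) <= tolerance
        if not in_band and started is None:
            started = ts                      # metric just left its baseline
        elif in_band and started is not None:
            windows.append((started, ts))     # metric returned to baseline
            started = None
    if started is not None:
        windows.append((started, None))       # still degraded at end of series
    return windows
```

Averaging `returned_at - left_baseline_at` over the closed windows gives a recovery time that does not depend on when a ticket was opened or closed.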

For a fuller treatment, see Why ticket-based DORA metrics fall short.

How Firetiger measures DORA

Firetiger reads each PR diff, generates a deployment-specific monitoring plan, watches the deployment across staging, canary, and production, detects regressions, and investigates root cause. The same data that drives change-aware production monitoring also gives Firetiger the inputs for DORA: deploy events from GitHub webhooks, commit-to-deploy correlation for lead time, the Change Monitor's verdict for change failure rate, and telemetry-derived recovery windows for MTTR. Service mapping comes from traces tagged with service.name and service.version rather than customer-supplied YAML, so teams don't stall on the configuration step that typically blocks DORA implementations for months.
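
To illustrate why trace attributes can stand in for hand-written service maps, here is a minimal sketch that derives the first time each service version appeared in telemetry. It is not Firetiger's implementation. The span dictionary shape (`resource` and `timestamp` keys) and the `first_seen_versions` helper are assumptions for this example; only the `service.name` and `service.version` attribute names come from the text above.

```python
from datetime import datetime


def first_seen_versions(spans: list[dict]) -> dict[tuple[str, str], datetime]:
    """Map (service.name, service.version) to the earliest span timestamp seen.

    Spans missing either attribute are ignored rather than guessed at.
    """
    first_seen: dict[tuple[str, str], datetime] = {}
    for span in spans:
        attrs = span.get("resource", {})
        name = attrs.get("service.name")
        version = attrs.get("service.version")
        if name is None or version is None:
            continue
        key = (name, version)
        ts = span["timestamp"]
        if key not in first_seen or ts < first_seen[key]:
            first_seen[key] = ts
    return first_seen
```

Each new key is evidence that a version of a service started taking traffic, which can be matched against deploy events without a per-service configuration file.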

The point is not "we have a DORA dashboard." The point is that with the right telemetry already in place, DORA falls out as a side-effect of measurement rather than as a separate build.

Where to start

Pick one metric and measure it honestly for 90 days. Trying to instrument all five at once is how DORA projects stall. Start with deployment frequency (easiest) or change failure rate (most informative).
Decide what "production" means per service. This is the definition that quietly breaks every DORA implementation. Write it down before you start measuring.
Distinguish the diagnostic from the target. Use the numbers to find bottlenecks; do not set them as goals tied to compensation, and do not publish them as leaderboards.
Prefer telemetry-grounded measurement where you can. Ticket-based measurement is the industry default, but it under-counts silent regressions and penalizes teams who fix things faster than they file paperwork.