Jon Moshier · Notes

DORA Metrics

The four (now five) software-delivery metrics from Google's DevOps Research and Assessment program, why they are paired into throughput and stability, and what the research does and does not license you to conclude.

DORA (DevOps Research and Assessment) is a research program that has run an annual State of DevOps survey since 2014, now housed at Google. Out of years of that survey data it distilled a small set of metrics that predict software-delivery performance. The metrics get quoted everywhere; the research design behind them, and its limits, get quoted almost never. That gap is where most misuse lives.

The four keys, split into two pairs

Four metrics, deliberately arranged as two throughput measures against two stability measures.

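Throughput: deployment frequency (how often changes reach production) and lead time for changes (how long a change takes to get from first commit to production).

Stability: change failure rate (the share of deploys that need a hotfix or rollback) and recovery time (how long it takes to restore service after a failure).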

The pairing is the whole design. Throughput without stability rewards shipping fast and broken; stability without throughput rewards shipping nothing. Measured together, a team cannot quietly trade one for the other, because a gain in speed that breaks things shows up in the stability pair. For a sense of scale at the slow end of the throughput range, low performers can take one to six months to ship a single change.
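The paired read can be sketched in a few lines of Python. This is a minimal illustration, not anything DORA publishes: the `Period` type, the `read_pair` function, and the wording of the labels are all hypothetical, and it uses only one metric from each pair (deployment frequency and change failure rate) to keep the logic visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    """One measurement window for a single team (hypothetical shape)."""
    deploys_per_week: float
    change_failure_rate: float  # fraction between 0.0 and 1.0


def read_pair(before: Period, after: Period) -> str:
    """Compare two periods on both axes at once and name the trade."""
    faster = after.deploys_per_week > before.deploys_per_week
    slower = after.deploys_per_week < before.deploys_per_week
    less_stable = after.change_failure_rate > before.change_failure_rate
    more_stable = after.change_failure_rate < before.change_failure_rate

    if faster and less_stable:
        return "traded stability for speed"
    if slower and more_stable:
        return "traded speed for stability"
    if faster:
        return "faster without losing stability"
    if slower and less_stable:
        return "slower and less stable"
    if more_stable:
        return "more stable at the same speed"
    if less_stable:
        return "less stable at the same speed"
    return "no change"


print(read_pair(Period(5, 0.10), Period(9, 0.25)))
```

Reading either number alone would report the first case as a win (more deploys) or the second as a win (fewer failures). Only the combined read names them as trades.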

The fifth metric, and why it was added

The 2024 State of DevOps report added rework rate: the proportion of deployments that are unplanned fixes for user-visible problems. DORA added it in part because change failure rate had been doing double duty as a proxy for rework volume; splitting the concept out measures it directly. The 2024 report found rework rate and change failure rate closely correlated, which is the point: both are stability signals, and having two makes the stability side of the ledger harder to fool. A 2023 report had already floated reliability (meeting operational targets like availability and latency) as an operational-performance dimension sitting alongside the delivery four. So the “four keys” are now more honestly five-plus-one.
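As arithmetic, rework rate under the definition above is a simple proportion. The sketch below assumes you have already labeled each deployment as an unplanned fix or not; that labeling is the hard part and is not shown. The function name and input shape are illustrative, not DORA's.

```python
from typing import Sequence


def rework_rate(is_unplanned_fix: Sequence[bool]) -> float:
    """Share of deployments that were unplanned fixes for user-visible problems.

    Takes one boolean per deployment in the period being measured.
    """
    if not is_unplanned_fix:
        raise ValueError("need at least one deployment to compute a rate")
    return sum(is_unplanned_fix) / len(is_unplanned_fix)


# 20 deployments in the period, 3 of them unplanned fixes.
deployments = [True] * 3 + [False] * 17
print(f"rework rate: {rework_rate(deployments):.0%}")
```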

What the research actually claims

DORA is not a vendor dashboard with a benchmark attached. It comes out of a specific research method. In Accelerate (2018), Nicole Forsgren, Jez Humble, and Gene Kim ran structural equation modeling on years of survey responses and found that software-delivery performance predicts organizational performance (profitability, productivity, market share). The four keys are the output of that model, not a list someone found intuitive. Worth keeping in view: the program now lives at Google, which sells the Cloud tooling that a case for faster, safer delivery tends to justify. The research predates the acquisition, but the incentive to promote it is real.

Underneath the metrics, the same research identifies roughly two dozen capabilities that drive them: technical practices (version control, trunk-based development, continuous integration, test and deployment automation, loosely coupled architecture), process practices (small batches, work-in-process limits, visual management), and cultural ones (Westrum’s generative culture, transformational leadership, psychological safety). The metrics are the scoreboard; the capabilities are the levers. Teams that try to move the scoreboard without touching the levers tend to be the ones that end up gaming.

The honest limit sits in the method itself. The findings are correlational and self-reported. No one has cleanly shown that hitting a given deployment frequency causes better business results, only that the two travel together across a large survey population. So the defensible use is diagnostic, not evaluative: the metrics locate where the delivery system is slow or unstable. They do not certify that a team is winning, and they say nothing trustworthy about any individual.

How teams misuse them

Goodhart’s law. Once a metric becomes an individual target, it stops measuring the system and starts measuring fear. Developers split PRs to inflate deployment frequency, suppress incident reports to protect CFR, and avoid risky-but-necessary work to keep their numbers clean. DORA’s own guidance is explicit that these are team-level and organization-level measures and should never be tied to individual performance or compensation. This is the single most common and most damaging error, and it is self-defeating: the moment people know a number can end their job, they optimize the number and the underlying signal dies.

A single key in isolation. Any one metric invites gaming precisely because the counterweight is missing. You can inflate deployment frequency with trivial commits or flatter CFR by shipping nothing. The metrics only work as a set. This same principle is why the successor frameworks fold DORA into a wider balance: the SPACE framework insists on multiple dimensions including developer perception, and DX Core 4 pairs every speed metric with an explicit quality one.

Cross-team ranking. Deployment frequency for a team shipping a monolith to regulated infrastructure is not comparable to a team shipping a stateless web service. The clusters (elite, high, medium, low) are useful for a team to locate itself and set a direction of travel, not for a leaderboard. The exact cluster boundaries shift year to year, so read them from the current annual report rather than a number memorized from an old one.

AI and the “tokenmaxxing” era

DORA’s own 2024 guidance on AI is the sharpest recent statement of why the pairing matters. Its finding: AI raises throughput and instability at the same time. Code gets generated faster, and more of it, so it moves through the top of the pipeline quickly, but the defect and rework pressure rises with it. Read the speed metrics alone and AI looks like a pure win. Read them against the stability pair and the real trade shows up. This is the same downstream-bottleneck story developed in SDLC Delivery Metrics for AI-Assisted Engineering: authoring gets cheap, review and stability stay human-paced, and the metrics that matter shift toward the stability side.

Try it

Compute your own four keys from Git and CI history (a weekend, GitHub + a script). Pull merged PRs and deploy events for one service over 90 days. Deployment frequency is a count of production deploys per week. Lead time is the median gap from first commit on a change to its production deploy. Change failure rate is the share of deploys followed by a hotfix or rollback within a window. Recovery time is the median incident-open to service-restored. Chart all four weekly and look for the trade the pairing is designed to expose: a period where deployment frequency climbed while CFR climbed with it is speed bought on credit, not a real gain. Google’s open-source Four Keys project does exactly this if you would rather not hand-roll it.
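A minimal hand-rolled sketch of those four calculations, assuming the deploy and incident records have already been pulled out of GitHub and CI into plain Python objects. The `Deploy` and `Incident` types, the `is_remediation` flag, and the 24-hour failure window are all assumptions made for illustration; fetching the data and deciding what counts as a hotfix are left to you.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional, Sequence


@dataclass(frozen=True)
class Deploy:
    deployed_at: datetime          # when the change reached production
    first_commit_at: datetime      # earliest commit on the change
    is_remediation: bool = False   # hotfix or rollback, labeled by you


@dataclass(frozen=True)
class Incident:
    opened_at: datetime
    restored_at: datetime


def _hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def four_keys(
    deploys: Sequence[Deploy],
    incidents: Sequence[Incident],
    period_days: int = 90,
    failure_window: timedelta = timedelta(hours=24),  # assumed window
) -> dict:
    if not deploys:
        raise ValueError("need at least one production deploy")
    ordered = sorted(deploys, key=lambda d: d.deployed_at)

    # A deploy counts as failed if any later hotfix or rollback
    # lands within failure_window of it.
    failed = 0
    for i, deploy in enumerate(ordered):
        cutoff = deploy.deployed_at + failure_window
        if any(n.is_remediation and n.deployed_at <= cutoff
               for n in ordered[i + 1:]):
            failed += 1

    recovery: Optional[float] = None
    if incidents:
        recovery = median(_hours(i.restored_at - i.opened_at)
                          for i in incidents)

    return {
        "deploys_per_week": len(ordered) / (period_days / 7),
        "lead_time_hours": median(
            _hours(d.deployed_at - d.first_commit_at) for d in ordered),
        "change_failure_rate": failed / len(ordered),
        "recovery_hours": recovery,  # None when there were no incidents
    }
```

To get the weekly chart, call `four_keys` once per week with `period_days=7` and only that week's records, then plot the four series side by side. Note that this sketch counts remediation deploys in the denominator of change failure rate; excluding them is an equally defensible choice, as long as it is applied consistently.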
