Jon Moshier / Notes / SDLC Delivery Metrics for AI-Assisted Engineering budding

SDLC Delivery Metrics for AI-Assisted Engineering

A working glossary of the delivery metrics that actually track engineering throughput (DORA keys plus PR-flow metrics), what each means, how it's measured, and what AI-assisted development is expected to move.

A dictionary of the numbers teams use to measure software delivery throughput, and how each behaves once AI writes a large share of the code. The through-line: AI floods the top of the pipeline with code, so the metrics that matter shift downstream to review, merge, and stability. Measuring authoring speed alone misses where the time actually goes. See Writing Code vs Shipping Code - AI Productivity Across Tool Generations for the study this rests on: a 100,000-developer event study where a 741% rise in lines of code compressed to a 20% rise in releases.

Two families of metric. DORA measures the delivery system end to end (commit to production, and stability once there). PR-flow metrics decompose the single stage where AI-era bottlenecks concentrate: the pull request. Use DORA to see whether the system got faster; use PR-flow to see why.

The substrate: why any of this is worth measuring

Ignore the dashboards for a moment. Tooling vendors (LinearB, GitKraken, Code Climate, DX, the GitPrime lineage) built the charts and popularized the vocabulary, but the dashboard is not the reason to believe the metrics. The reason is older than the tools and rests on three foundations, not on vendor benchmarks.

The honest limit: even Accelerate is correlational and self-reported. Nobody has cleanly shown that hitting a given number causes better business results. So the defensible use is diagnostic, not evaluative. These metrics find the bottleneck (Theory of Constraints). They do not certify that you are winning, and they say nothing trustworthy about an individual. Hold that thought; it is the whole ballgame when a number lands in a performance review.

DORA: the four keys and the fifth

The DORA program (DevOps Research and Assessment, now at Google) has measured software delivery since 2014. It tracks four metrics, two for throughput and two for stability, plus a fifth added in 2024. Elite-performer thresholds from the 2024 State of DevOps report are in each entry.

Deployment Frequency. How often an organization ships to production.

Lead Time for Changes. Time from a change committed to version control to that change running in production.

Change Failure Rate (CFR). Share of deployments that degrade service and need remediation (hotfix, rollback, fix-forward, patch).

Failed Deployment Recovery Time (formerly Time to Restore Service, often MTTR). How long to recover once a deployment fails in production.

Deployment Rework Rate (added 2024). Share of deployments that require an unplanned follow-up fix.
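To make the definitions concrete, here is a minimal sketch of how three of the keys fall out of a plain list of deployment records. The Deployment record, its field names, and the dora_summary helper are hypothetical, invented for illustration; your deploy tooling will have its own shape, and deciding what counts as a "failed" deployment is a judgment the code cannot make for you.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class Deployment:
    # Hypothetical record shape; map your own deploy log onto it.
    committed_at: datetime  # change committed to version control
    deployed_at: datetime   # change running in production
    failed: bool            # degraded service and needed remediation


def dora_summary(deploys: list[Deployment], window_days: int) -> dict[str, float]:
    """Summarise throughput and stability over a fixed observation window."""
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_hours = [
        (d.deployed_at - d.committed_at).total_seconds() / 3600 for d in deploys
    ]
    return {
        # Deployment Frequency: deployments per day in the window
        "deploys_per_day": len(deploys) / window_days,
        # Lead Time for Changes: median commit-to-production time
        "median_lead_time_hours": median(lead_hours),
        # Change Failure Rate: share of deployments that needed remediation
        "change_failure_rate": sum(d.failed for d in deploys) / len(deploys),
    }


# Made-up example data: four deployments over two days, one of them failed.
start = datetime(2024, 1, 1)
example = [
    Deployment(start, start + timedelta(hours=h), failed=(h == 4))
    for h in (2, 4, 6, 8)
]
print(dora_summary(example, window_days=2))
```

The median is used for lead time rather than the mean because a single long-running change would otherwise dominate the figure.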

PR-flow: decomposing the review bottleneck

DORA tells you the system slowed or sped up. These tell you where inside the pull request the time went. This is the layer AI-era measurement lives in, because review and merge are the human-paced stages that absorb upstream AI gains.

PR Cycle Time (time to merge). Total elapsed time from a pull request opened to merged.

Time to First Review (pickup time). From PR opened to its first review comment or approval.

Review Time. From first review to merge (the active back-and-forth).

PR Size. Lines changed (added plus deleted) per pull request.

Code Churn / Rework Rate. Code rewritten or deleted shortly after it was merged.
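The timing definitions above nest: for a reviewed PR, cycle time is time to first review plus review time. A small sketch of that decomposition for one pull request follows. The PullRequest and PRFlow types are hypothetical, not taken from any vendor's schema, and churn is left out because it needs post-merge history that a single PR record does not carry.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PullRequest:
    # Hypothetical record shape for illustration.
    opened_at: datetime
    first_review_at: Optional[datetime]  # None if merged without any review
    merged_at: datetime
    additions: int
    deletions: int


@dataclass
class PRFlow:
    cycle_time: timedelta                      # opened -> merged
    time_to_first_review: Optional[timedelta]  # opened -> first review (pickup)
    review_time: Optional[timedelta]           # first review -> merged
    size: int                                  # lines added plus deleted


def pr_flow(pr: PullRequest) -> PRFlow:
    """Split one PR's elapsed time into pickup and review stages."""
    if pr.first_review_at is None:
        # Unreviewed PRs have a cycle time but no pickup or review stage.
        pickup = review = None
    else:
        pickup = pr.first_review_at - pr.opened_at
        review = pr.merged_at - pr.first_review_at
    return PRFlow(
        cycle_time=pr.merged_at - pr.opened_at,
        time_to_first_review=pickup,
        review_time=review,
        size=pr.additions + pr.deletions,
    )
```

Keeping the unreviewed case as None rather than zero matters when aggregating: a PR merged without review should not pull the median pickup time down.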

AI-specific gauges, and the ones to distrust

Beyond delivery metrics, AI programs add usage gauges. Treat them as inputs, not outcomes.

The DORA program’s own 2024 guidance on the “tokenmaxxing” era: AI raises throughput and instability at once, so speed metrics are only meaningful when read against the stability pair.

How to read them without getting fooled

Four failure modes.

Goodhart’s law. The day a metric becomes an individual target, it stops measuring reality and starts measuring fear. Developers split PRs to inflate deployment frequency and suppress incident reports to protect CFR. Read these by team and by trend, never per-person, never tied to compensation. Team metrics create collaboration; individual metrics create gaming.

Speed without a counterweight. Every throughput metric needs a paired stability metric or it drives the wrong behavior. This is the design principle behind the DX Core 4, which unifies DORA, SPACE, and DevEx into four counterbalanced dimensions (speed, effectiveness, quality, impact) precisely so a gain in one cannot hide a loss in another. See DX Core 4.

The measurement lag. Allow three to six months after AI rollout before drawing conclusions. Developers need time to build effective AI workflows, and early numbers capture the learning curve, not the steady state.

Never an individual verdict. The one with real stakes. These are team-level, trend-level diagnostics, and using them to rank, PIP, or fire a person is both statistically indefensible and self-defeating. Indefensible because the signal is confounded: work is not randomly assigned, one engineer takes the six-week migration while another ships CRUD, and per-person samples are far too small to separate skill from assignment or from the luck of the sprint. Self-defeating because the moment people know a number can end their job, they optimize the number and the underlying data dies, so the manager loses the very signal they were trying to act on. DORA’s own guidance is explicit: never tie these to individual performance or compensation. A metric that justifies ending someone’s employment needs a causal, attributable, confound-free claim about that person, and none of these clear that bar. They point at where the system is slow. To judge a person you still have to go read the actual work.

The underlying discipline is Theory of Constraints: the only metric worth optimizing is the one at the current bottleneck, and AI moves that bottleneck from writing to reviewing. Optimizing authoring throughput once review is the constraint just builds inventory in front of the queue, which is also why Work-in-Progress Limits on open PRs often do more for cycle time than any coding-speed gain.

The minimum viable set

Fifteen metrics is a menu, not a dashboard. To measure an AI rollout, stand up four, in two pairs, and add the rest only when a question demands them:

Read all four by team and by trend, overlay AI adoption, and give it three to six months before trusting the numbers. If forced to watch a single leading indicator, watch time-to-first-review: it degrades first when AI floods the review queue.

Try it

Compute your own PR-flow distribution (1-2 hours, GitHub + a script). Pull merged PRs for one repo over 90 days via the GitHub CLI (gh pr list --state merged --limit 1000 --json createdAt,mergedAt,additions,deletions,reviews; set --limit explicitly, because gh returns only 30 PRs by default). Compute median time-to-first-review, PR cycle time, and PR size. Then split the set by whether the PR was AI-assisted (a label, or a co-author trailer). Look for the predicted pattern: AI PRs larger, cycle time similar or better, but time-to-first-review worse as the queue lengthens. If AI PR size is up and churn is up, the speed is being paid back in rework.
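One way the script half of that exercise might look, as a sketch rather than a finished tool. It assumes the gh output was saved to a file, that each entry in reviews carries a submittedAt timestamp (check this against your own export), and that the caller supplies the AI-assisted test. The function names and the "ai-assisted" label mentioned afterwards are hypothetical.

```python
import json
from datetime import datetime
from statistics import median
from typing import Callable, Optional


def parse_ts(value: str) -> datetime:
    # GitHub timestamps end in "Z"; normalise so older Pythons can parse them.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def load_prs(path: str) -> list[dict]:
    """Read the JSON array written by the gh command."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize(prs: list[dict]) -> dict[str, Optional[float]]:
    """Median cycle time, time-to-first-review, and size for a set of PRs."""
    cycle, pickup, size = [], [], []
    for pr in prs:
        opened = parse_ts(pr["createdAt"])
        cycle.append(hours_between(opened, parse_ts(pr["mergedAt"])))
        size.append(pr["additions"] + pr["deletions"])
        # Assumption: each review object has a "submittedAt" field.
        stamps = [
            parse_ts(r["submittedAt"])
            for r in pr.get("reviews", [])
            if r.get("submittedAt")
        ]
        if stamps:  # PRs merged without review contribute no pickup time
            pickup.append(hours_between(opened, min(stamps)))
    return {
        "count": len(prs),
        "median_cycle_hours": median(cycle) if cycle else None,
        "median_pickup_hours": median(pickup) if pickup else None,
        "median_size": median(size) if size else None,
    }


def split_summary(
    prs: list[dict], is_ai_assisted: Callable[[dict], bool]
) -> dict[str, dict]:
    """Summarise AI-assisted and other PRs separately."""
    ai = [p for p in prs if is_ai_assisted(p)]
    other = [p for p in prs if not is_ai_assisted(p)]
    return {"ai": summarize(ai), "other": summarize(other)}
```

If your team marks AI-assisted PRs with a label, adding labels to the --json field list and passing a predicate such as lambda p: any(l["name"] == "ai-assisted" for l in p.get("labels", [])) would do the split. Detecting a co-author trailer needs commit messages, which this export does not include.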

Instrument Lead Time for Changes (an afternoon, GitHub Actions). Emit a timestamped event at first commit and at production deploy, difference them per change, and chart the median weekly. Overlay AI adoption. The question the chart answers: did total lead time actually fall, or did coding time fall while review time rose to fill the gap? The second pattern is the attenuation result (releases rising far slower than code) showing up in your own data.
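The "difference them per change, and chart the median weekly" step could be sketched like this, assuming the two events have already been joined into (first commit, production deploy) timestamp pairs. The function name and the choice to bucket by the ISO week of the deploy are assumptions; bucketing by commit week is equally defensible as long as it stays consistent.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def weekly_median_lead_time(
    changes: list[tuple[datetime, datetime]],
) -> dict[str, float]:
    """Median commit-to-production hours, keyed by ISO week of the deploy."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for first_commit_at, deployed_at in changes:
        iso = deployed_at.isocalendar()  # (ISO year, ISO week, weekday)
        week = f"{iso[0]}-W{iso[1]:02d}"
        hours = (deployed_at - first_commit_at).total_seconds() / 3600
        buckets[week].append(hours)
    # Sort by week label so the result charts in chronological order.
    return {week: median(hours) for week, hours in sorted(buckets.items())}
```

Plotting this series next to a weekly AI-adoption figure gives the overlay the exercise asks for; the sketch deliberately stops short of charting.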
