SDLC Delivery Metrics for AI-Assisted Engineering

A dictionary of the numbers teams use to measure software delivery throughput, and how each behaves once AI writes a large share of the code. The through-line: AI floods the top of the pipeline with code, so the metrics that matter shift downstream to review, merge, and stability. Measuring authoring speed alone misses where the time actually goes. See Writing Code vs Shipping Code - AI Productivity Across Tool Generations for the study this rests on: a 100,000-developer event study where a 741% rise in lines of code compressed to a 20% rise in releases.

Two families of metric. DORA measures the delivery system end to end (commit to production, and stability once there). PR-flow metrics decompose the single stage where AI-era bottlenecks concentrate: the pull request. Use DORA to see whether the system got faster; use PR-flow to see why.

The substrate: why any of this is worth measuring

Ignore the dashboards for a moment. Tooling vendors (LinearB, GitKraken, Code Climate, DX, the GitPrime lineage) built the charts and popularized the vocabulary, but the dashboard is not the reason to believe the metrics. The reason is older than the tools and rests on three foundations, not on vendor benchmarks.

Lean flow and small batches. From the Toyota Production System: shrink the unit of work and both quality and speed rise. In software the unit of work is the pull request, which is why change size is the best-evidenced lever in this whole note. Reinertsen’s Principles of Product Development Flow (2009) is the rigorous treatment of why queues, not effort, dominate delivery time.
Little’s Law. Average cycle time = average work-in-progress / average throughput, for a system in steady state. A theorem, not a benchmark. Cap WIP while throughput holds and cycle time falls in proportion, which is why limiting open PRs can beat any coding-speed gain and why Work-in-Progress Limits sits upstream of every flow metric here.
The Accelerate research. Forsgren, Humble, and Kim ran structural equation modeling on years of State of DevOps survey data and found delivery performance predicts organizational performance (Accelerate, 2018). DORA is the output of that program, not a vendor pitch.
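A minimal sketch of the Little’s Law arithmetic, applied to open PRs. The team numbers are hypothetical, and the relation only holds for averages in a reasonably stable system.

```python
def cycle_time_days(open_prs: float, merged_per_day: float) -> float:
    """Little's Law: average cycle time = average WIP / average throughput."""
    if merged_per_day <= 0:
        raise ValueError("throughput must be positive")
    return open_prs / merged_per_day


# Hypothetical team that merges 10 PRs a day in both scenarios.
uncapped = cycle_time_days(open_prs=30, merged_per_day=10)
capped = cycle_time_days(open_prs=15, merged_per_day=10)

print(f"30 open PRs: {uncapped} days per PR")
print(f"15 open PRs: {capped} days per PR")
```

Halving WIP halves the average cycle time without anyone typing faster. A coding speedup that only raises the number of open PRs moves the same equation the other way.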

The honest limit: even Accelerate is correlational and self-reported. Nobody has cleanly shown that hitting a given number causes better business results. So the defensible use is diagnostic, not evaluative. These metrics find the bottleneck (Theory of Constraints). They do not certify that you are winning, and they say nothing trustworthy about an individual. Hold that thought; it is the whole ballgame when a number lands in a performance review.

DORA: the four keys and the fifth

The DORA program (DevOps Research and Assessment, now at Google) has measured software delivery since 2014. Four metrics, split into two throughput and two stability, plus a fifth added in 2024. Elite-performer thresholds from the 2024 State of DevOps report are in each entry.

Deployment Frequency. How often an organization ships to production.

Measured: count of production deployments per unit time, or the inverse (time between deploys). Elite: on demand, multiple times per day.
AI moves: rises if review and release keep pace, but frequency in isolation is a vanity metric. Deploying 40 times a day says nothing if half are hotfixes. Watch it only alongside change failure rate.

Lead Time for Changes. Time from a change committed to version control to that change running in production.

Measured: timestamp of first commit on a change to timestamp of its production deploy, taken as a median or distribution, not a mean (tail-heavy). Elite: less than one day.
AI moves: this is the headline throughput number. AI compresses the coding portion of lead time sharply and the review-plus-integration portion barely at all, so the share of lead time spent in review grows. If total lead time does not fall, the coding speedup is being absorbed downstream.
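Why the median and not the mean: a sketch with a hypothetical sample of five changes, four shipped within a day and one stuck for two weeks. The timestamps are invented for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean, median


def lead_times_hours(changes):
    """changes: iterable of (first_commit, deployed) datetime pairs."""
    return [(deployed - committed).total_seconds() / 3600
            for committed, deployed in changes]


start = datetime(2025, 1, 6, 9, 0)
# Hypothetical sample: four quick changes and one long-tail straggler.
changes = [
    (start, start + timedelta(hours=4)),
    (start, start + timedelta(hours=6)),
    (start, start + timedelta(hours=8)),
    (start, start + timedelta(hours=20)),
    (start, start + timedelta(days=14)),
]

hours = lead_times_hours(changes)
median_hours = median(hours)
mean_hours = mean(hours)
print(f"median {median_hours:.1f} h, mean {mean_hours:.1f} h")
```

One straggler drags the mean past three days while the median still describes the typical change. Report the tail separately (a high percentile, for instance) rather than letting it set the headline number.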

Change Failure Rate (CFR). Share of deployments that degrade service and need remediation (hotfix, rollback, fix-forward, patch).

Measured: failed deployments divided by total deployments, over a window. Elite: around 5%.
AI moves: the critical guardrail. One enterprise study found AI-coauthored pull requests carried 1.7x more issues than human-only PRs, a correlation that partly reflects AI being pointed at larger or gnarlier changes, not proof AI writes worse code. Either way, speed gains that raise CFR are not gains; the rework erases them. Track CFR against lead time as a matched pair.

Failed Deployment Recovery Time (formerly Time to Restore Service, often MTTR). How long to recover once a deployment fails in production.

Measured: incident start (failed deploy detected) to service restored, median. Elite: under one hour.
AI moves: indirect. More frequent deploys mean smaller changes and a smaller blast radius each, which helps recovery; more change volume means more incidents to recover from. Net effect depends on whether the review gate holds.

Deployment Rework Rate (added 2024). Share of deployments that require an unplanned follow-up fix.

Measured: deployments needing a subsequent corrective deploy, divided by total. DORA added it after finding change failure rate was quietly acting as a proxy for rework volume.
AI moves: expected to rise with AI-generated code unless the review gate catches defects pre-merge. A direct read on “shipped faster, but did we ship right?”
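The deployment-level measures above can be sketched from a single deploy log. The record shape here (the Deploy class and its field names) is a hypothetical one, not a DORA schema; how a team flags a failed deploy or an unplanned follow-up fix is its own call.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Deploy:
    day: int                                  # day index within the window
    failed: bool = False                      # degraded service, needed remediation
    recovery_minutes: Optional[float] = None  # set only for failed deploys
    needed_followup_fix: bool = False         # an unplanned corrective deploy followed


def dora_summary(deploys, window_days):
    """Deployment frequency, CFR, median recovery time, and rework rate."""
    total = len(deploys)
    if total == 0 or window_days <= 0:
        raise ValueError("need at least one deploy and a positive window")
    failures = [d for d in deploys if d.failed]
    recoveries = [d.recovery_minutes for d in failures
                  if d.recovery_minutes is not None]
    return {
        "deploys_per_day": total / window_days,
        "change_failure_rate": len(failures) / total,
        "median_recovery_minutes": median(recoveries) if recoveries else None,
        "rework_rate": sum(d.needed_followup_fix for d in deploys) / total,
    }


# Hypothetical 10-day window: 20 deploys, 2 failures, 3 needing a follow-up fix.
deploys = [Deploy(day=i % 10) for i in range(17)]
deploys += [
    Deploy(day=1, failed=True, recovery_minutes=30, needed_followup_fix=True),
    Deploy(day=4, failed=True, recovery_minutes=50, needed_followup_fix=True),
    Deploy(day=7, needed_followup_fix=True),
]
summary = dora_summary(deploys, window_days=10)
print(summary)
```

Computing all four from the same records keeps the throughput number and its stability counterweights on one row, so a rising deploy count cannot be read without the failure and rework rates beside it.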

PR-flow: decomposing the review bottleneck

DORA tells you the system slowed or sped up. These tell you where inside the pull request the time went. This is the layer AI-era measurement lives in, because review and merge are the human-paced stages that absorb upstream AI gains.

PR Cycle Time (time to merge). Total elapsed time from a pull request opened to merged.

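Measured: PR open timestamp to merge timestamp, median across PRs. Healthy median for mid-market and enterprise teams: under 24 hours. Vendor dashboards often decompose a broader cycle time into coding time, pickup time, review time, and deploy time; only pickup and review fall inside the open-to-merge window defined here, with coding before it and deploy after.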
AI moves: the single best throughput proxy at the PR level. A study of teams going from 0% to 100% AI adoption saw a 24% reduction in median PR cycle time; a separate enterprise rollout saw a 31.8% reduction in PR review cycle time. Both are correlational, and the 24% figure is from DX, a vendor that sells productivity measurement, so read it as directional rather than settled.

Time to First Review (pickup time). From PR opened to its first review comment or approval.

Measured: PR open to first review event, median. A common rule of thumb is under 4 hours, and teams widely report that PRs picked up within a day are markedly more likely to merge without rework (a correlation, not a proven cause).
AI moves: the metric most likely to degrade under AI. When AI triples the number of open PRs, a fixed pool of reviewers means each PR waits longer for pickup. This is the queue where the attenuation happens. An AI review gate exists largely to defend this number.
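A toy model of the pickup queue, assuming a fixed daily review capacity and a steady arrival rate. Both numbers are hypothetical, and real queues are burstier than this, but the direction is the point.

```python
def pickup_backlog(arrivals_per_day, review_capacity_per_day, days):
    """Return the end-of-day count of PRs still waiting for a first review."""
    backlog = 0
    history = []
    for _ in range(days):
        # New PRs arrive, reviewers pick up as many as capacity allows.
        backlog = max(0, backlog + arrivals_per_day - review_capacity_per_day)
        history.append(backlog)
    return history


capacity = 10  # hypothetical: PRs the reviewer pool can pick up per day

before_ai = pickup_backlog(arrivals_per_day=8,
                           review_capacity_per_day=capacity, days=5)
after_ai = pickup_backlog(arrivals_per_day=24,
                          review_capacity_per_day=capacity, days=5)

# Little's Law estimate of the wait facing a PR opened at the end of day 5.
wait_days_after_ai = after_ai[-1] / capacity

print("backlog before:", before_ai)
print("backlog after: ", after_ai)
print("estimated pickup wait:", wait_days_after_ai, "days")
```

Below capacity the queue stays empty and pickup is fast. Once arrivals exceed capacity the backlog grows every day, so pickup time keeps climbing until either review capacity rises or arrivals are limited.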

Review Time. From first review to merge (the active back-and-forth).

Measured: first review event to merge, median. Separates “waiting to be looked at” (pickup) from “being worked through” (review).
AI moves: AI review bots shorten it by handling the mechanical first pass, freeing humans for judgment. Rises if AI-generated PRs are large or low-quality enough to trigger more revision rounds.

PR Size. Lines changed (added plus deleted) per pull request.

Measured: diff size per PR, median. Small changes review better, and this is the best-evidenced claim in the note: Google’s practices push for small changes, and the SmartBear/Cisco study of 2,500 reviews found defect detection falling off past 400 lines (roughly 87% yield under 100 lines, 28% over 1,000). Ideal is under 200. Change size is batch size wearing a GitHub costume, which is why it also predicts post-merge churn.
AI moves: AI tends to enlarge PRs (it generates code cheaply), which quietly degrades review quality and raises rework. A leading indicator to watch: if AI adoption rises and median PR size rises with it, review quality is silently eroding even if cycle time looks fine.

Code Churn / Rework Rate. Code rewritten or deleted shortly after it was merged.

Measured: share of lines changed that are revised within a recent window (commonly three weeks). The construct is well-grounded: Nagappan and Ball found relative code churn measures highly predictive of defect density on Windows Server 2003, separating fault-prone from non-fault-prone binaries with about 89% accuracy. The commonly quoted target of under 4% churn, though, is a vendor rule of thumb, not a finding.
AI moves: a direct quality read on AI output. Cheap-to-generate code that gets ripped out weeks later is negative throughput dressed as velocity. Rising churn alongside rising AI adoption is a strong signal the speed is being paid back in rework.
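A sketch of the churn calculation over a three-week window. It assumes revisions have already been attributed to the merge that introduced the lines (in practice via blame or a vendor tool); the record shape and sample data are hypothetical.

```python
from datetime import date, timedelta

CHURN_WINDOW = timedelta(days=21)  # the three-week window mentioned above


def churn_rate(merges, window=CHURN_WINDOW):
    """Share of merged lines revised within `window` of their merge date.

    merges: list of dicts with 'merged_on' (date), 'lines_changed' (int), and
    'revisions', a list of (revised_on, lines_revised) tuples for edits that
    touched lines introduced by that merge.
    """
    total = sum(m["lines_changed"] for m in merges)
    if total == 0:
        return 0.0
    churned = 0
    for m in merges:
        cutoff = m["merged_on"] + window
        revised = sum(n for on, n in m["revisions"]
                      if m["merged_on"] <= on <= cutoff)
        # Cap at the merge's size so repeat edits cannot push it past 100%.
        churned += min(revised, m["lines_changed"])
    return churned / total


merges = [
    {"merged_on": date(2025, 1, 6), "lines_changed": 200,
     "revisions": [(date(2025, 1, 10), 30),   # inside the window
                   (date(2025, 3, 1), 50)]},  # outside: ordinary maintenance
    {"merged_on": date(2025, 1, 8), "lines_changed": 100,
     "revisions": [(date(2025, 1, 20), 10)]},
]
rate = churn_rate(merges)
print(f"churn: {rate:.1%}")
```

Run it separately on AI-assisted and other merges and compare trends, not single readings. The window length changes the number a lot, so fix it before comparing periods.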

AI-specific gauges, and the ones to distrust

Beyond delivery metrics, AI programs add usage gauges. Treat them as inputs, not outcomes.

AI adoption / utilization. Share of developers actively using the tools, or share of merged PRs with AI assistance. Useful for tracking rollout; says nothing about value delivered.
Suggestion acceptance rate. Share of AI suggestions accepted. A vanity metric on its own: accepting more code is not shipping more value, and it correlates with larger PRs and more churn. High acceptance with rising rework is a warning, not a win.
Lines of code. Still the oldest trap. Measuring output by lines is like measuring a car’s speed by engine noise, and AI makes lines nearly free. Under AI, LoC measures the tool, not the team.

The DORA program’s own 2024 guidance on the “tokenmaxxing” era: AI raises throughput and instability at once, so speed metrics are only meaningful when read against the stability pair.

How to read them without getting fooled

Four failure modes.

Goodhart’s law. The day a metric becomes an individual target, it stops measuring reality and starts measuring fear. Developers split PRs to inflate deployment frequency and suppress incident reports to protect CFR. Read these by team and by trend, never per-person, never tied to compensation. Team metrics create collaboration; individual metrics create gaming.

Speed without a counterweight. Every throughput metric needs a paired stability metric or it drives the wrong behavior. This is the design principle behind the DX Core 4, which unifies DORA, SPACE, and DevEx into four counterbalanced dimensions (speed, effectiveness, quality, impact) precisely so a gain in one cannot hide a loss in another. See DX Core 4.

The measurement lag. Allow three to six months after AI rollout before drawing conclusions. Developers need time to build effective AI workflows, and early numbers capture the learning curve, not the steady state.

Never an individual verdict. The one with real stakes. These are team-level, trend-level diagnostics, and using them to rank, PIP, or fire a person is both statistically indefensible and self-defeating. Indefensible because the signal is confounded: work is not randomly assigned, one engineer takes the six-week migration while another ships CRUD, and per-person samples are far too small to separate skill from assignment or from the luck of the sprint. Self-defeating because the moment people know a number can end their job, they optimize the number and the underlying data dies, so the manager loses the very signal they were trying to act on. DORA’s own guidance is explicit: never tie these to individual performance or compensation. A metric that justifies ending someone’s employment needs a causal, attributable, confound-free claim about that person, and none of these clear that bar. They point at where the system is slow. To judge a person you still have to go read the actual work.

The underlying discipline is Theory of Constraints: the only metric worth optimizing is the one at the current bottleneck, and AI moves that bottleneck from writing to reviewing. Optimizing authoring throughput once review is the constraint just builds inventory in front of the queue, which is also why Work-in-Progress Limits on open PRs often does more for cycle time than any coding-speed gain.

The minimum viable set

Thirteen metrics are a menu, not a dashboard. To measure an AI rollout, stand up four, in two pairs, and add the rest only when a question demands them:

Lead time for changes (throughput) paired with change failure rate (stability). The DORA core: did the system get faster without getting less stable? This is the pair that survives Goodhart, because moving one the wrong way shows up in the other.
PR cycle time (throughput) paired with code churn / rework (stability). The PR-level pair that explains where the lead-time change came from and whether the speed is real or borrowed against future rework.

Read all four by team and by trend, overlay AI adoption, and give it three to six months before trusting the numbers. If forced to watch a single leading indicator, watch time-to-first-review: it degrades first when AI floods the review queue.
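The paired reading can be sketched as a small classifier over trend changes. The function name, the wording of the verdicts, and the noise band are all hypothetical; the only idea taken from the text is that a throughput change is never read without its stability partner.

```python
def read_pair(throughput_change_pct, stability_change_pct, noise_pct=0.0):
    """Classify a matched pair of team-level trend changes.

    Both inputs are percent changes in a lower-is-better metric: lead time or
    PR cycle time for throughput, change failure rate or churn for stability.
    noise_pct is a hypothetical tolerance band; moves inside it are ignored.
    """
    faster = throughput_change_pct < -noise_pct
    less_stable = stability_change_pct > noise_pct
    if faster and less_stable:
        return "faster but less stable: speed borrowed against rework"
    if faster:
        return "faster with stability holding: the gain looks real"
    if less_stable:
        return "less stable with no speedup"
    return "no meaningful movement yet"


# Hypothetical quarter-over-quarter changes for one team.
print(read_pair(-20, 15))               # lead time down 20%, CFR up 15%
print(read_pair(-20, -2, noise_pct=5))  # lead time down 20%, CFR flat
print(read_pair(1, 2, noise_pct=5))     # both inside the noise band
```

Feed it team-level trends over the three-to-six-month horizon, not weekly readings and never per-person numbers.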

Try it

Compute your own PR-flow distribution (1-2 hours, GitHub + a script). Pull merged PRs for one repo over 90 days with the GitHub CLI (gh pr list --state merged --json createdAt,mergedAt,additions,deletions,reviews). The CLI returns only 30 PRs by default, so raise --limit until the full window is covered. Compute median time-to-first-review, PR cycle time, and PR size. Then split the set by whether the PR was AI-assisted (a label, or a co-author trailer). Look for the predicted pattern: AI PRs larger, cycle time similar or better, but time-to-first-review worse as the queue lengthens. If AI PR size is up and churn is up, the speed is being paid back in rework.
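A minimal sketch of that script. Assumptions, none of which come from the exercise itself: each review object in the export carries a submittedAt timestamp (check your own output), AI-assisted PRs are marked with a label named ai-assisted, and labels has been added to the --json field list. The inline sample stands in for the exported file. Note that reviews covers submitted reviews only, so a plain comment will not count as pickup here.

```python
import json
from datetime import datetime
from statistics import median


def parse_ts(value):
    # Timestamps arrive as ISO 8601 ending in "Z"; normalize for older Pythons.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def hours_between(start, end):
    return (end - start).total_seconds() / 3600


def pr_flow(prs):
    """Median cycle time, pickup time, and size for a list of PR dicts."""
    cycle, pickup, size = [], [], []
    for pr in prs:
        opened, merged = parse_ts(pr["createdAt"]), parse_ts(pr["mergedAt"])
        cycle.append(hours_between(opened, merged))
        size.append(pr["additions"] + pr["deletions"])
        # Assumption: review objects expose "submittedAt".
        reviewed = [parse_ts(r["submittedAt"])
                    for r in pr.get("reviews", []) if r.get("submittedAt")]
        if reviewed:
            pickup.append(hours_between(opened, min(reviewed)))
    return {
        "prs": len(prs),
        "median_cycle_hours": median(cycle) if cycle else None,
        "median_pickup_hours": median(pickup) if pickup else None,
        "median_size_lines": median(size) if size else None,
    }


def is_ai_assisted(pr, label="ai-assisted"):
    # Hypothetical label name; swap in whatever marker your team uses.
    return any(lab.get("name") == label for lab in pr.get("labels", []))


def split_by_ai(prs):
    ai = [pr for pr in prs if is_ai_assisted(pr)]
    other = [pr for pr in prs if not is_ai_assisted(pr)]
    return {"ai": pr_flow(ai), "other": pr_flow(other)}


# Invented two-PR sample in the shape of the CLI export.
sample = json.loads("""[
  {"createdAt": "2025-03-03T09:00:00Z", "mergedAt": "2025-03-03T15:00:00Z",
   "additions": 40, "deletions": 10,
   "reviews": [{"submittedAt": "2025-03-03T11:00:00Z"}], "labels": []},
  {"createdAt": "2025-03-04T09:00:00Z", "mergedAt": "2025-03-05T09:00:00Z",
   "additions": 500, "deletions": 100,
   "reviews": [{"submittedAt": "2025-03-04T19:00:00Z"}],
   "labels": [{"name": "ai-assisted"}]}
]""")
result = split_by_ai(sample)
print(json.dumps(result, indent=2))
```

To run it on real data, redirect the gh output to a file and replace the inline sample with json.load on that file. Two PRs prove nothing; the split only means something across the full 90-day set.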

Instrument lead time for changes (an afternoon, GitHub Actions). Emit a timestamped event at first commit and at production deploy, difference them per change, and chart the median weekly. Overlay AI adoption. The question the chart answers: did total lead time actually fall, or did coding time fall while review time rose to fill the gap? The second pattern is the attenuation result (releases rising far slower than code) showing up in your own data.
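A sketch of the weekly chart’s underlying numbers. To separate coding time from everything downstream it assumes a third timestamp, PR opened, in addition to the two events named above; the field names and sample data are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def _hours(start, end):
    return (end - start).total_seconds() / 3600


def weekly_lead_time(changes):
    """Median hours per ISO week of deploy, split into coding and downstream.

    changes: dicts with 'first_commit', 'pr_opened', 'deployed' datetimes.
    'downstream' covers review, integration, and deploy (PR opened to deployed).
    """
    buckets = defaultdict(lambda: {"coding": [], "downstream": [], "total": []})
    for c in changes:
        iso = c["deployed"].isocalendar()
        week = (iso[0], iso[1])  # (ISO year, ISO week number)
        buckets[week]["coding"].append(_hours(c["first_commit"], c["pr_opened"]))
        buckets[week]["downstream"].append(_hours(c["pr_opened"], c["deployed"]))
        buckets[week]["total"].append(_hours(c["first_commit"], c["deployed"]))
    return {week: {part: median(vals) for part, vals in parts.items()}
            for week, parts in sorted(buckets.items())}


def change(commit, opened, deployed):
    return {"first_commit": commit, "pr_opened": opened, "deployed": deployed}


# Hypothetical data: the second week has faster coding but slower downstream.
changes = [
    change(datetime(2025, 1, 6, 9), datetime(2025, 1, 6, 17), datetime(2025, 1, 7, 9)),
    change(datetime(2025, 1, 7, 9), datetime(2025, 1, 7, 19), datetime(2025, 1, 8, 9)),
    change(datetime(2025, 1, 13, 9), datetime(2025, 1, 13, 11), datetime(2025, 1, 14, 9)),
    change(datetime(2025, 1, 14, 9), datetime(2025, 1, 14, 13), datetime(2025, 1, 15, 9)),
]
weekly = weekly_lead_time(changes)
for week, parts in weekly.items():
    print(week, parts)
```

In the sample, coding time drops from a median of 9 hours to 3 while downstream time rises from 15 to 21, leaving total lead time flat at 24. That flat total with a shifting split is the attenuation pattern to look for in real data.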

Sources

DORA, “DORA’s software delivery performance metrics”. Canonical definitions of the four keys and the fifth metric.
DORA, “Finding balance in the era of tokenmaxxing”. DORA’s own read on AI raising throughput and instability together.
DX, “Measuring developer productivity with the DX Core 4”. The counterbalanced framework unifying DORA, SPACE, and DevEx.
“Intuition to Evidence: Measuring AI’s True Impact on Developer Productivity,” arXiv:2509.19708 (2025). Source for the 31.8% review-cycle-time reduction and the 1.7x-more-issues finding on AI-coauthored PRs.
Google, “Engineering Practices: Code Review”. The small-change guidance behind the PR-size thresholds.
Sadowski et al., “Modern Code Review: A Case Study at Google,” ICSE-SEIP 2018. 9M reviews; most changes small, 70% merged within 24 hours, 80% one iteration or fewer. The empirical grounding for review-latency and change-size norms.
SmartBear / Cisco, “Code Review at Cisco Systems”. 2,500 reviews, 3.2M lines; defect detection drops off past 400 lines. The backbone for PR size as a quality lever.
Nagappan & Ball, “Use of Relative Code Churn Measures to Predict System Defect Density,” ICSE 2005. Relative churn measures predict defect density and separate fault-prone from non-fault-prone binaries at ~89% accuracy. The grounding for churn/rework as a quality signal.
Forsgren, Humble, Kim, Accelerate: The Science of Lean Software and DevOps (2018). The structural-equation research program behind DORA linking delivery performance to organizational performance.
Reinertsen, The Principles of Product Development Flow (2009). The Lean-flow and queueing theory (Little’s Law, cost of queues) underneath the PR-flow metrics.
GitKraken, “PR Cycle Time Benchmarks 2026”. Vendor benchmark for healthy PR-flow medians; useful directionally, not a controlled finding.