Jon Moshier / Notes / Writing Code vs. Shipping Code: AI Productivity Across Tool Generations draft
Note · From the Notebook

Writing Code vs. Shipping Code: AI Productivity Across Tool Generations

A 100,000-developer study finds that AI coding tools produce massive gains in raw coding activity but only modest gains in released software, because human bottlenecks at review and integration stages swallow most of the upstream productivity.

Demirer, Musolff, and Yang (NBER Working Paper 35275, May 2026) ask two questions: do productivity gains grow as AI coding tools advance across generations, and do those gains actually reach users as shipped software? Their answers are yes and mostly no.

Three generations, three magnitudes

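The study tracks 100,000+ GitHub developers matched with telemetry from Microsoft’s AI subscriptions. It covers three tool generations: autocomplete, synchronous (sync) agents that a developer drives interactively, and asynchronous (async) agents that take a handed-off task and execute it autonomously.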

The boundary is not clean. The paper notes that tools like Claude Code and Codex run in either mode depending on the task: a developer can drive them interactively (sync) or hand off a larger task they execute autonomously and return as a pull request (async).

Each generation substantially increases raw coding output. The cumulative effects on commits and lines of code, relative to matched non-adopters:

| Tool generation | Commits | Lines of code |
|-----------------|---------|---------------|
| Autocomplete    | +40%    | +228%         |
| + Sync agents   | +140%   | +741%         |
| + Async agents  | +180%   | (comparable)  |

Effects are larger for less active developers, but remain substantial across the activity distribution.

The attenuation pattern

The core finding is what happens as you move from code-writing metrics up the production hierarchy:

Lines of Code → Files → Commits → Pull Requests → Projects → Releases

Sync agents produce a 741% increase in lines of code. By the time you reach releases, that 741% has compressed to 20%. Async agents raise commits by 180% but releases by only 30%. The gains do not disappear. They are real. But each successive stage absorbs a large share of them.

This is not a measurement artifact. The authors validate with aggregate GitHub data: public new repositories and pull requests rose sharply from early 2025 onward, consistent with developer-level estimates. The attenuation is structural.
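The attenuation figures quoted above can be turned into a rough pass-through ratio. The sketch below is only arithmetic on the percentages in this note; the `pass_through` statistic and the dictionary layout are illustrative assumptions, not a measure the paper defines.

```python
# Percent gains quoted in this note, relative to matched non-adopters.
# The dictionary layout is a hypothetical convenience, not the paper's data format.
GAINS = {
    "sync agents": {"upstream": "lines of code", "upstream_pct": 741, "releases_pct": 20},
    "async agents": {"upstream": "commits", "upstream_pct": 180, "releases_pct": 30},
}


def to_multiplier(pct_gain: float) -> float:
    """Convert a percent gain (741 means +741%) into a level multiplier."""
    return 1 + pct_gain / 100


def pass_through(upstream_pct: float, downstream_pct: float) -> float:
    """Share of the upstream percent gain that is still visible downstream.

    This is an illustrative summary statistic, not one defined in the paper.
    """
    return downstream_pct / upstream_pct


for tool, g in GAINS.items():
    share = pass_through(g["upstream_pct"], g["releases_pct"])
    print(
        f"{tool}: {g['upstream']} x{to_multiplier(g['upstream_pct']):.2f}, "
        f"releases x{to_multiplier(g['releases_pct']):.2f}, "
        f"pass-through {share:.1%}"
    )
```

Read this way, the release gain is a few percent of the lines-of-code gain for sync agents and roughly a sixth of the commit gain for async agents. The two rows use different upstream metrics, so they are not directly comparable to each other.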

The paper extends the [private link] and Jones (2011, 2026) weak-link framework to a vertical production hierarchy. Each layer of software production combines upstream output (code, commits, PRs) with downstream human effort (review, integration, release management) via a CES production function. The key parameter is σ, the elasticity of substitution between AI output and human effort.

Calibrating σ to the observed attenuation pattern yields σ = 0.25, well below 1, which indicates strong complementarity: AI output and human effort behave far more like complements than substitutes. When σ < 1, even unbounded automation of upstream tasks yields finite total output gains, because the downstream human stage remains a constraint. The measured magnitudes show this directly: the paper’s Figure 1 reports cumulative all-tools effects of 17.3x on lines of code but only 1.3x on releases. The upstream firehose narrows to a trickle at the point of shipping.
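A minimal numerical sketch of why σ < 1 caps output, assuming a single CES layer with an equal-weight share parameter. The function names, the `share=0.5` default, and the normalization of human effort to 1 are assumptions for illustration; the paper's model stacks several layers and its calibrated weights are not reproduced here.

```python
def ces_output(upstream: float, human: float, sigma: float, share: float = 0.5) -> float:
    """One-layer CES aggregate of upstream output and downstream human effort.

    rho = (sigma - 1) / sigma; sigma must be positive and not equal to 1.
    `share` is an assumed weight on the upstream input, not a value from the paper.
    """
    rho = (sigma - 1) / sigma
    return (share * upstream**rho + (1 - share) * human**rho) ** (1 / rho)


def ces_ceiling(human: float, sigma: float, share: float = 0.5) -> float:
    """Limit of ces_output as upstream grows without bound (valid only for sigma < 1)."""
    rho = (sigma - 1) / sigma
    return (1 - share) ** (1 / rho) * human


SIGMA = 0.25  # calibrated value reported in this note
baseline = ces_output(1.0, 1.0, SIGMA)

# Scale the upstream input while holding human effort fixed at 1.
for upstream in (1.0, 2.0, 17.3, 1000.0):
    gain = ces_output(upstream, 1.0, SIGMA) / baseline
    print(f"upstream x{upstream:g} -> output x{gain:.3f}")

print(f"ceiling with unbounded upstream: x{ces_ceiling(1.0, SIGMA) / baseline:.3f}")
```

With these assumed weights the ceiling is about 1.26 times baseline, however large the upstream input gets. That number depends entirely on the assumed share, so it should not be read as the paper's estimate; the point is only that the curve flattens once human effort is held fixed.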

This differs from how AI coding tools are usually discussed. The gains are not capped because AI writes bad code. They are capped because pull request review and merge approval are still human-paced processes. The binding constraint shifts from writing to reviewing.

This is structurally related to the Bullwhip Effect and the broader systems-thinking observation that behavior in multi-stage production chains is governed by the weakest link, not the strongest. Where Systems Thinking places leverage at the design and goal levels, the leverage here sits in the review and integration stages that AI has not yet touched.

App marketplaces: more supply, flat usage

To measure whether increased developer output reaches users, the authors assembled monthly panels for four app marketplaces: Apple App Store, Google Play Store, Chrome Web Store, and SourceForge.

New application creation accelerated from mid-2025 onward (sharply on Apple and Chrome, mildly on Google Play, negligibly on SourceForge). But across all four, total app usage within the first three months of launch has not increased. The share of new applications that fail to attract any meaningful audience has risen.

Two interpretations: either the marginal apps are lower quality (AI-assisted code that passes but doesn’t deliver real value), or there is a consumer-side bottleneck in discovery and adoption that lags the supply expansion. The data cannot currently distinguish between them.

A complicating result: experienced developers

The large observational effects in Demirer et al. sit alongside a smaller, randomized study that points in a different direction. Becker, Rush, Barnes, and Rein (2025) ran a controlled experiment with 16 experienced open-source developers on 246 real tasks. They found that AI tools (primarily Cursor Pro with Claude 3.5/3.7 Sonnet) slowed experienced developers by 19%, even though the developers forecast a 24% speedup beforehand and estimated after the tasks that AI had helped by 20%.

The studies differ in population (16 highly experienced contributors vs. 100,000 mixed-activity GitHub users), design (RCT vs. observational matched event study), and context (pre-specified tasks in mature codebases vs. organic self-directed work). One plausible reconciliation: the observational study captures developers self-selecting into the tasks and workflows where AI helps most. Another: experienced developers in complex, mature codebases face different constraints than the median GitHub contributor.

The comparison matters because it raises the question of where in the activity distribution AI’s gains are concentrated, and whether they hold as developer experience and codebase complexity increase.

A separate RCT by Cui, Demirer, Jaffe, Musolff, Peng, and Salz (2026), published in Management Science, found a 26% increase in completed tasks across 4,867 developers at Microsoft, Accenture, and a Fortune 100 company. Less experienced developers again showed larger gains, consistent with the distribution pattern in Demirer et al.

Try it

Track your own production hierarchy (1–2 hours, any active repo). For one week, log: commits made, PRs opened, PRs merged, and features released. Then enable an async agent (GitHub Copilot Coding Agent or Codex) and repeat for the next week. Watch what changes at each layer. Expect the ratio of commits to releases to widen. The interesting signal is whether your review queue grows faster than your merge rate.
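One way to keep the weekly log described above is a small record per week plus a few derived ratios. This is a sketch: the `WeekLog` class, its field names, and the example counts are hypothetical placeholders, not measurements from any repo.

```python
from dataclasses import dataclass


@dataclass
class WeekLog:
    """Counts for one week at each layer of the production hierarchy."""
    label: str
    commits: int
    prs_opened: int
    prs_merged: int
    releases: int

    def commits_per_release(self) -> float:
        # Widens when code-writing output grows faster than shipped output.
        return self.commits / self.releases if self.releases else float("inf")

    def merge_rate(self) -> float:
        # Fraction of the week's opened PRs that were merged.
        return self.prs_merged / self.prs_opened if self.prs_opened else 0.0

    def queue_growth(self) -> int:
        # Net PRs added to the review queue this week.
        return self.prs_opened - self.prs_merged


# Placeholder numbers for illustration only; replace with your own counts.
baseline = WeekLog("baseline", commits=30, prs_opened=6, prs_merged=5, releases=2)
with_agent = WeekLog("with async agent", commits=70, prs_opened=14, prs_merged=8, releases=2)

for week in (baseline, with_agent):
    print(
        f"{week.label}: commits/release={week.commits_per_release():.1f}, "
        f"merge rate={week.merge_rate():.0%}, queue growth={week.queue_growth():+d}"
    )
```

A rising `queue_growth` alongside a falling `merge_rate` is the signal the exercise is looking for: review is not keeping pace with PR creation.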

Calibrate your own elasticity (an afternoon, any small project). Assign an async agent three clearly-scoped issues. Measure: time to first PR, time from PR to merge, and whether you merged without revision. The gap between PR creation and merge is your human bottleneck. If merging consistently takes longer than writing, review is your binding constraint, which is the regime that σ < 1 describes.
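The three measurements above can be tabulated with a few lines of code. This is a sketch under stated assumptions: the `IssueTiming` record, the `ts` helper, and the timestamps are hypothetical placeholders, and the resulting ratio is a rough diagnostic of where time goes, not an estimate of σ.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median


@dataclass
class IssueTiming:
    """Timestamps for one issue handed to an async agent."""
    name: str
    assigned: datetime
    pr_opened: datetime
    merged: datetime
    revised: bool  # True if the PR needed changes before merging

    def hours_to_pr(self) -> float:
        return (self.pr_opened - self.assigned).total_seconds() / 3600

    def hours_to_merge(self) -> float:
        return (self.merged - self.pr_opened).total_seconds() / 3600


def ts(text: str) -> datetime:
    """Parse an ISO 8601 timestamp such as '2026-01-05T09:00'."""
    return datetime.fromisoformat(text)


# Placeholder timestamps for illustration only; replace with your own log.
issues = [
    IssueTiming("issue-a", ts("2026-01-05T09:00"), ts("2026-01-05T09:40"), ts("2026-01-05T15:40"), False),
    IssueTiming("issue-b", ts("2026-01-06T09:00"), ts("2026-01-06T10:00"), ts("2026-01-07T10:00"), True),
    IssueTiming("issue-c", ts("2026-01-07T09:00"), ts("2026-01-07T09:30"), ts("2026-01-07T11:30"), False),
]

median_hours_to_pr = median(i.hours_to_pr() for i in issues)
median_hours_to_merge = median(i.hours_to_merge() for i in issues)
merged_clean = sum(1 for i in issues if not i.revised)

print(f"median time to first PR: {median_hours_to_pr:.2f} h")
print(f"median time from PR to merge: {median_hours_to_merge:.2f} h")
print(f"merged without revision: {merged_clean}/{len(issues)}")
print(f"review-to-writing ratio: {median_hours_to_merge / median_hours_to_pr:.1f}")
```

A ratio well above 1 means the human stage dominates elapsed time. Three issues is a very small sample, so treat the result as a prompt for closer tracking.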
