Recursive Self-Improvement in AI

Recursive self-improvement (RSI) is the idea that an AI system could improve its own ability to improve itself, each generation building a better successor. The term covers two very different things in 2025: a 60-year-old theoretical loop that has never been observed, and a set of working systems that improve the code and prompts around a frozen model. The gap between them is where the interesting questions live.

The lineage: from Good to the Gödel machine

The seed idea is I.J. Good’s 1965 “Speculations Concerning the First Ultraintelligent Machine”: an ultraintelligent machine could design even better machines, setting off an “intelligence explosion” in which “the intelligence of man would be left far behind.” Good’s loop is about returns on cognitive investment. If improving cognition raises the rate at which you can improve cognition, and those returns do not diminish, the curve goes vertical.
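One way to make "the curve goes vertical" concrete is a toy growth law. This is an illustrative assumption, not Good's own formalism: let I(t) stand for capability, k for a positive rate constant, and α for the returns exponent, all hypothetical symbols introduced here.

```latex
\frac{dI}{dt} = k\,I^{\alpha}, \qquad I(0) = I_0 > 0,\quad k > 0

% \alpha = 1: constant returns, ordinary exponential growth
I(t) = I_0\,e^{kt}

% \alpha > 1: increasing returns, finite-time blow-up at t^{*}
I(t) = I_0\left[\,1 - (\alpha - 1)\,k\,I_0^{\alpha-1}\,t\,\right]^{-\frac{1}{\alpha-1}},
\qquad
t^{*} = \frac{1}{(\alpha - 1)\,k\,I_0^{\alpha-1}}
```

Under this sketch, everything hinges on whether α sits above or below 1. With α < 1 growth is only polynomial, with α = 1 it is exponential, and only α > 1 gives the vertical asymptote Good's argument needs.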

The first formal version is Jürgen Schmidhuber’s Gödel machine: a self-referential program that rewrites any part of its own code, including the rewrite logic, but only when it can prove the change raises expected reward. It is provably optimal and practically inert, because proving such theorems is intractable. That tension defines the field. The clean theoretical object can’t run; the things that run aren’t clean.

What is actually self-improving in 2025

Three systems mark the current frontier, and they share a structure the theory didn’t predict: the foundation model stays frozen, and what improves is the scaffold around it (the agent’s own code, prompts, and tools).

AlphaEvolve (Google DeepMind, May 2025). A Gemini-powered evolutionary coding agent. It mutates and recombines candidate algorithms, scores them with automated evaluators, and keeps the winners. It found a way to multiply two 4x4 complex matrices with 48 scalar multiplications, beating Strassen’s 1969 algorithm (49) for the first time in 56 years. It also improved Google’s data-center scheduling and chip design, and sped up training of the very Gemini models it runs on. That last part is a real, if shallow, recursive loop.
The Darwin Gödel Machine (Sakana AI, Zhang et al., arXiv 2505.22954). Named for Schmidhuber’s Gödel machine, it drops the proof requirement and substitutes empirical validation, hence “Darwin.” It is a coding agent that edits its own Python, then tests each variant on benchmarks. Instead of greedily keeping only the best variant, it maintains an archive of all variants (open-ended evolution) so dead-looking branches can be revisited. It raised its own SWE-bench score from 20.0% to 50.0% and its Polyglot score from 14.2% to 30.7%, and discovered behaviors no one coded: patch validation, better edit tools, a memory of past errors. The gains transferred across underlying models and programming languages.
Recursive training-loop methods. LADDER has a model generate easier variants of problems it can’t solve, then learn its way back up the ladder via RL, taking Llama 3.2 3B from 1% to 82% on undergraduate-level integration problems. RISE reframes a task as a multi-turn process and fine-tunes the model to correct its own prior answers. These methods touch the weights, but each round still needs a human-set curriculum or reward.
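The mutate, score, and archive pattern shared by the first two systems can be sketched on a toy problem. Everything here is a stand-in chosen for illustration: `TARGET`, `evaluate`, `mutate`, and the rank-weighted parent sampling are assumptions, not the selection rules used by AlphaEvolve or the Darwin Gödel Machine, and a random perturbation plays the role of an LLM-proposed edit.

```python
import random

TARGET = (3.0, -1.0, 2.0)  # hypothetical optimum, visible only to the evaluator


def evaluate(candidate):
    """Toy automated evaluator: negative squared distance to TARGET (higher is better)."""
    return -sum((c - t) ** 2 for c, t in zip(candidate, TARGET))


def mutate(parent, rng, step=0.5):
    """Stand-in for a model-proposed edit: perturb one coordinate at random."""
    child = list(parent)
    index = rng.randrange(len(child))
    child[index] += rng.gauss(0.0, step)
    return tuple(child)


def evolve(generations=300, seed=0):
    """Run the loop and return the full archive of (score, candidate) pairs."""
    rng = random.Random(seed)
    start = (0.0, 0.0, 0.0)
    archive = [(evaluate(start), start)]  # every variant is kept, not just the best
    for _ in range(generations):
        # Rank-weighted sampling: strong variants are chosen more often,
        # but weak ones keep a nonzero chance of being revisited.
        ranked = sorted(archive)  # ascending by score
        weights = list(range(1, len(ranked) + 1))
        _, parent = rng.choices(ranked, weights=weights, k=1)[0]
        child = mutate(parent, rng)
        archive.append((evaluate(child), child))
    return archive


archive = evolve()
best_score, best_candidate = max(archive)
print(len(archive), round(best_score, 4))
```

The design choice worth noticing is that the archive only grows. A greedy loop would keep a single incumbent and discard everything else, which is exactly what the archive-based approach avoids.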

The verifier is the load-bearing part

The common mechanism across every working system is a cheap, reliable verifier. AlphaEvolve has automated evaluators. The Darwin Gödel Machine has SWE-bench’s test suites. LADDER has a symbolic integration checker. The model proposes; the verifier disposes. Improvement compounds only in the slice of problem space where ground truth is cheap to check.
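A small sketch shows why a verifier can be far cheaper than the thing it checks. Finding an antiderivative is hard, but checking one only requires differentiating it. The code below is a numerical stand-in, not LADDER's actual checker: the integrand, the sample points, the tolerance, and the three hard-coded "proposals" are all assumptions made for illustration.

```python
import math


def numeric_derivative(F, x, h=1e-5):
    """Central-difference estimate of F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)


def verify_antiderivative(F, f, points, tol=1e-4):
    """Accept F only if F'(x) matches f(x) at every sample point."""
    return all(
        abs(numeric_derivative(F, x) - f(x)) <= tol * max(1.0, abs(f(x)))
        for x in points
    )


def integrand(x):
    return x * math.cos(x)


# Hypothetical model proposals for the antiderivative of x*cos(x).
proposals = {
    "x*sin(x)": lambda x: x * math.sin(x),
    "x*sin(x) + cos(x)": lambda x: x * math.sin(x) + math.cos(x),
    "0.5*x**2*sin(x)": lambda x: 0.5 * x**2 * math.sin(x),
}

points = [0.3, 0.9, 1.7, 2.5]
accepted = [
    name for name, F in proposals.items()
    if verify_antiderivative(F, integrand, points)
]
print(accepted)  # ['x*sin(x) + cos(x)']
```

Checking at a handful of points is a necessary test rather than a proof, so a production checker would want symbolic verification or many more samples. The asymmetry still holds: the check is a few lines, while the search is the hard part.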

This is the same boundary that shows up in “Model Collapse and the AI Data Crisis”: synthetic data helps when a verifier filters the garbage out of the loop (AlphaZero’s game outcomes, a formal proof, a unit test) and degrades the model when no such filter exists. Self-improvement and self-poisoning are the same loop with and without an oracle. Coding and math are improving fast because they come with verifiers attached. Taste, strategy, and open-ended writing don’t, which is why nobody has demonstrated RSI there. A January 2026 analysis, “On the Limits of Self-Improving in Large Language Models” (arXiv 2601.05280), argues this is structural, not temporary: without symbolic model synthesis, the singularity is not near.

Is it explosive? The compute-versus-labor debate

The live empirical question is whether automating AI research triggers Good’s runaway loop or hits a wall. Two pieces of evidence pull in opposite directions.

METR’s time-horizon work measures the length of task an agent can complete at 50% reliability and finds it doubling roughly every 7 months from 2019 to 2025, with their updated TH1.1 estimate putting the post-2023 doubling near 4.3 months. That is the curve an optimist extrapolates toward AI doing AI R&D.
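The gap between the two doubling times is easier to feel as arithmetic. The sketch below assumes clean exponential growth and uses two made-up endpoints, a 1-hour starting horizon and a 160-hour target (roughly one working month). Neither number comes from METR; only the 7-month and 4.3-month doubling times are taken from the text above.

```python
import math


def months_to_reach(start_hours, target_hours, doubling_months):
    """Months for an exponentially growing time horizon to go from start to target."""
    return doubling_months * math.log2(target_hours / start_hours)


START_HOURS = 1.0     # hypothetical current 50%-reliability horizon
TARGET_HOURS = 160.0  # hypothetical target: about one working month of effort

for doubling in (7.0, 4.3):
    months = months_to_reach(START_HOURS, TARGET_HOURS, doubling)
    print(f"doubling every {doubling} months -> {months:.1f} months to target")
```

Under these assumptions the slower trend needs roughly 51 months and the faster one roughly 31, a difference of well over a year from the same starting point. The extrapolation is only as good as the assumption that the exponential continues.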

Against that, Whitfill and Wu’s “Will Compute Bottlenecks Prevent an Intelligence Explosion?” estimates the elasticity of substitution between research compute and cognitive labor across OpenAI, DeepMind, Anthropic, and DeepSeek from 2014 to 2024. Their baseline model says compute and labor are substitutes (you can trade more thinking for less hardware). But their frontier-experiments specification, which accounts for the scale of state-of-the-art runs, says they are complements. If they are complements, output is capped by the scarcer input: even unlimited cognitive labor on fixed compute hits a hard ceiling, and a software-only intelligence explosion stalls there no matter how clever the software gets. The result depends on which specification you trust, which is the honest state of the debate. The 2025 systems automate research tasks; none has demonstrated the self-sustaining loop, and gains often shrink once inference-time tricks are stripped out, mirroring the human-review bottleneck documented in “Writing Code vs Shipping Code - AI Productivity Across Tool Generations.”
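The substitutes-versus-complements distinction can be seen in a textbook constant-elasticity-of-substitution (CES) production function. This is a sketch under stated assumptions: the functional form is the standard one, but the share parameter and the two elasticity values (0.5 and 2.0) are invented for illustration and are not the paper's estimates.

```python
def ces_output(compute, labor, sigma, share=0.5):
    """CES output Y = (a*C**rho + (1-a)*L**rho)**(1/rho), with rho = (sigma-1)/sigma.

    sigma is the elasticity of substitution and must not equal 1
    (that limit is the Cobb-Douglas case, which this sketch does not handle).
    """
    rho = (sigma - 1.0) / sigma
    return (share * compute**rho + (1.0 - share) * labor**rho) ** (1.0 / rho)


COMPUTE = 1.0  # hardware held fixed throughout

for labor in (1.0, 1e2, 1e4, 1e6):
    complements = ces_output(COMPUTE, labor, sigma=0.5)  # sigma < 1
    substitutes = ces_output(COMPUTE, labor, sigma=2.0)  # sigma > 1
    print(f"labor={labor:>9.0f}  complements={complements:.4f}  substitutes={substitutes:.1f}")
```

With sigma below 1, output approaches a fixed multiple of compute (here 2.0) however much labor is added. With sigma above 1, output keeps growing without bound. That is the whole disagreement between the two specifications, reduced to one parameter.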

Try it

Build a minimal Darwin-Gödel loop (a weekend, Python + any LLM API). Write a tiny coding agent (read file, edit file, run tests) as a single script. Give it one job: improve its own edit function so it passes more cases on a held-out set of SWE-bench Lite tasks or even a handful of LeetCode problems with hidden tests. Each iteration: ask the model to propose a code change to the agent, apply it in a sandbox, run the test suite, keep the variant in an archive scored by pass rate. Watch for two things the paper reports: emergent helper behaviors you didn’t prompt (it adds retry logic, logging, validation), and the moment progress flatlines once the easy verifier-gated wins are exhausted. The flatline is the whole argument made visible.
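A possible starting skeleton for the scoring half of that loop is sketched below. It runs each candidate in a fresh interpreter with a timeout and records the pass rate in an archive. All names (`run_case`, `pass_rate`, `improve`, `scripted_propose`) are hypothetical, and `propose` is a placeholder for a real LLM call. Note that a subprocess with a timeout is not a real sandbox, so untrusted model output should go in a container or VM, and the uniform parent sampling here is a simplification rather than the paper's selection rule.

```python
import os
import random
import subprocess
import sys
import tempfile


def run_case(source, call, expected, timeout=5):
    """Run candidate source plus one test call in a fresh interpreter."""
    program = f"{source}\nprint(repr({call}))\n"
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "candidate.py")
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(program)
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True, text=True, timeout=timeout, cwd=workdir,
            )
        except subprocess.TimeoutExpired:
            return False  # a hung candidate counts as a failure
    return result.returncode == 0 and result.stdout.strip() == repr(expected)


def pass_rate(source, cases):
    """Fraction of hidden test cases the candidate passes."""
    return sum(run_case(source, call, expected) for call, expected in cases) / len(cases)


def improve(seed_source, propose, cases, iterations=3, seed=0):
    """Archive-based loop: sample a parent, propose a child, score it, keep it."""
    rng = random.Random(seed)
    archive = [(pass_rate(seed_source, cases), seed_source)]
    for _ in range(iterations):
        _, parent = rng.choice(archive)  # simplification: uniform over the archive
        child = propose(parent)          # placeholder for an LLM-proposed rewrite
        archive.append((pass_rate(child, cases), child))
    return archive


# Tiny demonstration with a scripted "model" instead of a real API call.
SEED = "def clamp(x, lo, hi):\n    return x\n"
FIXED = "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"
CASES = [("clamp(5, 0, 10)", 5), ("clamp(-3, 0, 10)", 0), ("clamp(42, 0, 10)", 10)]


def scripted_propose(parent):
    return FIXED


archive = improve(SEED, scripted_propose, CASES, iterations=1)
scores = [score for score, _ in archive]
print(scores)
```

The test cases are passed to the scorer but never to `propose`, which is what keeps them hidden. Swapping `scripted_propose` for a function that sends the parent source to a model and returns its rewrite turns this into the experiment described above.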

Strip the verifier (1-2 hours, same setup). Re-run the loop but replace the test suite with the model grading its own output. Self-improvement should stall or reverse almost immediately. That contrast is the verifier dependency from the section above, reproduced on your laptop.
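Before running the real experiment, the expected contrast can be previewed with a toy hill climb. This is a deliberately rigged illustration, not evidence: it assumes the self-grader has a systematic bias (here, rewarding larger values regardless of correctness), and `TARGET`, `true_quality`, and `self_grade` are all invented for the sketch.

```python
import random

TARGET = 10.0  # hypothetical optimum, known only to the external verifier


def true_quality(x):
    """Ground truth: best at TARGET, worse the further away."""
    return -(x - TARGET) ** 2


def verifier_score(x):
    """External verifier: reports ground truth."""
    return true_quality(x)


def self_grade(x):
    """Assumed self-grader bias: more always looks better to the model."""
    return x


def hill_climb(scorer, steps=500, seed=0):
    """Accept a random proposal only when the scorer says it is an improvement."""
    rng = random.Random(seed)
    current = 0.0
    for _ in range(steps):
        candidate = current + rng.gauss(0.0, 1.0)
        if scorer(candidate) > scorer(current):
            current = candidate
    return current


verified = hill_climb(verifier_score)
self_graded = hill_climb(self_grade)
print("verified loop, true quality:   ", round(true_quality(verified), 3))
print("self-graded loop, true quality:", round(true_quality(self_graded), 1))
```

Both loops "improve" by their own scorer on every accepted step. Only the verified loop improves by the measure that matters, which is the failure mode to look for when the test suite is replaced with self-grading.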

Sources

Recursive self-improvement (Wikipedia) — Good’s intelligence explosion, seed AI, the Gödel machine.
Zhang et al., Darwin Gödel Machine: Open-Ended Evolution of Self-Improving Agents (arXiv 2505.22954) — the archive-based self-rewriting coding agent and its benchmark gains.
DeepMind, AlphaEvolve — evolutionary algorithm discovery, matrix multiplication and data-center results.
Whitfill & Wu, Will Compute Bottlenecks Prevent an Intelligence Explosion? (arXiv 2507.23181) — compute/labor substitutability across four frontier labs.
METR, Measuring AI Ability to Complete Long Tasks and Time Horizon 1.1 — the doubling-time evidence.
On the Limits of Self-Improving in Large Language Models (arXiv 2601.05280) — the case that RSI stalls without symbolic synthesis.