Jon Moshier

Model Collapse and the AI Data Crisis

Working through a Reddit doomer post on AI model collapse — what's real, what's overstated, and where synthetic data actually works.

This Reddit post — “AI is deteriorating in realtime” — got me thinking. It was a citation list plus a personal rant; the OP’s sources:

  • Shumailov et al. — AI Models Collapse When Trained on Recursively Generated Data. Nature, July 2024. https://www.nature.com/articles/s41586-024-07566-y
  • Villalobos et al. (Epoch AI) — Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data. ICML 2024. https://arxiv.org/abs/2211.04325
  • OpenAI — o3 and o4-mini System Card (April 2025). PersonQA hallucination benchmark.
  • Gartner — forecast on synthetic training data, projecting 60% of training corpora by 2024.
  • Duke University Library — Generative AI Student Survey (January 2025).
  • DeepMind — AlphaZero (chess/Go from self-play); AlphaGeometry (Olympiad-level geometry from synthetic data).
  • Ed Zitron — The Truth About the AI Bubble & The Software Decline. https://www.wheresyoured.at/
  • Gary Marcus — How an AI feedback loop threatens to break ChatGPT. https://garymarcus.substack.com/

Alongside the citations, the OP shared a personal anecdote: they worked at a data-labeling vendor “that rhymes with chlorophyll” (Scale AI is the obvious guess) and claimed ~90% of expert contributors were quietly using ChatGPT to do the work they were being paid to do as humans.


What the post is claiming

Three connected claims, stacked into a doomer narrative:

  1. Model collapse — LLMs trained on the output of previous LLMs progressively degrade.
  2. Data exhaustion — high-quality human-written text is running out, forcing labs to use synthetic data, which feeds claim 1.
  3. Contaminated supply chain — even the “expert human” data labelers that companies like Scale, Surge, Invisible, etc. hire are quietly using ChatGPT to do their work, so the human-data side is also poisoned.

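The narrative isn’t wrong, but it’s badly oversimplified. Each claim has a real technical core and a load-bearing nuance the post drops. The sections below split the argument into five claims rather than three: the synthetic-data question and the o3 hallucination numbers each get their own section, so the contaminated supply chain becomes Claim 4.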


Claim 1: Model collapse (Shumailov et al., Nature 2024)

What the paper actually shows. Train a generative model on data sampled from a previous generation of the same model, iterate, and the model’s output distribution degenerates in two stages:

  • Early collapse — the model loses information about the tails of the original distribution. Rare events, minority modes, and unusual phrasings vanish first because they’re under-sampled at every generation.
  • Late collapse — the distribution narrows toward a low-variance, generic mode that no longer resembles the original. Variance compresses, then bias takes over.

They demonstrate this on Gaussian mixture models, variational autoencoders, and an OPT-125M language model fine-tuned on its own outputs across generations.

Why it happens — three error sources compounding:

  1. Statistical approximation error — finite sample sizes always underrepresent the true distribution, especially the tails.
  2. Functional expressivity error — your model architecture can’t perfectly represent the underlying distribution.
  3. Functional approximation error — your optimizer doesn’t perfectly fit the data even when it could.

Each generation amplifies the previous generation’s errors. It’s a low-pass filter on the distribution, applied repeatedly.
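The first error source is easy to see in a toy setting. The sketch below is not the paper's experiment: it assumes a 50-token Zipf-like vocabulary, refits by simple counting, and uses hypothetical names (`next_generation`, `simulate`). Each generation samples a finite corpus from the previous fit, so a rare token that happens to draw zero samples is gone for good.

```python
import random
from collections import Counter


def next_generation(probs, n_samples, rng):
    """Sample n_samples tokens from probs, then refit by counting."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    sample = rng.choices(tokens, weights=weights, k=n_samples)
    counts = Counter(sample)
    # A token that was never sampled gets probability 0 and cannot come back.
    return {t: c / n_samples for t, c in counts.items()}


def simulate(generations=30, n_samples=200, vocab=50, seed=0):
    """Return the number of distinct surviving tokens after each generation."""
    rng = random.Random(seed)
    # Assumed "true" distribution: a few common tokens and a long tail.
    raw = [1.0 / (rank + 1) for rank in range(vocab)]
    total = sum(raw)
    probs = {f"tok{rank}": w / total for rank, w in enumerate(raw)}
    support_sizes = [len(probs)]
    for _ in range(generations):
        probs = next_generation(probs, n_samples, rng)
        support_sizes.append(len(probs))
    return support_sizes


if __name__ == "__main__":
    print("distinct tokens per generation:", simulate())
```

The list returned by `simulate()` can only shrink, which is the early-collapse tail loss in miniature. A larger `n_samples` slows the loss rather than preventing it.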

The nuance the post drops. Shumailov tested a pure self-loop — model N trains only on model N-1’s outputs. Frontier labs do not do this. They mix synthetic data with curated human data, filter aggressively, and use synthetic data primarily in domains where they have a verifier (more on this below). The follow-up “Is Model Collapse Inevitable?” line of work (Gerstgrasser et al., 2024) shows that accumulating real + synthetic data avoids collapse — it’s only replacing real with synthetic that kills you.
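The replace-versus-accumulate distinction can be sketched with a one-dimensional Gaussian standing in for the model. This is a toy under stated assumptions (20 samples per generation, maximum-likelihood refits, hypothetical function names `fit` and `run`), not a reproduction of either paper.

```python
import math
import random


def fit(data):
    """Maximum-likelihood fit of a 1-D Gaussian: mean and population std."""
    n = len(data)
    mu = sum(data) / n
    var = sum((x - mu) ** 2 for x in data) / n
    return mu, math.sqrt(var)


def run(mode, generations=200, n=20, seed=0):
    """Return the fitted std after repeatedly training on sampled data."""
    rng = random.Random(seed)
    real = [rng.gauss(0.0, 1.0) for _ in range(n)]  # the "human" data
    pool = list(real)
    mu, sigma = fit(pool)
    for _ in range(generations):
        synthetic = [rng.gauss(mu, sigma) for _ in range(n)]
        if mode == "replace":
            # Each generation sees only the previous model's output.
            pool = synthetic
        elif mode == "accumulate":
            # Real data and all earlier generations stay in the mix.
            pool = pool + synthetic
        else:
            raise ValueError(f"unknown mode: {mode}")
        mu, sigma = fit(pool)
    return sigma


if __name__ == "__main__":
    print("replace:   ", run("replace"))
    print("accumulate:", run("accumulate"))
```

In this toy, `run("replace")` drives the fitted standard deviation toward zero, while `run("accumulate")` leaves it close to its starting value, because each new synthetic batch is a shrinking fraction of the pool.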

So: model collapse is a real and proven phenomenon in the failure mode the paper studied. It does not directly demonstrate that GPT-5 will be worse than GPT-4.


Claim 2: We’re running out of training data (Villalobos / Epoch AI)

The Epoch paper estimates the stock of public, high-quality human-generated text and projects when it will be exhausted at current scaling rates. The headline numbers:

  • ~300 trillion tokens of usable public text exists today
  • Frontier models are training on ~10–50T tokens per generation
  • At Chinchilla-optimal scaling, the supply runs out somewhere between 2026 and 2032
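A back-of-the-envelope version of that projection, as a sketch only: the 300T stock is the figure quoted above, but the 15T starting dataset in 2024 and the yearly growth factors are illustrative assumptions, not Epoch's fitted values.

```python
def exhaustion_year(stock_tokens, start_year, start_dataset_tokens, growth_per_year):
    """First year in which the largest training set would reach the stock."""
    year, size = start_year, start_dataset_tokens
    while size < stock_tokens:
        year += 1
        size *= growth_per_year
    return year


STOCK = 300e12          # ~300T tokens of usable public text (from the list above)
START_YEAR = 2024       # assumption
START_DATASET = 15e12   # assumption, inside the ~10-50T range above

if __name__ == "__main__":
    for growth in (1.5, 2.0, 3.0):  # assumed yearly growth in dataset size
        year = exhaustion_year(STOCK, START_YEAR, START_DATASET, growth)
        print(f"growth {growth}x/year -> stock reached in {year}")
```

With these assumed inputs, growth factors of 1.5x, 2x, and 3x per year cross the stock in 2032, 2029, and 2027 respectively. The width of the quoted window is mostly uncertainty about how fast dataset sizes keep growing.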

This is why every lab is now scrambling for:

  • Synthetic data — model-generated training data, ideally filtered or verified. Gartner’s widely-cited forecast projected synthetic data would represent ~60% of training corpora by 2024; the exact share is debated and the methodology was always loose, but the directional shift toward synthetic is real and well underway.
  • Private/licensed data — Reddit, Twitter/X, Stack Overflow, news archives, publishing deals.
  • Multimodal data, since video and audio carry orders of magnitude more raw bits per hour of content than text does.
  • Long-tail human data — paying experts to write fresh demonstrations (this is what Scale/Surge sell).

The post’s framing — “trying to extend the internet’s knowledge base by pouring cups into the sea” — is a memorable line but slightly misses the goal. Labs aren’t trying to make the internet bigger; they’re trying to extract higher-quality signal from a finite source, plus generate verifiable synthetic data in narrow domains.


Claim 3: Synthetic data is poison — except when it isn’t

This is the most important asymmetry in the whole debate, and the post brushes past it.

Where synthetic data works:

  • AlphaZero (DeepMind, 2017) — learned superhuman chess and Go from pure self-play. Zero human game data.
  • AlphaGeometry (DeepMind, Nature 2024) — solved Olympiad-level geometry from a synthetic dataset of ~100M theorems generated by a symbolic engine.
  • Code models trained on execution traces — the compiler/test suite is ground truth.
  • Math reasoning trained against proof checkers (Lean, Coq).
  • RLAIF and Constitutional AI, where much of modern post-training relies heavily on model-generated preference data (classic RLHF, by contrast, starts from human preference labels).

The pattern: synthetic data works when you have a reliable verifier. Game outcome, formal proof, unit test pass/fail, type check. The verifier is the oracle that filters garbage out of the loop.
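The verifier pattern can be sketched as rejection sampling. Everything here is a hypothetical stand-in: `noisy_generator` plays the model (it gets addition wrong 30% of the time by assumption) and `verifier` plays the oracle. The point is only that an exact check turns a noisy generator into a clean dataset.

```python
import random


def noisy_generator(a, b, rng, error_rate=0.3):
    """Stand-in for a model: returns a + b, but is off by one some of the time."""
    answer = a + b
    if rng.random() < error_rate:
        answer += rng.choice([-1, 1])
    return answer


def verifier(a, b, answer):
    """The oracle: for addition the answer can be checked exactly."""
    return answer == a + b


def build_dataset(n, rng, use_verifier):
    """Generate n candidate examples, optionally dropping unverified ones."""
    data = []
    for _ in range(n):
        a, b = rng.randint(0, 99), rng.randint(0, 99)
        answer = noisy_generator(a, b, rng)
        if use_verifier and not verifier(a, b, answer):
            continue  # rejected samples never reach the training set
        data.append((a, b, answer))
    return data


def accuracy(data):
    """Fraction of examples in the dataset whose answer is correct."""
    return sum(verifier(a, b, ans) for a, b, ans in data) / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    print("unfiltered:", accuracy(build_dataset(1000, rng, use_verifier=False)))
    print("filtered:  ", accuracy(build_dataset(1000, rng, use_verifier=True)))
```

The filtered set is smaller but contains no wrong answers. Without a `verifier` to call, as in the open-ended cases, there is nothing to put in that `if` and the generator's errors flow straight into the next round of training.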

Where synthetic data fails:

  • Open-ended natural language with no oracle
  • General world knowledge (“write a paragraph about the French Revolution”)
  • Aesthetic/subjective judgment
  • Anything where “correct” is fuzzy

This is why model collapse is dangerous specifically for the open web / general-knowledge slice of training data — exactly where the contamination problem is worst. The AlphaZero/AlphaGeometry results don’t generalize to “synthetic data will save pretraining”; they prove the narrower point that synthetic data works with a verifier.


Claim 4: The contractor and student contamination problem

This is where the OP’s anecdote lands, and it’s well-supported:

  • A 2023 EPFL study (Veselovsky et al., Artificial Artificial Artificial Intelligence) estimated 33–46% of MTurk workers used LLMs to complete a text-summarization task they were being paid to do as humans.
  • Multiple journalist investigations (404 Media, NYT) have documented similar dynamics in Scale AI’s contractor workforce.
  • The Duke University Library Generative AI Student Survey (January 2025) found that a majority of surveyed Duke students reported using generative AI tools for coursework. That’s the same dynamic one tier upstream: the “human-written” text being produced right now — student essays, forum posts, blog comments — is increasingly co-written with LLMs. Future pretraining scrapes inherit that.
  • The economic incentive is the same in both cases: contractors are paid per task, students are graded per assignment, LLMs are faster, detection is hard.

The implication has two prongs:

  1. Paid pipelines. When OpenAI/Anthropic/Google buy “human-written” data from Scale, a meaningful fraction is silently AI-generated. The supposedly-clean side of the training mix is contaminated.
  2. The open web itself. The “scrape the internet” pretraining recipe is eating its own tail in slow motion — every year, a larger fraction of new web text was at least partially generated by a previous model.

Counter-nuance. Labs are aware of both problems and run statistical detection (perplexity outliers, stylistic fingerprints, agreement-vs-AI checks) on contractor outputs, plus increasingly aggressive source curation in pretraining. It’s an arms race, not a clean kill. The signal is degraded, not absent. And the OP’s “90% of contributors” number is one person’s experience at one vendor — directionally consistent with the published studies, but not a measurement.
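As a rough illustration of what an agreement-vs-AI check could look like, here is a minimal sketch. It is one possible reading of the idea, not any vendor's actual pipeline: the function names, the word-trigram features, and the 0.5 threshold are all assumptions. It compares a contractor submission against outputs a model produced for the same prompt and flags heavy overlap.

```python
def ngrams(text, n=3):
    """Set of lowercase word n-grams in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_submission(submission, model_outputs, threshold=0.5):
    """Return (flagged, best_overlap) against a list of model outputs."""
    sub = ngrams(submission)
    best = max((jaccard(sub, ngrams(m)) for m in model_outputs), default=0.0)
    return best >= threshold, best


if __name__ == "__main__":
    outputs = ["the french revolution began in 1789 and reshaped european politics"]
    print(flag_submission(outputs[0], outputs))
    print(flag_submission("honestly i mostly remember the guillotine part from school", outputs))
```

A verbatim paste scores 1.0 and is flagged, while unrelated text scores 0.0. Light paraphrasing lands somewhere in between, which is why this kind of check degrades the contamination signal rather than eliminating it.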


Claim 5: o3 hallucinates more than o1 (OpenAI System Card, April 2025)

This is genuinely interesting and the post is roughly right. On the PersonQA benchmark, OpenAI’s own system card reported:

  • o1: ~16% hallucination rate
  • o3: ~33%
  • o4-mini: ~48%

OpenAI’s stated explanation: o3 makes more claims overall, which produces more accurate claims as well as more inaccurate ones, and the card adds that more research is needed to understand the cause. On PersonQA the net effect is a higher hallucination rate. This is a real regression on this specific benchmark, and OpenAI did not hide it.
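A toy calculation shows how "more claims" can push both numbers up at once. The counts below are made up for illustration and are not PersonQA data, and scoring both rates over all questions is an assumption about the metric, not the benchmark's documented definition.

```python
def score(n_questions, n_correct, n_wrong):
    """Accuracy and hallucination rate over all questions; the rest are abstentions."""
    return {
        "accuracy": n_correct / n_questions,
        "hallucination_rate": n_wrong / n_questions,
        "abstained": n_questions - n_correct - n_wrong,
    }


# Hypothetical cautious model: answers half the questions, mostly correctly.
cautious = score(n_questions=100, n_correct=40, n_wrong=10)

# Hypothetical eager model: answers almost everything, with more misses.
eager = score(n_questions=100, n_correct=57, n_wrong=38)

if __name__ == "__main__":
    print("cautious:", cautious)
    print("eager:   ", eager)
```

In this example the eager model is both more accurate and more prone to hallucination, because it converts abstentions into a mix of right and wrong answers. That is the calibration-versus-capability trade-off in its simplest form.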

It does not generalize to “all models are getting worse.” Anthropic’s Opus 4 series, Google’s Gemini 2.5, and DeepSeek’s recent models show different hallucination trajectories on different benchmarks. The picture is “trade-offs between calibration and capability,” not “uniform decline.”


The bear case: Zitron and Marcus

The OP cites the two most prominent skeptics. Worth separating their arguments because they’re attacking different things and the OP collapses them into one.

Gary Marcus — How an AI feedback loop threatens to break ChatGPT (Substack). Marcus’s argument is technical: Claim 1 + Claim 4 dressed up as a structural reason LLMs will plateau. He’s the cognitive-science skeptic — LLMs lack symbolic reasoning, hallucinate structurally, and the synthetic-data contamination dynamic accelerates the ceiling. His framing of the feedback loop is the cleanest layperson summary of the concern. The weakness: he tends to read Shumailov’s pure self-loop result as the operating regime of frontier labs, which it isn’t — Gerstgrasser’s accumulation result is the more honest engineering picture.

Ed Zitron — The Truth About the AI Bubble & The Software Decline (Where’s Your Ed At / Better Offline). Zitron’s argument is economic: revenue doesn’t justify the capex, OpenAI’s unit economics don’t work, the bubble will pop. He’ll cite hallucination regressions or product regressions as supporting evidence, but the core claim is about valuations and runway, not model quality.

These can both be partially right. “The AI economics are unsustainable” and “the models are technically improving” are compatible claims. The most common mistake in this discourse — and the one the OP makes — is fusing them into a single “AI is collapsing” narrative.


What I’d actually take away

  1. Model collapse is real but conditional. Pure self-training loops collapse. Real labs don’t do that. Filtered + mixed + verified synthetic data is fine, sometimes better than human data.
  2. Data exhaustion is real and forcing real strategic shifts — into RL, into multimodal, into licensing deals, into post-training-heavy regimes (test-time compute, reasoning chains).
  3. Web contamination is the most under-addressed risk — once a large fraction of new web text is AI-assisted, the “scrape the internet” pretraining recipe degrades. Solutions exist (provenance signals, watermarking, source weighting) but adoption is patchy.
  4. The contractor problem is real and embarrassing, but it’s a known threat the buyers actively defend against.
  5. The o3 hallucination regression is real for that specific model on that specific benchmark. It is not evidence of a field-wide collapse — different labs are on different trajectories.
  6. The bear case is plausible on economics, weaker on technical capability. The two arguments should be evaluated separately.

The most honest summary: the easy gains from scraping the open web are mostly extracted, and the field is being forced into harder regimes (verified synthetic data, RL, post-training compute, multimodal). That’s not decline — that’s a regime change with real risks if executed badly.


Reading list (prioritized)

  1. Shumailov et al. — AI Models Collapse When Trained on Recursively Generated Data, Nature 2024. https://www.nature.com/articles/s41586-024-07566-y
  2. Gerstgrasser et al. — Is Model Collapse Inevitable? Breaking the Curse of Recursion by Accumulating Real and Synthetic Data, 2024. (Search arXiv — the crucial follow-up showing accumulation prevents collapse.)
  3. Villalobos et al. — Will We Run Out of Data? Limits of LLM Scaling Based on Human-Generated Data, Epoch AI / ICML 2024. https://arxiv.org/abs/2211.04325
  4. Trinh et al. — Solving Olympiad Geometry Without Human Demonstrations (AlphaGeometry), Nature 2024. (Existence proof that synthetic data + verifier works at the frontier.)
  5. Veselovsky et al. — Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use LLMs for Text Production Tasks, 2023. (The MTurk contamination study.)
  6. OpenAI — o3 and o4-mini System Card, April 2025. (PersonQA hallucination numbers come from here.)
  7. Duke University Library — Generative AI Student Survey, January 2025. (Upstream signal that “human-written” student text is increasingly co-written with LLMs.)
  8. Gary Marcus — How an AI feedback loop threatens to break ChatGPT. https://garymarcus.substack.com/
  9. Ed Zitron — The Truth About the AI Bubble & The Software Decline. https://www.wheresyoured.at/