Mar 19, 2026

AI can’t replicate research papers (and neither can most humans)

OpenAI released PaperBench in April 2025. The benchmark measures how well AI agents write code and replicate ML papers from ICML.

Claude 3.5 Sonnet scored 21% on the full benchmark (o1 hit 43.4% on the code-only variant). Human ML PhDs, given 48 hours, scored 41%. For ML PhDs, that sucks.

It’s 2026 and we have Opus 4.6, DeepSeek R1, and o3. Here’s why reproducibility still sucks. (There’s technically a distinction between replicability and reproducibility - here I use the two interchangeably.)

What’s changed?

Opus 4.6 (Feb 2026) got 80.8% on SWE-bench Verified.

DeepSeek R1 is open-source, runs locally (I use the 8B model every day), and competes with o1 on reasoning.

What hasn’t changed?

SWE-bench tests whether models can fix GitHub issues. PaperBench hands a model a paper as a PDF and tests whether the model can replicate it.

Loosely, that’s:

  1. Read an ML paper (PDF)
  2. Understand contributions
  3. Write a codebase from scratch
  4. Run experiments
  5. Match original results

The model has to read the PDF, make sense of paragraphs of text, figures, context, and datasets. All good. It also has to actually obtain the datasets, rerun the original experiments, and match the reported numbers.

Replicating papers is frustrating. Papers say things like “tuned hyperparameters”, “adjusted learning rates”, “standard initialization”. Your guess is as good as mine.

Then there’s the issue of missing code and datasets. Undocumented repos, hardcoded paths, missing dependencies. Some say they scraped Twitter and the data isn’t available anymore. Missing seeds. They trained on 8x A100s; you have a 1080 Ti.
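Missing seeds are one of the few gaps you can close from your own side. A minimal sketch, assuming NumPy (the `seed_everything` name is my own; PyTorch users would add `torch.manual_seed` on top):

```python
import os
import random

import numpy as np


def seed_everything(seed: int = 42) -> None:
    """Pin the RNGs we control so reruns draw identical numbers."""
    random.seed(seed)     # Python's built-in RNG
    np.random.seed(seed)  # NumPy's global RNG
    # Recorded for subprocesses; to affect *this* process it must be
    # set before the interpreter starts.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # With PyTorch you'd also call torch.manual_seed(seed)
    # and torch.cuda.manual_seed_all(seed).


seed_everything(42)
a = np.random.rand(3)
seed_everything(42)
b = np.random.rand(3)
assert (a == b).all()  # same seed, same draws
```

It won’t fix nondeterministic GPU kernels, but it removes the cheapest source of “my numbers don’t match yours”.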

Jupyter notebooks make it worse

I use Jupyter notebooks for prototyping. I test fast and make sure it works, then refactor: make functions take args, add tests, and keep the code in .py files.

Notebooks are terrible for reproducibility.

You can run cells out of order. Variables persist between cells (unless you restart the kernel). Notebooks are JSON under the hood, so you need separate tools just to diff the code.
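Because a notebook is just JSON, you can at least inspect it with the standard library. A sketch (the function name is mine, not a standard tool) that flags cells whose execution counts go backwards - a telltale sign the notebook was run out of order:

```python
import json


def out_of_order_cells(nb: dict) -> list:
    """Indices of code cells executed out of top-to-bottom order."""
    counts = [
        (i, c["execution_count"])
        for i, c in enumerate(nb.get("cells", []))
        if c.get("cell_type") == "code" and c.get("execution_count") is not None
    ]
    # Flag any cell whose execution count is lower than the one above it.
    return [i for (i, n), (_, prev) in zip(counts[1:], counts[:-1]) if n < prev]


# In real use: nb = json.load(open("analysis.ipynb"))
nb = json.loads("""
{"cells": [
  {"cell_type": "code", "execution_count": 2, "source": "x = 1"},
  {"cell_type": "code", "execution_count": 1, "source": "print(x)"},
  {"cell_type": "code", "execution_count": 3, "source": "x + 1"}
]}
""")
assert out_of_order_cells(nb) == [1]  # the middle cell ran before the first
```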

Journals that care about replication

ReScience C

I think it’s a great platform. You can’t publish original work there - you have to replicate someone else’s. Free, open-source, open-access.

They don’t issue a DOI themselves (you get one from Zenodo), and they accept submissions as long as the replication is your own work. If you can’t replicate a paper, they’ll publish that too.

Most journals need novelty to add to their catalogue. But most research effort is messy. Experiments fail. I think failed experiments can teach us just as much as the ground-breaking ones.

Papers with Code

It’s a platform that links papers to code. You publish, you attach your code. People try it. If it works, great. If it doesn’t, people tell you.

Reflections

DeepSeek showed you can distill a 671B model’s reasoning into an 8B model. It’s not the greatest model, but it runs locally, it’s fast enough for n8n or openclaw, and you don’t pay anybody or run out of credits.

Having open weights is great.

Without code and data, your research is just a story. Might as well call it fiction. A reproduce.sh that regenerates your results would fix most of these issues.
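The script itself can be dumb. A hypothetical sketch of the final comparison step, after the pipeline has rerun training and evaluation (metric names and numbers are invented for illustration):

```python
def matches_reported(results, reported, tol=0.01):
    """True if every reported metric is reproduced within +/- tol."""
    return all(abs(results[k] - reported[k]) <= tol for k in reported)


# Invented numbers, for illustration only.
reported = {"accuracy": 0.912, "f1": 0.887}  # from the paper's results table
results = {"accuracy": 0.909, "f1": 0.883}   # from our rerun
assert matches_reported(results, reported)
assert not matches_reported({"accuracy": 0.80, "f1": 0.883}, reported)
```

The point isn’t the tolerance check - it’s that “did we reproduce the paper?” becomes a yes/no exit code instead of a judgment call.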

Why shouldn’t replicating someone’s work count as a publication?

My workflow

Honestly, I don’t always write tests while I’m running experiments; I care about that later. If it works, I refactor: extract functions into .py files, add arg parsing, write a train script.

Python files are way easier to track, diff, and refactor. I use notebooks to load trained models, make plots, and share results.
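The refactor is mostly mechanical. A minimal sketch of the arg-parsing half of a train script (flag names here are my own defaults, not a standard):

```python
import argparse


def parse_args(argv=None):
    p = argparse.ArgumentParser(description="Training entry point")
    p.add_argument("--lr", type=float, default=3e-4, help="learning rate")
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--data-dir", default="data/")
    return p.parse_args(argv)


# Passing argv explicitly keeps this testable outside the command line.
args = parse_args(["--lr", "0.001", "--epochs", "5"])
assert args.lr == 0.001 and args.epochs == 5
assert args.seed == 42  # unspecified flags fall back to defaults
```

Once the hyperparameters live in flags instead of notebook cells, every run is a one-line shell command you can paste into a README.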

I’ve been using wandb for tracking runs. Takes a minute to set up.

Once humans can replicate papers, I’m sure models will catch up.

Footnotes:

  1. PaperBench is open source: https://github.com/openai/preparedness/tree/main/project/paperbench. Released April 2025, still the main benchmark for research replication.

  2. ReScience C is at https://rescience.github.io/. They’ve published replication studies since 2015. They accept negative results, which most journals don’t.

  3. Papers with Code runs reproducibility challenges: https://paperswithcode.com/. They link papers to code. If your paper isn’t there, add it.

  4. DeepSeek R1 was released January 2025, open-sourced under MIT license. You can run the 8B distilled version on a laptop with 16GB RAM. The full 671B param model with MoE architecture activates only 37B params per forward pass, making it cheaper to run than you’d think.

  5. Claude Opus 4.6 (Feb 2026) introduced 1M token context window, Agent Teams (parallel multi-agent work), and extended thinking. Scores 80.8% on SWE-bench Verified but that’s coding, not research replication. Different skills.

  6. The 41% human score in PaperBench is for ML PhDs given 48 hours. Not random students. Not 24 hours. Experts with time still only hit 41%.
