Mar 19, 2026

AI can’t replicate research papers (and neither can most humans)

OpenAI released PaperBench in April 2025. The benchmark measures how well AI agents write code and replicate ML papers from ICML.

Claude 3.5 Sonnet scored 21% on the full benchmark (o1 hit 43.4% on the code-only variant). Human ML PhDs, given 48 hours, scored 41%. For ML PhDs, that sucks.

It’s 2026 and we have Opus 4.6, DeepSeek R1, and o3. Here’s why reproducibility still sucks. (There’s technically a distinction between replicability and reproducibility - here I use the two interchangeably.)

What’s changed?

Opus 4.6 (Feb 2026) got 80.8% on SWE-bench Verified.

DeepSeek R1 is open-source, runs locally (I use the 8B model every day), and competes with o1 on reasoning.

What hasn’t changed?

SWE-bench tests whether models can fix GitHub issues. PaperBench hands a model a paper as a PDF and tests whether the model can replicate it.

Loosely, that’s:

  1. Read an ML paper (PDF)
  2. Understand contributions
  3. Write a codebase from scratch
  4. Run experiments
  5. Match original results

The model has to read the PDF, make sense of paragraphs of text, figures, context, and datasets. All good. It also has to actually obtain the datasets, rerun the original experiments, and match the reported numbers.

Replicating papers is frustrating. Papers say things like “tuned hyperparameters”, “adjusted learning rates”, “standard initialization”. Your guess is as good as mine.

Then there’s the issue of missing code and datasets. Undocumented repos, hardcoded paths, missing dependencies. Some say they scraped Twitter and the data isn’t available anymore. Missing seeds. They trained on 8x A100s; you have a 1080 Ti.
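Missing seeds are one of the few gaps you can close from your own side. A minimal sketch, assuming NumPy (the `seed_everything` name is my own; PyTorch users would add `torch.manual_seed` on top):

```python
import os
import random

import numpy as np


def seed_everything(seed: int = 42) -> None:
    """Pin the RNGs we control so reruns draw identical numbers."""
    random.seed(seed)     # Python's built-in RNG
    np.random.seed(seed)  # NumPy's global RNG
    # Recorded for subprocesses; to affect *this* process it must be
    # set before the interpreter starts.
    os.environ["PYTHONHASHSEED"] = str(seed)
    # With PyTorch you'd also call torch.manual_seed(seed)
    # and torch.cuda.manual_seed_all(seed).


seed_everything(42)
a = np.random.rand(3)
seed_everything(42)
b = np.random.rand(3)
assert (a == b).all()  # same seed, same draws
```

It won’t fix nondeterministic GPU kernels, but it removes the cheapest source of “my numbers don’t match yours”.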

Jupyter notebooks make it worse

I use Jupyter notebooks for prototyping. I test fast and make sure it works, then refactor: make functions take args, add tests, and keep the code in .py files.

Notebooks are terrible for reproducibility.

You can run cells out of order. Variables persist between cells (unless you restart the kernel). Notebooks are JSON under the hood, so you need separate tools just to diff the code.
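Because a notebook is just JSON, you can at least inspect it with the standard library. A sketch (the function name is mine, not a standard tool) that flags cells whose execution counts go backwards - a telltale sign the notebook was run out of order:

```python
import json


def out_of_order_cells(nb: dict) -> list:
    """Indices of code cells executed out of top-to-bottom order."""
    counts = [
        (i, c["execution_count"])
        for i, c in enumerate(nb.get("cells", []))
        if c.get("cell_type") == "code" and c.get("execution_count") is not None
    ]
    # Flag any cell whose execution count is lower than the one above it.
    return [i for (i, n), (_, prev) in zip(counts[1:], counts[:-1]) if n < prev]


# In real use: nb = json.load(open("analysis.ipynb"))
nb = json.loads("""
{"cells": [
  {"cell_type": "code", "execution_count": 2, "source": "x = 1"},
  {"cell_type": "code", "execution_count": 1, "source": "print(x)"},
  {"cell_type": "code", "execution_count": 3, "source": "x + 1"}
]}
""")
assert out_of_order_cells(nb) == [1]  # the middle cell ran before the first
```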

Journals that care about replication

ReScience C

I think it’s a great platform. You can’t publish original work there - you have to replicate someone else’s. Free, open-source, open-access.

They don’t issue a DOI themselves (you get one from Zenodo), and they accept submissions as long as the replication is your own work. If you can’t replicate a paper, they’ll publish that too.

Most journals need novelty to add to their catalogue. But most research effort is messy. Experiments fail. I think failed experiments can teach us just as much as the ground-breaking ones.

Papers with Code

It’s a platform that links papers to code. You publish, you attach your code. People try it. If it works, great. If it doesn’t, people tell you.

Reflections

DeepSeek showed you can distill a 671B model’s reasoning into an 8B model. It’s not the greatest model, but it runs locally, it’s fast enough for n8n or openclaw, and you don’t pay anybody or run out of credits.

Having open weights is great.

Without code and data, your research is just a story. Might as well call it fiction. A reproduce.sh that regenerates your results would fix most of these issues.
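The script itself can be dumb. A hypothetical sketch of the final comparison step, after the pipeline has rerun training and evaluation (metric names and numbers are invented for illustration):

```python
def matches_reported(results, reported, tol=0.01):
    """True if every reported metric is reproduced within +/- tol."""
    return all(abs(results[k] - reported[k]) <= tol for k in reported)


# Invented numbers, for illustration only.
reported = {"accuracy": 0.912, "f1": 0.887}  # from the paper's results table
results = {"accuracy": 0.909, "f1": 0.883}   # from our rerun
assert matches_reported(results, reported)
assert not matches_reported({"accuracy": 0.80, "f1": 0.883}, reported)
```

The point isn’t the tolerance check - it’s that “did we reproduce the paper?” becomes a yes/no exit code instead of a judgment call.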

Why shouldn’t replicating someone’s work count as a publication?

My workflow

Honestly, I don’t always write tests while I’m running experiments; I care about that later. If it works, I refactor: extract functions into .py files, add arg parsing, write a train script.

Python files are way easier to track, diff, and refactor. I use notebooks to load trained models, make plots, and share results.
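The refactor is mostly mechanical. A minimal sketch of the arg-parsing half of a train script (flag names here are my own defaults, not a standard):

```python
import argparse


def parse_args(argv=None):
    p = argparse.ArgumentParser(description="Training entry point")
    p.add_argument("--lr", type=float, default=3e-4, help="learning rate")
    p.add_argument("--epochs", type=int, default=10)
    p.add_argument("--seed", type=int, default=42)
    p.add_argument("--data-dir", default="data/")
    return p.parse_args(argv)


# Passing argv explicitly keeps this testable outside the command line.
args = parse_args(["--lr", "0.001", "--epochs", "5"])
assert args.lr == 0.001 and args.epochs == 5
assert args.seed == 42  # unspecified flags fall back to defaults
```

Once the hyperparameters live in flags instead of notebook cells, every run is a one-line shell command you can paste into a README.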

I’ve been using wandb for tracking runs. Takes a minute to set up.

Once humans can replicate papers, I’m sure models will catch up.

Footnotes:

  1. PaperBench is open source: https://github.com/openai/preparedness/tree/main/project/paperbench. Released April 2025, still the main benchmark for research replication.

  2. ReScience C is at https://rescience.github.io/. They’ve published replication studies since 2015. They accept negative results, which most journals don’t.

  3. Papers with Code runs reproducibility challenges: https://paperswithcode.com/. They link papers to code. If your paper isn’t there, add it.

  4. DeepSeek R1 was released January 2025, open-sourced under MIT license. You can run the 8B distilled version on a laptop with 16GB RAM. The full 671B param model with MoE architecture activates only 37B params per forward pass, making it cheaper to run than you’d think.

  5. Claude Opus 4.6 (Feb 2026) introduced 1M token context window, Agent Teams (parallel multi-agent work), and extended thinking. Scores 80.8% on SWE-bench Verified but that’s coding, not research replication. Different skills.

  6. The 41% human score in PaperBench is for ML PhDs given 48 hours. Not random students. Not 24 hours. Experts with time still only hit 41%.
