Autoresearch: a closer look at the agent that runs its own experiments

Introduction

Autoresearch is an agentic setup – a system that hands an AI agent the keys and lets it work on its own. You give the agent a one-page instruction file that describes:

what you want it to improve
what counts as success, and
what it’s allowed to change.

From there, the agent takes over. In our case, it started editing the training code, running it, inspecting the results, deciding what to try next, editing again – and looping like that on its own until it hit the budget.

The autoresearch repo itself showcases this setup by pointing the agent at training Nanochat – a small but real language model that covers all the usual stages of building an LLM, including tokenization, pretraining, finetuning, evaluation and inference. The objective: reach GPT-2-level capability in as little time as possible. Progress is tracked with a metric called Time-to-GPT-2, evaluated on a standard benchmark called DCLM CORE.

This task is far from trivial – there’s even a public leaderboard tracking who can do it fastest, and at this point Autoresearch has beaten the engineers who built Nanochat itself!

Autoresearch isn’t limited to LLMs or traditional ML either. The setup is intentionally generic – point it at a forecasting model, a recommendation engine, a media plan, or an operational workflow, and it works the same way.

If you can measure success, the agent can optimize for it.

Most agents just use AI models as building blocks. Autoresearch promises to build and improve them. We had to try it ourselves.

Here’s what we found.

Setup

Setup is genuinely easy. We cloned the repo, followed the README, picked Claude Sonnet to power our agent and kicked off an open-ended experimentation loop on the Nanochat model to run overnight.

How it works

Every few minutes, the agent runs a quick experiment: it changes one thing about the training process and checks if the model got any better. If it got better, it keeps the change and builds on top of it. If it got worse, it throws the change away and goes back to the best version so far. It just keeps looping like this on its own, slowly nudging the model in a better direction.
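
To make the keep-or-revert logic concrete, here is a minimal sketch of that loop. It is not the autoresearch code: `run_loop`, `toy_propose` and `toy_evaluate` are hypothetical names, a random nudge to one setting stands in for the LLM editing the training code, and a toy formula stands in for a real training run. We also assume a lower-is-better metric, as you would expect from a time-to-target measure.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]


def run_loop(
    baseline: Config,
    propose: Callable[[Config, random.Random], Config],
    evaluate: Callable[[Config], float],
    budget: int,
    seed: int = 0,
) -> Tuple[Config, float, List[Tuple[int, float, str]]]:
    """Keep a change only if it lowers the metric; otherwise fall back to the best so far."""
    rng = random.Random(seed)
    best, best_score = dict(baseline), evaluate(baseline)
    log: List[Tuple[int, float, str]] = []
    for step in range(1, budget + 1):
        candidate = propose(best, rng)   # change one thing, starting from the best version
        score = evaluate(candidate)      # run the experiment
        if score < best_score:           # better: keep it and build on top of it
            best, best_score = candidate, score
            log.append((step, score, "keep"))
        else:                            # not better: discard, the best version stays as is
            log.append((step, score, "discard"))
    return best, best_score, log


def toy_evaluate(cfg: Config) -> float:
    # Hypothetical stand-in for a training run: lower is better, optimum at lr=0.3, warmup=0.1.
    return (cfg["lr"] - 0.3) ** 2 + (cfg["warmup"] - 0.1) ** 2


def toy_propose(cfg: Config, rng: random.Random) -> Config:
    # Hypothetical stand-in for the agent's edit: nudge exactly one setting.
    new = dict(cfg)
    key = rng.choice(sorted(new))
    new[key] += rng.uniform(-0.1, 0.1)
    return new


start = {"lr": 1.0, "warmup": 0.5}
best, best_score, log = run_loop(start, toy_propose, toy_evaluate, budget=70)
kept = sum(1 for _, _, status in log if status == "keep")
print(f"baseline={toy_evaluate(start):.4f} best={best_score:.4f} kept={kept}/{len(log)}")
```

Because every proposal starts from the best version so far, a discarded experiment costs time but never makes the result worse. The trade-off is that a greedy loop like this can stall once single changes stop helping.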

Results

Overnight, the agent ran 70 experiments on its own and improved the Time-to-GPT-2 metric by 11.26%. The whole run finished by morning and cost about $60 in API calls.

The agent didn’t just tune the dials of the training process; it also made small architectural changes, explaining the reasoning behind each one along the way. You can push it further too: ask it to do deep research before experimenting, or cite papers to back up its choices.

Session metric	Value
Total cost	$61.69
Total duration (API)	2h 18m 3s
Total duration (wall)	18h 32m 45s
Total code changes	2,335 lines added, 73 lines removed
Model	claude-sonnet-4-5
Tokens (input / output)	38.7k / 275.2k
Cache tokens (read / write)	65.8M / 10.1M

Table 1: Key session metrics for the overnight agentic experimentation run.
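
For a rough sense of scale, the figures in Table 1 can be turned into per-experiment numbers. This is back-of-the-envelope arithmetic only, and it assumes the session totals in the table cover exactly the 70 experiments mentioned above; the variable names are ours.

```python
from datetime import timedelta

# Figures copied from the Results section and Table 1.
experiments = 70
total_cost_usd = 61.69
wall = timedelta(hours=18, minutes=32, seconds=45)
api = timedelta(hours=2, minutes=18, seconds=3)

cost_per_experiment = total_cost_usd / experiments
wall_minutes_per_experiment = wall.total_seconds() / 60 / experiments
api_share = api / wall  # dividing two timedeltas gives a plain float

print(f"cost per experiment: ${cost_per_experiment:.2f}")
print(f"wall-clock per experiment: {wall_minutes_per_experiment:.1f} min")
print(f"share of wall-clock spent in API calls: {api_share:.1%}")
```

Under that assumption, this works out to roughly $0.88 and about 16 minutes of wall-clock time per experiment, with around 12% of the wall-clock time spent waiting on the API. The wall-clock figure may include idle time, so treat it as an upper bound on how long a single experiment takes.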

What worked

Autoresearch’s report felt like an actual researcher had worked on our model all night and left a thorough write-up for us to review.

The initial logs made us skeptical. They were mostly standard, old-fashioned parameter tweaks. However, a few runs in, we started getting real architectural changes, each one paired with supporting evidence such as published papers and prior implementations.

What to watch out for

Cost is the obvious one. The agent is constantly calling an LLM, so the bill scales with whatever model you’ve plugged in – anywhere from free with open-source models to several thousand dollars overnight if you go with a frontier one.

If that puts you off, there’s a free, multi-agent variant that takes a different approach: rather than throwing one expensive model at the problem, it has several cheaper ones collaborate and tries to get most of the way there.

What’s reusable

Autoresearch isn’t really a system. It’s a pattern: a short instruction file, a success metric, an improvement loop. Anything that fits that shape is fair game. The same setup that tuned a language model overnight could be used to optimize a forecasting model scored against held-out data, an LLM prompt scored against an eval set, a trading strategy backtested on historical prices, or a piece of code scored by its test suite.
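
One way to picture that shape in code is as a small task description that any of those problems could fill in. The sketch below is illustrative only: `TaskSpec`, its fields and `toy_prompt_score` are hypothetical names of ours, not part of the autoresearch repo, and the scoring function is a toy stand-in for a real eval set.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Artifacts = Dict[str, str]  # file name -> file contents


@dataclass
class TaskSpec:
    goal: str                            # what you want improved
    metric_name: str                     # what counts as success
    higher_is_better: bool
    editable: List[str]                  # what the agent is allowed to change
    score: Callable[[Artifacts], float]  # how a candidate is measured

    def is_improvement(self, new: float, best: float) -> bool:
        return new > best if self.higher_is_better else new < best

    def allows(self, changed_files: List[str]) -> bool:
        # A change is allowed only if it touches nothing outside the editable list.
        return all(name in self.editable for name in changed_files)


def toy_prompt_score(artifacts: Artifacts) -> float:
    # Toy stand-in for an eval set: fraction of wanted behaviours the prompt mentions.
    prompt = artifacts["prompt.txt"].lower()
    wanted = ["json", "cite", "concise"]
    return sum(word in prompt for word in wanted) / len(wanted)


spec = TaskSpec(
    goal="Improve the answer-formatting prompt",
    metric_name="eval pass rate",
    higher_is_better=True,
    editable=["prompt.txt"],
    score=toy_prompt_score,
)

baseline = {"prompt.txt": "Answer the question."}
candidate = {"prompt.txt": "Answer in JSON and be concise."}

improved = spec.is_improvement(spec.score(candidate), spec.score(baseline))
print(f"{spec.metric_name}: {spec.score(baseline):.2f} -> {spec.score(candidate):.2f}, improved={improved}")
print("edit to eval_set.json allowed:", spec.allows(["eval_set.json"]))
```

Swapping in a forecasting model or a test suite would only mean changing `editable` and `score`; the improvement loop around it would not need to know the difference.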

Bottom Line and WPP Applications

Autoresearch is a small setup that punches well above its weight. It’s a low-lift thing to try, and it actually delivers – we’re keen to throw it at more problems and see how far it goes.

This kind of agentic workflow is very familiar to our team at WPP Research. Earlier this year, our AlphaEvolve pod and Self-Improving Performance Agent pod explored very similar learning patterns:

With AlphaEvolve, we took Google DeepMind’s AlphaEvolve framework, which uses Gemini to propose and evolve model architectures on its own, and turned it loose on actual campaign problems. We saw up to 10% gains in prediction accuracy and 7% in recommendation scores over our baselines, and got there much faster than usual. Details in the technical and executive write-ups.
With the Self-Improving Performance Agent, we built our own Prediction Optimization Agent – a system that turns an influencer post into a plain-language description, predicts how it’ll perform, and then rewrites its own instructions to get sharper over time. Across a dataset of over 10 million Instagram posts, a fine-tuned DistilBERT predictor reached an R² of 0.80, and the optimization loop kept landing on richer, more predictive descriptions with each round. Details in the technical doc and the executive summary.

This is the kind of work we love: agentic setups that quietly do the hard part, so we can build better systems, faster.

Author

Andreas Stavrou

Andreas is a Senior Data Scientist at Satalia and part of the WPP Research team. With over a decade of hands-on experience, he has delivered machine learning solutions across retail, online gambling, and credit risk, and now builds AI systems at scale in the global advertising industry.
