{"id":1458,"date":"2026-05-11T13:27:00","date_gmt":"2026-05-11T13:27:00","guid":{"rendered":"https:\/\/cms.research.wpp.com\/?p=1458"},"modified":"2026-05-11T13:52:16","modified_gmt":"2026-05-11T13:52:16","slug":"autoresearch","status":"publish","type":"post","link":"https:\/\/cms.research.wpp.com\/?p=1458","title":{"rendered":"Autoresearch: a closer look at the agent that runs its own experiments"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"576\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/autoresearch_2-edited.png\" alt=\"\" class=\"wp-image-1460\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/autoresearch_2-edited.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/autoresearch_2-edited-300x169.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/autoresearch_2-edited-768x432.png 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Autoresearch<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\">Introduction<\/h2>\n\n\n\n<p class=\"has-medium-font-size\"><a href=\"https:\/\/github.com\/karpathy\/autoresearch\">Autoresearch<\/a> is an agentic setup &#8211; a system that hands an AI agent the keys and lets it work on its own. You give the agent a 1-page file with instructions that describe:<\/p>\n\n\n\n<ul class=\"wp-block-list has-medium-font-size\">\n<li>what you want it to improve<\/li>\n\n\n\n<li>what counts as success, and<\/li>\n\n\n\n<li>what it&#8217;s allowed to change.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">From there, the agent takes over. 
In our case, it started editing the training code, running it, inspecting the results, deciding what to try next, editing again &#8211; and looping like that on its own until it hit the budget.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">The autoresearch repo itself showcases this setup by pointing the agent at training <a href=\"https:\/\/github.com\/karpathy\/nanochat\">Nanochat<\/a> &#8211; a small but real language model that covers all the usual stages of building an LLM, including tokenization, pretraining, finetuning, evaluation and inference. The objective is the following: achieve GPT-2 level capabilities in as little time as possible, measured by a metric called <em>Time-to-GPT-2<\/em> on a standard benchmark called DCLM CORE.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">This task is far from trivial &#8211; there&#8217;s even a public leaderboard tracking who can do it fastest, and at this point Autoresearch has beaten the engineers who built Nanochat itself!<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Autoresearch isn&#8217;t limited to LLMs or traditional ML either. The setup is intentionally generic &#8211; point it at a forecasting model, a recommendation engine, a media plan, or an operational workflow, and it works the same way.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">If you can measure success, the agent can optimize for it.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Most agents just use AI models as building blocks. Autoresearch promises to build and improve them. We had to try it ourselves.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Here&#8217;s what we found.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Setup<\/h2>\n\n\n\n<p class=\"has-medium-font-size\">Setup is genuinely easy. 
We cloned the repo, followed the README, picked Claude Sonnet to power our agent and kicked off an open-ended experimentation loop on the Nanochat model to run overnight.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">How it works<\/h2>\n\n\n\n<p class=\"has-medium-font-size\">Every few minutes, the agent runs a quick experiment: it changes one thing about the training process and checks if the model got any better. If it got better, it keeps the change and builds on top of it. If it got worse, it throws the change away and goes back to the best version so far. It just keeps looping like this on its own, slowly nudging the model in a better direction.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Results<\/h2>\n\n\n\n<p class=\"has-medium-font-size\">Overnight, the agent ran 70 experiments on its own and improved the Time-to-GPT-2 metric by <strong>11.26%.<\/strong> The whole run finished by morning and cost about $60 in API calls.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">The agent didn\u2019t just tune the dials of the training process; it also made small architectural changes, explaining the reasoning behind each one along the way. 
You can push it further too: ask it to do deep research before experimenting, or cite papers to back up its choices.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Session metric<\/strong><\/th><th><strong>Value<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Total cost<\/td><td>$61.69<\/td><\/tr><tr><td>Total duration (API)<\/td><td>2h 18m 3s<\/td><\/tr><tr><td>Total duration (wall)<\/td><td>18h 32m 45s<\/td><\/tr><tr><td>Total code changes<\/td><td>2,335 lines added, 73 lines removed<\/td><\/tr><tr><td>Model<\/td><td>claude-sonnet-4-5<\/td><\/tr><tr><td>Tokens (input \/ output)<\/td><td>38.7k \/ 275.2k<\/td><\/tr><tr><td>Cache (read \/ write)<\/td><td>65.8m \/ 10.1m<\/td><\/tr><\/tbody><\/table><figcaption class=\"wp-element-caption\">Table 1: Key session metrics for the overnight agentic experimentation run.<\/figcaption><\/figure>\n\n\n\n<h4 class=\"wp-block-heading has-accent-4-color has-text-color has-link-color wp-elements-eb27800a7d705be45e075483c13196f8\">What worked<\/h4>\n\n\n\n<p class=\"has-medium-font-size\">Autoresearch\u2019s report felt like an actual researcher had worked on our model all night and left a thorough write-up for us to review.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">The initial logs made us skeptical. They were mostly standard, \u201cold-fashioned\u201d parameter tweaks. However, a few runs in, we started getting real architectural changes, each one paired with supporting evidence such as published work and prior implementations.<\/p>\n\n\n\n<h4 class=\"wp-block-heading has-accent-4-color has-text-color has-link-color wp-elements-8ac49af6fb57ca023eb9ad9c6b6f4399\">What to watch out for<\/h4>\n\n\n\n<p class=\"has-medium-font-size\">Cost is the obvious one. 
The agent is constantly calling an LLM, so the bill scales with whatever model you&#8217;ve plugged in &#8211; anywhere from free with open-source models to several thousand dollars overnight if you go with a frontier one.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">If that puts you off, there&#8217;s a <a href=\"https:\/\/github.com\/burtenshaw\/multiautoresearch\">free, multi-agent variant<\/a> that takes a different approach: rather than throwing one expensive model at the problem, it has several cheaper ones collaborate and tries to get most of the way there.<\/p>\n\n\n\n<h4 class=\"wp-block-heading has-accent-4-color has-text-color has-link-color wp-elements-78841bf0fe519ae3a2b70e6e3963f32b\">What&#8217;s reusable<\/h4>\n\n\n\n<p class=\"has-medium-font-size\">Autoresearch isn&#8217;t really a system. It&#8217;s a pattern: a short instruction file, a success metric, and an improvement loop. Anything that fits that shape is fair game. The same setup that tuned a language model overnight could be used to optimize a forecasting model scored against held-out data, an LLM prompt scored against an eval set, a trading strategy backtested on historical prices, or a piece of code scored by its test suite.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Bottom Line and WPP Applications<\/h2>\n\n\n\n<p class=\"has-medium-font-size\">Autoresearch is a small setup that punches well above its weight. It&#8217;s a low-lift thing to try, and it actually delivers &#8211; we&#8217;re keen to throw it at more problems and see how far it goes.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">This kind of agentic workflow is very familiar to our team at WPP Research. 
Earlier this year, our <a href=\"https:\/\/research.wpp.com\/pods\/alphaevolve-pod\">AlphaEvolve pod<\/a> and <a href=\"https:\/\/research.wpp.com\/pods\/self-improving-performance-agent-pod\">Self-Improving Performance Agent pod<\/a> explored very similar learning patterns:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">With AlphaEvolve, we took Google DeepMind&#8217;s framework &#8211; <em>AlphaEvolve<\/em>, which uses Gemini to propose and evolve model architectures by itself &#8211; and turned it loose on actual campaign problems. We saw up to <strong>10% gains in prediction accuracy<\/strong> and <strong>7% in recommendation scores<\/strong> over our baselines, and got there much faster than usual. Details in the <a href=\"https:\/\/research.wpp.com\/pods\/alphaevolve-pod\">technical<\/a> and <a href=\"https:\/\/research.wpp.com\/blog\/cracking-the-code-of-campaign-success-with-googles-alphaevolve-agent\">executive<\/a> write-ups.<\/li>\n\n\n\n<li class=\"has-medium-font-size\">With the Self-Improving Performance Agent, we built our own Prediction Optimization Agent &#8211; a system that turns an influencer post into a plain-language description, predicts how it&#8217;ll perform, and then rewrites its own instructions to get sharper over time. Across a dataset of over <strong>10 million Instagram posts<\/strong>, a fine-tuned DistilBERT predictor reached an <strong>R\u00b2 of 0.80<\/strong>, and the optimization loop kept landing on richer, more predictive descriptions with each round. 
Details in the <a href=\"https:\/\/research.wpp.com\/pods\/self-improving-performance-agent-pod\">technical doc<\/a> and the <a href=\"https:\/\/research.wpp.com\/blog\/a-self-improving-ai-agent-for-optimizing-and-explaining-media-performance\">executive summary<\/a>.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">This is the kind of work we love: agentic setups that quietly do the hard part, so we can build better systems, faster.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Autoresearch is an agentic setup that lets an agent run its own experiments: tweaking, testing, and improving any model with no human in the loop. To see whether it actually works, we pointed it at training a small language model (Nanochat). Overnight, it ran 70 experiments and lifted our target metric by 11.26%, for just $62 in API costs. The setup is minimal, and the pattern generalizes well beyond machine learning, to anything with a measurable outcome. Low effort, real results.<\/p>\n","protected":false},"author":16,"featured_media":1459,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"tags":[],"ppma_author":[{"id":16,"display_name":"Andreas Stavrou","first_name":"Andreas","last_name":"Stavrou","nickname":"andreas.stavrou","user_nicename":"andreas-stavrou","user_email":"andreas.stavrou@satalia.com","biographical_info":"Andreas is a Senior Data Scientist at Satalia and part of the WPP Research team. 
With over a decade of hands-on experience, he has delivered machine learning solutions across retail, online gambling, and credit risk, and now builds AI systems at scale in the global advertising industry.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/avatar.jpg","job_title":"Senior Data Scientist","is_lead":null,"display_as_researcher":null,"order_priority":null}],"class_list":["post-1458","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry"],"acf":{"related_pods":[1362],"featured":false},"authors":[{"term_id":22,"user_id":16,"is_guest":0,"slug":"andreas-stavrou","display_name":"Andreas Stavrou","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/avatar.jpg","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":"","biographical_info":"Andreas is a Senior Data Scientist at Satalia and part of the WPP Research team. With over a decade of hands-on experience, he has delivered machine learning solutions across retail, online gambling, and credit risk, and now builds AI systems at scale in the global advertising 
industry."}],"featured_image_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/autoresearch_2.png","featured_image_sizes":{"thumbnail":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/autoresearch_2-150x150.png","medium":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/autoresearch_2-300x300.png","large":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/autoresearch_2.png","full":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/autoresearch_2.png"},"_links":{"self":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/1458","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/users\/16"}],"replies":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1458"}],"version-history":[{"count":14,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/1458\/revisions"}],"predecessor-version":[{"id":1478,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/1458\/revisions\/1478"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/media\/1459"}],"wp:attachment":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1458"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1458"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=1458"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}