{"id":1584,"date":"2026-06-10T16:07:38","date_gmt":"2026-06-10T16:07:38","guid":{"rendered":"https:\/\/cms.research.wpp.com\/?p=1584"},"modified":"2026-06-11T09:43:55","modified_gmt":"2026-06-11T09:43:55","slug":"learning-through-experience-teaching-the-viral-hermes-agent-to-automate-our-work","status":"publish","type":"post","link":"https:\/\/cms.research.wpp.com\/?p=1584","title":{"rendered":"Learning through experience: teaching the viral Hermes agent to automate our work"},"content":{"rendered":"\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"561\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-10-at-7.06.52-PM-1024x561.png\" alt=\"\" class=\"wp-image-1591\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-10-at-7.06.52-PM-1024x561.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-10-at-7.06.52-PM-300x165.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-10-at-7.06.52-PM-768x421.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-10-at-7.06.52-PM.png 1262w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Hermes<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>What is Hermes?<\/strong><\/h2>\n\n\n\n<p class=\"has-medium-font-size\"><a href=\"https:\/\/github.com\/NousResearch\/hermes-agent\"><strong>Hermes Agent<\/strong><\/a> is an open-source, self-hosted AI agent released under an MIT license by <a href=\"https:\/\/nousresearch.com\">Nous Research<\/a>, the lab behind the Hermes model family. Its main pitch is a built-in <em>learning loop<\/em>. 
Instead of resetting to zero every session, Hermes runs a post-execution review after each successful task, distills the steps that worked into a reusable, Markdown-defined &#8220;skill&#8221; and refines those skills the next time it hits a similar problem. It also keeps persistent memory across sessions, so it gradually builds a model of your projects and how you like things done, <strong>effectively learning through experience.<\/strong><\/p>\n\n\n\n<p class=\"has-medium-font-size\">Unlike a copilot tethered to an IDE, Hermes is meant to live on a server and run unattended &#8211; a $5 VPS, a GPU box or serverless infra that costs almost nothing when idle. It talks to hundreds of LLMs through the OpenAI-compatible interface, can communicate via Telegram, Discord, Slack, WhatsApp, Signal, email and a CLI, supports natural-language cron for scheduled jobs, and can spin up subagents to parallelize work. It also ships with 40+ built-in skills out of the box.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">What made it especially attractive to us is its ability to write and improve its own playbook. This ability made us think: <em>could it learn how to do our job and fully automate the work we do at the Quick Reactions pod?<\/em><\/p>\n\n\n\n<p class=\"has-medium-font-size\">Our pod\u2019s mission is to pick up state-of-the-art AI tools, experiment with them, and write an honest, evidence-backed assessment.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">What makes our work tricky is the fact that every new tool we evaluate is different. 
We need to study it, figure out how to set it up, run it, and evaluate the results.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>Our 6-step evaluation protocol<\/strong><\/h2>\n\n\n\n<p class=\"has-medium-font-size\">We begin by encoding our workflow into a strict, 6-step protocol that the agent can follow for every evaluation:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li class=\"has-medium-font-size\"><strong>Workspace initialization<\/strong> &#8211; spin up a clean, isolated project environment so each evaluation starts from a known state.<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Baseline replication<\/strong> &#8211; run the tool&#8217;s own &#8220;getting started&#8221; examples first, to confirm the headline claims reproduce before we push further.<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Rigorous verification<\/strong> &#8211; design and run additional autonomous tests that probe the tool under conditions its authors didn&#8217;t pick, rather than extrapolating from the happy path.<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Data synthesis &amp; metrics control<\/strong> &#8211; measure the things that matter (recall, latency, accuracy) against ground truth data.<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Adversarial peer review<\/strong> &#8211; hand the findings to a separate agent powered by a powerful LLM (Claude Opus 4.8), to receive feedback and iterate.<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><strong>Finalization &amp; delivery<\/strong> &#8211; format the assessment doc and ship it to the right channel (to our internal <a href=\"https:\/\/www.notion.com\/\">Notion<\/a> knowledge base in our case).<\/li>\n<\/ol>\n\n\n\n<p class=\"has-medium-font-size\">The idea was simple: if these are the steps a human pod member walks through, can an agent walk through them unattended &#8211; and would the writeup at the end be any good?<\/p>\n\n\n\n<h2 class=\"wp-block-heading 
has-large-font-size\"><strong>The Skill system: how Hermes improves itself<\/strong><\/h2>\n\n\n\n<p class=\"has-medium-font-size\">The first time we ran Hermes on a real task, we walked it through the 6-step protocol by hand. Instead of just following along, Hermes wrote each step down as its own skill &#8211; a short Markdown file it can pull up at the start of any future run. So the protocol stopped being something we had to repeatedly provide as input; it became something the agent already knows.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Hermes used this knowledge to build a small set of skills covering everything we do: how to set up a clean workspace, how to test a new tool properly, and how to draft the write-up and run it past the reviewers. On top of those sits one master skill that holds the whole run together &#8211; it treats the 6 steps as a checklist and won&#8217;t let Hermes jump ahead before the previous step is actually done.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Impressively, when the LLM peer reviewer flagged something during an experiment (e.g. an unfair baseline, a missing caveat, a poorly designed experiment), Hermes would learn from the feedback and address the issue. It would also record its new learnings in the related skill files, so the next experiment started from a slightly stronger playbook.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">That&#8217;s the self-improvement loop the Hermes pitch promises, and we actually watched it happen. The more experiments we run, the better the skills get, and the less we need to babysit Hermes for the next one. 
The underlying LLM that powers Hermes (Gemini 3.1 Pro) isn&#8217;t getting smarter &#8211; its playbook is, and Hermes is the one rewriting it.<\/p>\n\n\n\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>Putting Hermes to the test<\/strong><\/h2>\n\n\n\n<p class=\"has-medium-font-size\">We gave Hermes a single, real assignment: take a brand-new open-source tool called <a href=\"https:\/\/github.com\/RyanCodrai\/turbovec\">Turbovec<\/a> &#8211; which claims to store huge amounts of data in a tiny amount of memory and search it faster than the popular alternative &#8211; and find out whether those claims actually hold up.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">We handed the agent the tool, its documentation, a bare cloud machine to work on, and nothing else: no starter code, no template, no outline. Hermes had to decide what and how to test, run the experiments, and write the whole thing up on its own.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">We reviewed Hermes\u2019 output in the exact same way we would review a human colleague&#8217;s work, via four simple questions:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\"><strong><em>Did it manage to set up the tool and run experiments?<\/em> Yes!<\/strong> It wasn&#8217;t all smooth sailing. Hermes&#8217;s first pass used the wrong settings. One of the integrations that Turbovec advertised also didn&#8217;t work on the first try &#8211; a common challenge we face in the Quick Reactions pod. However, rather than getting blocked by these stumbles, the agent noticed them, fixed them and left a clear trail of what went wrong and how it was corrected &#8211; exactly the kind of thing a rushed human reviewer might quietly skip over.<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><em><strong>Did it design a fair test?<\/strong><\/em> Yes! 
It first reproduced the tool authors&#8217; own results, then set up an even-handed comparison against the leading alternative tool (<a href=\"https:\/\/github.com\/facebookresearch\/faiss\">FAISS<\/a>) as a baseline. It was also careful enough to optimize the baseline tool\u2019s configuration (rather than leaving it deliberately weak), ensuring that the contest wasn&#8217;t rigged in Turbovec&#8217;s favor.<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><em><strong>Did it get the numbers right?<\/strong><\/em> Yes, with a bit of extra AI help. For instance, its first attempt used a small data sample and then just assumed the results would scale up neatly. The adversarial peer-review step (step 5 in our 6-step protocol) caught that this assumption was unsafe. Hermes accepted the criticism and re-ran the test at full size. The adversarial reviewer turned out to be right &#8211; using the small sample would have significantly skewed the results.<\/li>\n\n\n\n<li class=\"has-medium-font-size\"><em><strong>Was the writeup appropriate?<\/strong><\/em> Yes, with a bit of extra AI help. Hermes\u2019 original draft omitted some critical details and also inflated some of the findings. Thankfully, the LLM reviewer\u2019s feedback in step 5 also ensured that the claims got toned down to what the data actually supported.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading has-large-font-size\"><strong>So is it worth it?<\/strong><\/h2>\n\n\n\n<p class=\"has-medium-font-size\">Very promising. 
Left alone with a new GitHub repo and a blank machine, Hermes handled the mechanical, time-consuming work on its own: it downloaded the repo, installed everything, and ran some initial tests to make sure everything ran.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Even though it did stumble during the actual experimental design and assessment of the tool (Turbovec), the introduction of our second Reviewer agent was enough to address these issues and deliver a high-quality assessment with no human in the loop.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Obviously this is just a single &#8211; albeit very promising &#8211; piece of evidence. We will keep pushing the limits of Hermes in the context of our pod\u2019s work, with the intent to automate and scale up our assessment efforts as much as possible.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Another angle we plan to explore is cost minimization. This first experiment showed the effectiveness of the iterative, dual-agent architecture:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">An affordable Gemini-powered Hermes to actually do the heavy lifting (open-ended, token-heavy)<\/li>\n\n\n\n<li class=\"has-medium-font-size\">A more expensive Opus-powered Reviewer to review the report after each iteration and provide feedback (single-shot, token-lean)<\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">A key question &#8211; and one that we keep facing in WPP Research &#8211; is: <em><strong>what is the cheapest LLM brain that we could use for each agent while maintaining output quality?<\/strong><\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Hermes is a popular, open-source agent by Nous Research. Its standout feature is a learning loop: after each task, it distills what worked into reusable, Markdown-defined &#8220;skills&#8221; and carries memory across sessions. 
We tested whether it could do the work of our Quick Reactions pod: evaluating new AI tools and publishing short, honest, evidence-backed assessments for teams across WPP. We encoded our workflow into a strict 6-step protocol and gave it to Hermes along with the GitHub repo of a new tool: Turbovec, a vector database claiming to pack huge data into a tiny memory footprint. Hermes handled all the groundwork (setup, baseline replication, checks, drafting), designed a reasonable test, and wrote an honest report. However, it did make configuration and experimental-design mistakes that compromised its findings. To fix this while keeping things fully agentic, we added a &#8220;Reviewer&#8221; agent powered by a powerful LLM. Through multiple feedback iterations, and aided by Hermes&#8217; ability to learn from experience, the two converged on a solid recipe that delivered a high-quality assessment.<\/p>\n","protected":false},"author":5,"featured_media":1591,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"tags":[],"ppma_author":[{"id":5,"display_name":"Ted Lappas","first_name":"Ted","last_name":"Lappas","nickname":"theodoros.lappas","user_nicename":"theodoros-lappas","user_email":"Theodoros.Lappas@wpp.com","biographical_info":"Ted co-leads WPP Research and serves as Head of Data Science at Satalia, co-founder of Conscium, and Assistant Professor in the Department of Marketing and Communication at the Athens University of Economics and Business. His research spans scalable algorithms for multimodal data, synthetic data generation, simulation-based verification for AI agents, and information diffusion and collective intelligence in expert networks. 
He publishes regularly in top-tier computer science and business venues.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/pic.png","job_title":"Head of Data Science","is_lead":false,"display_as_researcher":true,"order_priority":1},{"id":15,"display_name":"Nikos Gkikizas","first_name":"Nikos","last_name":"Gkikizas","nickname":"nikos.gkikizas","user_nicename":"nikos-gkikizas","user_email":"nikos.gkikizas@satalia.com","biographical_info":"Nikos is a Staff Data Scientist at Satalia. He started off as a mining engineer and later specialized in explosives and blasting engineering. Through signal processing he was introduced to data science, with which he gradually fell in love with. Over the last decade he's been working in Data Science, having developed various simulation, predictive and optimization solutions.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/signal-2026-04-01-111218.jpeg","job_title":"Staff Data Scientist","is_lead":false,"display_as_researcher":true,"order_priority":null},{"id":16,"display_name":"Andreas Stavrou","first_name":"Andreas","last_name":"Stavrou","nickname":"andreas.stavrou","user_nicename":"andreas-stavrou","user_email":"andreas.stavrou@satalia.com","biographical_info":"Andreas is a Senior Data Scientist at Satalia and part of the WPP Research team. 
With over a decade of hands-on experience, he has delivered machine learning solutions across retail, online gambling, and credit risk, and now builds AI systems at scale in the global advertising industry.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/avatar.jpg","job_title":"Senior Data Scientist","is_lead":null,"display_as_researcher":null,"order_priority":null}],"class_list":["post-1584","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry"],"acf":{"related_pods":[1362],"featured":false},"authors":[{"term_id":36,"user_id":5,"is_guest":0,"slug":"theodoros-lappas","display_name":"Theodoros Lappas","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/pic.png","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":"","biographical_info":"Ted co-leads WPP Research and serves as Head of Data Science at Satalia, co-founder of Conscium, and Assistant Professor in the Department of Marketing and Communication at the Athens University of Economics and Business. His research spans scalable algorithms for multimodal data, synthetic data generation, simulation-based verification for AI agents, and information diffusion and collective intelligence in expert networks. He publishes regularly in top-tier computer science and business venues."},{"term_id":31,"user_id":15,"is_guest":0,"slug":"nikos-gkikizas","display_name":"Nikos Gkikizas","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/signal-2026-04-01-111218.jpeg","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":"","biographical_info":"Nikos is a Staff Data Scientist at Satalia. He started off as a mining engineer and later specialized in explosives and blasting engineering. Through signal processing he was introduced to data science, with which he gradually fell in love with. 
Over the last decade he's been working in Data Science, having developed various simulation, predictive and optimization solutions."},{"term_id":22,"user_id":16,"is_guest":0,"slug":"andreas-stavrou","display_name":"Andreas Stavrou","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/avatar.jpg","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":"","biographical_info":"Andreas is a Senior Data Scientist at Satalia and part of the WPP Research team. With over a decade of hands-on experience, he has delivered machine learning solutions across retail, online gambling, and credit risk, and now builds AI systems at scale in the global advertising industry."}],"featured_image_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-10-at-7.06.52-PM-1024x561.png","featured_image_sizes":{"thumbnail":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-10-at-7.06.52-PM-150x150.png","medium":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-10-at-7.06.52-PM-300x165.png","large":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-10-at-7.06.52-PM-1024x561.png","full":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/06\/Screenshot-2026-06-10-at-7.06.52-PM.png"},"_links":{"self":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/1584","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/users\/5"}],"replies":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1584"}],"version-history":[{"count":16,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/w
p\/v2\/posts\/1584\/revisions"}],"predecessor-version":[{"id":1602,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/1584\/revisions\/1602"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/media\/1591"}],"wp:attachment":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1584"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1584"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=1584"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}