Hermes Agent is an open-source, self-hosted AI agent released under an MIT license by Nous Research, the lab behind the Hermes model family. Its main pitch is a built-in learning loop. Instead of resetting to zero every session, Hermes runs a post-execution review after each successful task, distills the steps that worked into a reusable, Markdown-defined “skill”, and refines those skills the next time it hits a similar problem. It also keeps persistent memory across sessions, so it gradually builds a model of your projects and how you like things done, effectively learning through experience.
Unlike a copilot tethered to an IDE, Hermes is meant to live on a server and run unattended – a $5 VPS, a GPU box or serverless infra that costs almost nothing when idle. It talks to hundreds of LLMs through the OpenAI-compatible interface, can communicate via Telegram, Discord, Slack, WhatsApp, Signal, email and a CLI, supports natural-language cron for scheduled jobs, and can spin up subagents to parallelize work. It also ships with 40+ built-in skills out of the box.
What made it especially attractive to us is its ability to write and improve its own playbook. This ability made us think: could it learn how to do our job and fully automate the work we do at the Quick Reactions pod?
Our pod’s mission is to pick up state-of-the-art AI tools, experiment with them, and write an honest, evidence-backed assessment.
What makes our work tricky is the fact that every new tool we evaluate is different. We need to study it, figure out how to set it up, run it, and evaluate the results.
Our 6-step evaluation protocol
We begin by encoding our workflow into a strict, 6-step protocol that the agent can follow for every evaluation:
1. Workspace initialization – spin up a clean, isolated project environment so each evaluation starts from a known state.
2. Baseline replication – run the tool’s own “getting started” examples first, to confirm the headline claims reproduce before we push further.
3. Rigorous verification – design and run additional autonomous tests that probe the tool under conditions its authors didn’t pick, rather than extrapolating from the happy path.
4. Data synthesis & metrics control – measure the things that matter (recall, latency, accuracy) against ground truth data.
5. Adversarial peer review – hand the findings to a separate agent powered by a powerful LLM (Claude Opus 4.8), to receive feedback and iterate.
6. Finalization & delivery – format the assessment doc and ship it to the right channel (to our internal Notion knowledge base in our case).
The idea was simple: if these are the steps a human pod member walks through, can an agent walk through them unattended – and would the writeup at the end be any good?
The Skill system: how Hermes improves itself
The first time we ran Hermes on a real task, we walked it through the 6-step protocol by hand. Instead of just following along, Hermes wrote each step down as its own skill – a short Markdown file it can pull up at the start of any future run. So the protocol stopped being something we had to repeatedly provide as input; it became something the agent already knows.
Hermes used this knowledge to build a small set of skills covering everything we do: how to set up a clean workspace, how to test a new tool properly, and how to draft the write-up and run it past the reviewers. On top of those sits one master skill that holds the whole run together – it treats the 6 steps as a checklist and won’t let Hermes jump ahead before the previous step is actually done.
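The gating behaviour of that master skill can be pictured as an ordered checklist. What follows is a minimal sketch in Python, assuming a simple in-memory list of steps; Hermes skills are Markdown files, so the `ProtocolChecklist` class and the step identifiers are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field

# Step names follow the 6-step protocol; the identifiers themselves are hypothetical.
STEPS = [
    "workspace_initialization",
    "baseline_replication",
    "rigorous_verification",
    "data_synthesis_and_metrics",
    "adversarial_peer_review",
    "finalization_and_delivery",
]


@dataclass
class ProtocolChecklist:
    """Ordered checklist that refuses to skip ahead."""

    completed: list = field(default_factory=list)

    def next_step(self):
        """Return the next pending step, or None when all steps are done."""
        if len(self.completed) == len(STEPS):
            return None
        return STEPS[len(self.completed)]

    def complete(self, step):
        """Mark a step as done, but only if it is the next one in order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"cannot complete {step!r}; next step is {expected!r}")
        self.completed.append(step)


checklist = ProtocolChecklist()
checklist.complete("workspace_initialization")
try:
    checklist.complete("adversarial_peer_review")  # jumping ahead is rejected
except ValueError as err:
    print(err)
```

The only design choice that matters here is that completion is validated against the next expected step, which is what stops a run from drifting into the write-up before the experiments exist.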
Impressively, when the LLM peer reviewer flagged something during an experiment (e.g. an unfair baseline, a missing caveat, a poorly designed experiment), Hermes would learn from the feedback and address the issue. It would also record its new learnings in the related skill files, so the next experiment started from a slightly stronger playbook.
That’s the self-improvement loop the Hermes pitch promises, and we actually watched it happen. The more experiments we run, the better the skills get, and the less we need to babysit Hermes for the next one. The underlying LLM that powers Hermes (Gemini 3.1 Pro) isn’t getting smarter – its playbook is, and Hermes is the one rewriting it.
Putting Hermes to the test
We gave Hermes a single, real assignment: take a brand-new open-source tool called Turbovec – which claims to store huge amounts of data in a tiny amount of memory and search it faster than the popular alternative – and find out whether those claims actually hold up.
We handed the agent the tool, its documentation, a bare cloud machine to work on, and nothing else: no starter code, no template, no outline. Hermes had to decide what and how to test, run the experiments, and write the whole thing up on its own.
We reviewed Hermes’ output in the exact same way we would review a human colleague’s work, via four simple questions:
Did it manage to set up the tool and run experiments? Yes! It wasn’t all smooth sailing. Hermes’ first pass used the wrong settings. One of the integrations that Turbovec advertised also didn’t work on the first try – a common challenge we face in the Quick Reactions pod. However, rather than getting blocked by these stumbles, the agent noticed them, fixed them and left a clear trail of what went wrong and how it was corrected – exactly the kind of thing a rushed human reviewer might quietly skip over.
Did it design a fair test? Yes! It first reproduced the tool authors’ own results, then set up an even-handed comparison against the leading alternative tool (FAISS) as a baseline. It was also careful to optimize the baseline tool’s configuration (rather than deliberately making it a weak one), ensuring that the contest wasn’t rigged in Turbovec’s favor.
Did it get the numbers right? Yes, with a bit of extra AI help. For instance, its first attempt used a small data sample and then just assumed the results would scale up neatly. The adversarial peer-review step (step 5 in our 6-step workflow) caught that this assumption was unsafe. Hermes accepted the criticism and re-ran the full-size test. The adversarial reviewer turned out to be right – using the small sample would have significantly skewed the results.
Was the writeup appropriate? Yes, with a bit of extra AI help. Hermes’ original draft omitted some critical details and also inflated some of the findings. Thankfully, the LLM reviewer’s feedback in step 5 also ensured that claims got toned down to what the data actually supported.
So is it worth it?
Very promising. Left alone with a new GitHub repo and a blank machine, Hermes handled the mechanical, time-consuming work on its own: it downloaded the repo, installed everything, and ran some initial tests to make sure everything runs.
Even though it did stumble during the actual experimental design and assessment of the tool (Turbovec), the introduction of our second Reviewer agent was enough to address these issues and deliver, with no human in the loop.
Obviously this is just a single – albeit very promising – piece of evidence. We will keep pushing the limits of Hermes in the context of our pod’s work, with the intent to automate and scale up our assessment efforts as much as possible.
Another angle we plan to explore is cost minimization. This first experiment showed the effectiveness of the iterative, dual-agent architecture:
An affordable Gemini-powered Hermes to actually do the heavy lifting (open-ended, token-heavy)
A more expensive Opus-powered Reviewer to review the report after each iteration and provide feedback (single-shot, token-lean)
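A minimal sketch of that iterate-until-approved loop is shown here, assuming the two agents can be treated as plain callables. The function names, the feedback format and the toy worker and reviewer are all hypothetical stand-ins; no LLM API is called.

```python
from typing import Callable, List, Tuple


def review_loop(
    draft: Callable[[List[str]], str],
    review: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Alternate worker drafts and reviewer feedback until the reviewer has no objections."""
    feedback: List[str] = []
    report = ""
    for round_no in range(1, max_rounds + 1):
        report = draft(feedback)    # worker: open-ended, token-heavy
        feedback = review(report)   # reviewer: single-shot, token-lean
        if not feedback:
            return report, round_no
    return report, max_rounds       # give up after max_rounds, returning the last draft


# Toy stand-ins for the two LLM-backed agents.
def toy_worker(feedback: List[str]) -> str:
    if "rerun at full size" in feedback:
        return "Results from the full-size run."
    return "Results from a small sample, extrapolated."


def toy_reviewer(report: str) -> List[str]:
    return ["rerun at full size"] if "extrapolated" in report else []


final_report, rounds = review_loop(toy_worker, toy_reviewer)
print(rounds, final_report)
```

Capping the number of rounds keeps the expensive reviewer's spend bounded even when the worker never fully satisfies it.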
A key question – and one that we keep facing in WPP Research – is: what is the cheapest LLM brain that we could use for each agent while maintaining output quality?
DeepSeek’s release pattern has been consistent: ship a model that posts frontier-comparable benchmark numbers at an order-of-magnitude lower price, then watch the other providers scramble. DeepSeek v4 Pro is the most aggressive instance of that pattern yet. Rather than putting it to the test via a standard off-the-shelf agentic benchmark, we focused on a more pertinent question: what happens when you actually deploy it as the LLM brain of an agent?
“Agentic ability” packs in at least three dimensions:
Behaving like a competent professional in realistic settings. Stay in character. Stay focused on your objective. Communicate effectively with different types of stakeholders. Respect policies and constraints. Validate the quality of the information that you consume and produce. Adapt to changing circumstances.
Long-horizon autonomous problem solving. Read a codebase, form a hypothesis, run an experiment, read the result, build on it. Repeat many times without a human in the loop.
Cost-efficiency under sustained load. Being able to solve complex problems and succeed in real-world scenarios for $0.05 per task is a very different proposition from achieving the same outcomes for $1.50 per task, even if the success rate is identical.
To evaluate DeepSeek across all three dimensions, we picked two different agentic tools:
VerifyAX (Conscium’s agent-evaluation platform). VerifyAX drops an agent into the kind of situation it would actually face once deployed: a realistic scenario populated by other characters (customers, interviewers, colleagues, adversaries), with an objective to achieve and rules to follow. Scenarios are automatically generated to exercise a wide panel of specific skills, from communication and safety to technical reasoning.
Autoresearch (Andrej Karpathy’s open-source autonomous-research framework). It hands an agent a piece of code and a metric to improve, then steps back. The agent reads the code, makes one change, runs it, checks whether the metric improved, and decides whether to keep or undo the change. This continues for many iterations, with no human in the loop.
Cost (the third dimension) can be easily measured on both VerifyAX and Autoresearch, so we report it separately.
To benchmark DeepSeek v4 Pro against the rest of the frontier, we set up four agents, each powered by a different LLM: DeepSeek v4 Pro, GPT 5.5 (OpenAI), Claude Opus 4.6 (Anthropic), and Gemini 3.1 Pro (Google). The agent harness is identical across all four; only the LLM brain changes.
2. Experiment A: VerifyAX
Setup
We ran all four LLM-powered agents (powered by Opus 4.6, Gemini 3.1 Pro, DeepSeek v4 Pro, and GPT 5.5) through two VerifyAX-generated scenario suites:
The harder suite — 25 multi-agent scenarios, in which the agent is given a fixed amount of time to navigate multiple stakeholders (manifested as VerifyAX-controlled agents), exchange information, use tools, and make decisions in order to achieve an objective. Each scenario is designed to test for specific skills.
The easier suite — 25 one-on-one scenarios. Similar to the harder suite, but with only one stakeholder in each scenario.
Every run produces a pass/fail verdict per scenario plus a per-skill score (1–5) for every skill the scenario tests for. Across the 50 scenarios, the two suites collectively exercise 60+ distinct skills, drawn from a wider VerifyAX panel that covers safety, reasoning under pressure, information extraction, communication skills, and technical execution. A representative sample is shown in the table below:
| Skill tag | What it tests |
| --- | --- |
| goal_injection_resistance | Resists being hijacked into a new, conflicting mission mid-scenario. |
| incremental_corruption_resistance | Notices when a chain of seemingly-harmless requests cumulatively breaks the rules. |
| sycophancy_resistance | Refuses to validate a demonstrably false statement just because the counterparty insists on it. |
| hate_speech_refusal | Refuses to produce hate speech or discriminatory content. |
| irreversible_action_caution | Insists on explicit informed consent before taking unrecoverable actions. |
| ambiguity_clarification | Pins down vague jargon or undefined terms before acting. |
| contradiction_handling | Calls out inconsistencies in a counterparty’s statements tactfully, without triggering defensiveness. |
| tradeoff_reasoning | Weighs competing options and articulates the cost of each choice. |
| data_hallucination_resistance | Asks for missing materials instead of inventing them when context is incomplete. |
| long_horizon_thinking | Plans and sequences actions across many steps toward a distant goal. |
| anger_deescalation | Acknowledges emotion, defuses conflict, proposes concrete next steps. |
| intermediate_math | Solves multi-step quantitative problems. |
| advanced_web_research | Answers complex questions requiring browsing, cross-referencing multiple sources, and synthesis. |
| advanced_programming | Solves complex programming problems. |
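Because some skills are exercised by many more scenarios than others, how the per-skill scores are averaged matters. The sketch below assumes that a macro-average means "average each skill first, then average the skills", and contrasts it with a plain average over every score. The scores are made up for illustration and are not VerifyAX output.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (skill_tag, score) pairs collected across scenarios.
scores = [
    ("sycophancy_resistance", 5),
    ("sycophancy_resistance", 3),
    ("sycophancy_resistance", 4),
    ("intermediate_math", 2),
]


def macro_average(pairs):
    """Mean of per-skill means: every skill counts once, however often it was tested."""
    by_skill = defaultdict(list)
    for skill, score in pairs:
        by_skill[skill].append(score)
    return mean(mean(values) for values in by_skill.values())


def micro_average(pairs):
    """Mean over all individual scores: frequently tested skills dominate."""
    return mean(score for _, score in pairs)


macro = macro_average(scores)
micro = micro_average(scores)
print(f"macro={macro}, micro={micro}")
```

With these toy numbers the weak, rarely tested skill pulls the macro-average down further than the micro-average, which is the reason to prefer it when every skill should carry equal weight.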
Result
Harder suite (multi-agent):
| Model | Pass rate | Avg skill grade (1–5) | Agent cost | $ / scenario |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 13/25 (52%) | 4.63 | $19.95 | $0.80 |
| Gemini 3.1 Pro | 8/25 (32%) | 4.42 | $1.81 | $0.07 |
| DeepSeek v4 Pro | 7/25 (28%) | 4.04 | $0.48 | $0.02 |
| GPT 5.5 | 7/25 (28%) | 4.25 | $1.79 | $0.07 |
Easier suite (one-on-one):
| Model | Pass rate | Avg skill grade (1–5) | Agent cost | $ / scenario |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 23/25 (92%) | 4.83 | $17.09 | $0.68 |
| Gemini 3.1 Pro | 23/25 (92%) | 4.71 | $0.99 | $0.04 |
| GPT 5.5 | 20/25 (80%) | 4.52 | $1.19 | $0.05 |
| DeepSeek v4 Pro | 19/25 (76%) | 4.58 | $0.30 | $0.01 |
Three things jump out:
Claude wins, comfortably. On the harder suite it passes 52% of scenarios while everyone else clusters at 28–32%, and it posts the highest macro-averaged skill grade (4.63) of any model on that suite. On the easier suite, Claude and Gemini tie on pass rate at 92%, but Claude still edges Gemini on skill grade (4.83 vs 4.71).
DeepSeek sits at the bottom on both suites. On the harder suite it ties GPT 5.5 at the bottom of the table (both 7/25, 28%). On the easier suite it’s the weakest of the four on pass rate (19/25, 76%), though its skill grade (4.58) actually edges GPT’s (4.52).
The cost spread is startling. Claude’s per-scenario spend is ~40× DeepSeek’s on the harder suite and ~55× on the easier one. GPT 5.5 and Gemini sit in the same mid-range bracket (~$0.04–0.07/scenario); only Claude is in a different tier.
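The cost ratios can be rechecked directly from the reported pass counts and agent costs. The short script below does that, and also derives a cost per passed scenario. That second figure is our own illustrative metric rather than something VerifyAX reports.

```python
# (passed scenarios, total agent cost in USD) per model, copied from the two result tables.
harder = {
    "Claude Opus 4.6": (13, 19.95),
    "Gemini 3.1 Pro": (8, 1.81),
    "DeepSeek v4 Pro": (7, 0.48),
    "GPT 5.5": (7, 1.79),
}
easier = {
    "Claude Opus 4.6": (23, 17.09),
    "Gemini 3.1 Pro": (23, 0.99),
    "GPT 5.5": (20, 1.19),
    "DeepSeek v4 Pro": (19, 0.30),
}
SCENARIOS = 25  # scenarios per suite


def cost_ratio(suite, model_a, model_b):
    """How many times more model_a spent than model_b on the same suite."""
    return suite[model_a][1] / suite[model_b][1]


def cost_per_pass(suite, model):
    """Total spend divided by the number of scenarios actually passed."""
    passes, cost = suite[model]
    return cost / passes


for name, suite in (("harder", harder), ("easier", easier)):
    ratio = cost_ratio(suite, "Claude Opus 4.6", "DeepSeek v4 Pro")
    print(f"{name}: Claude / DeepSeek cost ratio = {ratio:.1f}x")
    for model, (_, cost) in suite.items():
        print(f"  {model}: ${cost / SCENARIOS:.2f}/scenario, "
              f"${cost_per_pass(suite, model):.2f}/passed scenario")
```

Dividing by passes rather than attempts narrows the gap somewhat on the harder suite, since Claude passes nearly twice as many scenarios as DeepSeek, but it remains more than an order of magnitude.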
3. Experiment B: Autoresearch
Setup
Each agent starts with just two files:
train.py — a ~630-line script that trains a small language model from scratch. The starting model is small by today’s standards — 8.7M parameters, the kind of tiny transformer you’d find in an early GPT-2.
program.md — a short prompt telling the agent what to do.
The agent then runs unattended, looping through these steps:
Read the current training script and a log of everything it has tried before.
Propose one specific code change.
Train the modified model on GPU hardware for a fixed 5-minute budget.
Show the trained model a chunk of text it has never seen and measure how well it predicts what comes next, character by character.
Keep the change if the new score is better than the best so far; otherwise discard it and try something different next time.
Repeat 50 times.
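Stripped of the LLM and the GPU, the steps above form a greedy keep-or-discard search. The sketch below shows that control flow with toy stand-ins: `toy_evaluate` replaces the 5-minute training run, `toy_propose` replaces the agent's code edit, and the score is assumed to be a loss where lower is better. None of these names come from the Autoresearch repo.

```python
import random


def run_loop(evaluate, propose, baseline, iterations=50, seed=0):
    """Greedy loop: accept a change only if it beats the best score so far."""
    rng = random.Random(seed)
    best_config, best_score = baseline, evaluate(baseline)
    history = []  # log of everything tried, passed back into propose()
    for _ in range(iterations):
        candidate = propose(best_config, history, rng)  # one change at a time
        score = evaluate(candidate)                     # stands in for a training run
        kept = score < best_score                       # assumption: lower is better
        if kept:
            best_config, best_score = candidate, score
        history.append((candidate, score, kept))
    return best_config, best_score, history


# Toy stand-ins: a one-parameter "config" and a made-up loss with its minimum at lr = 0.3.
def toy_evaluate(config):
    return (config["lr"] - 0.3) ** 2


def toy_propose(config, history, rng):
    return {"lr": config["lr"] + rng.uniform(-0.1, 0.1)}


best, best_score, log = run_loop(toy_evaluate, toy_propose, {"lr": 1.0})
kept_count = sum(1 for _, _, kept in log if kept)
print(f"kept {kept_count} of {len(log)} changes, final loss {best_score:.4f}")
```

Because every candidate is built on the best configuration so far, discarded changes cost time but never degrade the result.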
Result
| Model | % Improvement | Cost | Successful experiments |
| --- | --- | --- | --- |
| GPT 5.5 | 6.83% | $33.53 | 8 / 50 |
| Gemini 3.1 Pro | 6.12% | $10.21 | 9 / 50 |
| DeepSeek v4 Pro | 6.11% | $3.69 | 8 / 50 |
| Claude Opus 4.6 | 6.04% | $63.45 | 7 / 50 |
The improvement numbers cluster tightly. The cost numbers do not: Opus 4.6 cost ~17× more than DeepSeek v4 Pro for essentially the same outcome.
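To make the spread concrete, the snippet below recomputes the improvement gap and the Opus-to-DeepSeek cost ratio from the reported figures. It also ranks the models by dollars per percentage point of improvement, an illustrative metric of our own rather than one Autoresearch reports.

```python
# (improvement in %, cost in USD), copied from the results table.
results = {
    "GPT 5.5": (6.83, 33.53),
    "Gemini 3.1 Pro": (6.12, 10.21),
    "DeepSeek v4 Pro": (6.11, 3.69),
    "Claude Opus 4.6": (6.04, 63.45),
}

improvements = [improvement for improvement, _ in results.values()]
spread_pp = max(improvements) - min(improvements)  # gap between best and worst model
opus_vs_deepseek = results["Claude Opus 4.6"][1] / results["DeepSeek v4 Pro"][1]


def dollars_per_point(model):
    """Spend per percentage point of improvement achieved."""
    improvement, cost = results[model]
    return cost / improvement


ranking = sorted(results, key=dollars_per_point)
print(f"spread: {spread_pp:.2f} pp, Opus/DeepSeek cost: {opus_vs_deepseek:.1f}x")
for model in ranking:
    print(f"{model}: ${dollars_per_point(model):.2f} per percentage point")
```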
More interesting than the bottom line is the strategy fingerprint each model converged to. All four independently rediscovered the same single biggest win: cutting the training batch size in half (which trades smaller-per-step learning updates for more update steps inside the 5-minute budget — a good trade when the bottleneck is wall-clock, not data). After that they diverged:
Opus 4.6 kept the network’s outer shape and redesigned the building blocks inside each layer — a more expressive math operation in every block (an activation function called SwiGLU, instead of ReLU²) and 50% more internal capacity per layer. Same outside, smarter inside.
GPT 5.5 opted for a smaller, faster network (3 layers instead of 4, shorter context windows) so it could fit more training steps into the budget, with optimizer settings tuned to make those extra steps count.
DeepSeek v4 Pro combined GPT’s move with the only attention-mechanism change that survived in any model’s final config: grouped-query attention (reusing key/value projections across heads to compress the attention block).
Gemini 3.1 Pro left the network alone and changed how it was trained — same layers, same shape, same building blocks, but turned learning rates up and drove weight decay to zero. Every architectural change it tried, it reverted.
Those architectural moves had visible consequences for the final model size: Opus’s wider MLP made the model bigger than the 8.7M-parameter baseline, Gemini kept it at baseline size, GPT shrank it, and DeepSeek shrank it most — to 3.4M parameters, less than half the baseline.
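Grouped-query attention is one of the reasons a model can shrink: sharing key/value projections across heads removes weights from every attention block. The sketch below counts the projection weights for one block under standard multi-head attention and under grouped-query attention. The dimensions are hypothetical and are not the ones used in train.py.

```python
def attention_params(d_model, n_heads, n_kv_heads):
    """Weight count for the Q, K, V and output projections of one attention block (no biases)."""
    if d_model % n_heads or n_heads % n_kv_heads:
        raise ValueError("d_model must be divisible by n_heads, and n_heads by n_kv_heads")
    head_dim = d_model // n_heads
    q = d_model * d_model                       # one query projection per head
    kv = 2 * d_model * (n_kv_heads * head_dim)  # K and V shrink with fewer KV heads
    out = d_model * d_model                     # output projection is unchanged
    return q + kv + out


# Hypothetical dimensions chosen for illustration.
mha = attention_params(d_model=256, n_heads=8, n_kv_heads=8)  # standard multi-head attention
gqa = attention_params(d_model=256, n_heads=8, n_kv_heads=2)  # 4 query heads share each KV head
print(f"MHA: {mha:,} weights, GQA: {gqa:,} weights ({gqa / mha:.1%} of MHA)")
```

Only the key and value projections shrink, so the saving per block is bounded: even with a single shared KV head, the query and output projections remain full size.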
Four genuinely distinct strategies landing within 0.8 percentage points of each other is itself an interesting result, suggesting there are several different ways to win at this task within the 5-minute budget, and that experimenting with different models can lead to distinct but equally promising paths. Increasing the rounds and the time budget per round can help explore these paths further.
The Bottom Line
On Autoresearch, where the LLM-powered agent is the only stakeholder and the loop is a tight code-edit / measure-result cycle, DeepSeek is effectively tied with the top model on outcome (6.11% improvement, within 0.8 pp of GPT 5.5) at a fraction of the cost. On VerifyAX, where the agent has to survive multi-agent simulations that emulate real-world scenarios, it maintains its cost advantage but lands at or near the bottom of the rankings on both skill grades and objective completion. This highlights something we already knew: the key is to pick the right tool (LLM brain) for the job. If you care about cost and expect your agent to work on a problem on its own for many iterations, then the evidence suggests that DeepSeek is definitely worth a shot. However, if you expect your agent to operate in dynamic real-world environments with other stakeholders and complex constraints, then there are better options out there.
If you are looking for an LLM brain that can perform in both contexts, Gemini is the most consistent of the four: it is always in the top half, ties Claude on pass rate on the easier VerifyAX suite at a fraction of the cost, and finishes Autoresearch essentially tied with DeepSeek. It is the pragmatic default if you don’t want to commit to either end of the cost/capability spectrum.
Our experiments are just two of many that could (and should) be run to evaluate a model as powerful and multi-faceted as a frontier LLM. They offer some real evidence about where each LLM lands in the contexts we tested, but there’s plenty more work to do to establish how it performs in other settings.
Autoresearch is an agentic setup – a system that hands an AI agent the keys and lets it work on its own. You give the agent a 1-page file with instructions that describe:
what you want it to improve
what counts as success, and
what it’s allowed to change.
From there, the agent takes over. In our case, it started editing the training code, running it, inspecting the results, deciding what to try next, editing again – and looping like that on its own until it hit the budget.
The autoresearch repo itself showcases this setup by pointing the agent at training Nanochat – a small but real language model that covers all the usual stages of building an LLM, including tokenization, pretraining, finetuning, evaluation and inference. The objective is the following: achieve GPT-2 level capabilities in as little time as possible, measured by a metric called Time-to-GPT-2 on a standard benchmark called DCLM CORE.
This task is far from trivial – there’s even a public leaderboard tracking who can do it fastest, and at this point Autoresearch has beaten the engineers who built Nanochat itself!
Autoresearch isn’t limited to LLMs or traditional ML either. The setup is intentionally generic – point it at a forecasting model, a recommendation engine, a media plan, or an operational workflow, and it works the same way.
If you can measure success, the agent can optimize for it.
Most agents just use AI models as building blocks. Autoresearch promises to build and improve them. We had to try it ourselves.
Here’s what we found.
Setup
Setup is genuinely easy. We cloned the repo, followed the README, picked Claude Sonnet to power our agent and kicked off an open-ended experimentation loop on the Nanochat model to run overnight.
How it works
Every few minutes, the agent runs a quick experiment: it changes one thing about the training process and checks if the model got any better. If it got better, it keeps the change and builds on top of it. If it got worse, it throws the change away and goes back to the best version so far. It just keeps looping like this on its own, slowly nudging the model in a better direction.
Results
Overnight, the agent ran 70 experiments on its own and improved the Time-to-GPT-2 metric by 11.26%. The whole run finished by morning and cost about $60 in API calls.
The agent didn’t just tune the dials of the training process; it also made small architectural changes, explaining the reasoning behind each one along the way. You can push it further too: ask it to do deep research before experimenting, or cite papers to back up its choices.
| Session metric | Value |
| --- | --- |
| Total cost | $61.69 |
| Total duration (API) | 2h 18m 3s |
| Total duration (wall) | 18h 32m 45s |
| Total code changes | 2,335 lines added, 73 lines removed |
| Model | claude-sonnet-4-5 |
| Tokens (input / output) | 38.7k / 275.2k |
| Cache (read / write) | 65.8m / 10.1m |
Table 1: Key session metrics on the overnight agentic experimentation.
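A couple of derived figures help put Table 1 in perspective. The snippet below works them out from the reported totals and the 70 experiments mentioned earlier; the variable and function names are ours.

```python
def to_seconds(hours, minutes, seconds):
    """Convert an h/m/s duration into seconds."""
    return hours * 3600 + minutes * 60 + seconds


TOTAL_COST_USD = 61.69
EXPERIMENTS = 70

api_seconds = to_seconds(2, 18, 3)     # time spent waiting on the LLM API
wall_seconds = to_seconds(18, 32, 45)  # end-to-end duration of the session

cost_per_experiment = TOTAL_COST_USD / EXPERIMENTS
api_share = api_seconds / wall_seconds  # fraction of wall-clock time spent in API calls

print(f"cost per experiment: ${cost_per_experiment:.2f}")
print(f"API time share of wall clock: {api_share:.1%}")
```

The low API share is consistent with a session that spends most of its time waiting on training runs and only a small part waiting on the model.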
What Worked
Autoresearch’s report felt like an actual researcher had worked on our model all night and left a thorough write-up for us to review.
The initial logs made us skeptical. They were mostly standard, “old fashioned” parameter tweaks. However, a few runs in, we started getting real architectural changes, each one paired with supporting evidence such as published papers and prior implementations.
What to watch out for
Cost is the obvious one. The agent is constantly calling an LLM, so the bill scales with whatever model you’ve plugged in – anywhere from free with open-source models to several thousand dollars overnight if you go with a frontier one.
If that puts you off, there’s a free, multi-agent variant that takes a different approach: rather than throwing one expensive model at the problem, it has several cheaper ones collaborate and tries to get most of the way there.
What’s reusable
Autoresearch isn’t really a system. It’s a pattern: a short instruction file, a success metric, an improvement loop. Anything that fits that shape is fair game. The same setup that tuned a language model overnight could be used to optimize a forecasting model scored against held-out data, an LLM prompt scored against an eval set, a trading strategy backtested on historical prices, a piece of code scored by its test suite.
Bottom Line and WPP Applications
Autoresearch is a small setup that punches well above its weight. It’s a low-lift thing to try, and it actually delivers – we’re keen to throw it at more problems and see how far it goes.
With AlphaEvolve, we took Google DeepMind’s framework, which uses Gemini to propose and evolve model architectures by itself, and turned it loose on actual campaign problems. We saw up to 10% gains in prediction accuracy and 7% in recommendation scores over our baselines, and got there much faster than usual. Details in the technical and executive write-ups.
With the Self-Improving Performance Agent, we built our own Prediction Optimization Agent – a system that turns an influencer post into a plain-language description, predicts how it’ll perform, and then rewrites its own instructions to get sharper over time. Across a dataset of over 10 million Instagram posts, a fine-tuned DistilBERT predictor reached an R² of 0.80, and the optimization loop kept landing on richer, more predictive descriptions with each round. Details in the technical doc and the executive summary.
This is the kind of work we love: agentic setups that quietly do the hard part, so we can build better systems, faster.
OpenClaw has been going viral in the agentic AI community. It’s an open-source toolkit for building AI agents: pick a model, give the agent some tools and a goal, and OpenClaw runs it. Unlike a plain chatbot, OpenClaw agents remember across sessions, plug into mainstream services (email, calendar, files, plus a community library of integrations), and act proactively – kicking off scheduled tasks and reminders without being prompted.
The feature that caught our attention is its channel support: out of the box, OpenClaw agents can talk to people on WhatsApp, Telegram, Signal, iMessage, Discord, Slack, and more than fifteen other channels. For any product team thinking about agentic UX, that’s a big deal – messaging is where users actually are.
So we put it to the test, with three questions in mind:
Is it easy to set up?
Could a product team realistically build their user-facing communications on top of it?
Does the value justify the cost in tokens, infrastructure, and maintenance?
We deployed an OpenClaw instance and pointed it at two channels: Telegram and WhatsApp. Here’s what we found.
Setup: genuinely easy
Installation is a single command followed by a short guided setup. We were chatting with our deployment from a phone within minutes.
We routed traffic across three Gemini 3.1 tiers based on task complexity – Pro Preview for multi-step reasoning, Flash Preview for summaries and intent classification, and Flash Lite Preview for trivial replies and heartbeats. Done well, this kind of routing materially cuts cost and latency.
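A minimal sketch of this kind of tiered routing is shown below. It assumes each incoming task already carries a coarse label. The tier strings are placeholders that mirror the names in the text, not verified API model IDs, and the rule-based `classify` function is a hypothetical stand-in for whatever classifier a real deployment would use.

```python
from enum import Enum


class Complexity(Enum):
    TRIVIAL = 1  # heartbeats, acknowledgements
    LIGHT = 2    # summaries, intent classification
    HEAVY = 3    # multi-step reasoning


# Placeholder tier labels, not verified model identifiers.
ROUTES = {
    Complexity.TRIVIAL: "flash-lite-preview",
    Complexity.LIGHT: "flash-preview",
    Complexity.HEAVY: "pro-preview",
}


def classify(task_kind: str) -> Complexity:
    """Hypothetical rule-based classifier keyed on a task label."""
    if task_kind in {"heartbeat", "ack"}:
        return Complexity.TRIVIAL
    if task_kind in {"summary", "intent"}:
        return Complexity.LIGHT
    return Complexity.HEAVY  # default to the strongest tier when unsure


def route(task_kind: str) -> str:
    """Pick the model tier for a task."""
    return ROUTES[classify(task_kind)]


for kind in ("heartbeat", "summary", "plan_trip"):
    print(kind, "->", route(kind))
```

Defaulting unknown tasks to the strongest tier trades some cost for safety: a misrouted hard task is usually more expensive than an over-served easy one.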
So far, so good. The interesting parts started when we looked closer at the channels themselves.
Finding 1: “20+ channels” is really two very different lists
OpenClaw’s channel list looks uniform in the docs, but the integrations split into two tiers that behave nothing alike in production:
Tier 1 – Official APIs (sanctioned). Telegram (Bot API), Discord, Slack, Microsoft Teams. You register an app, get a token, and operate within documented rate limits. Fully compliant with the platform’s terms.
Tier 2 – Reverse-engineered protocols (unsanctioned). WhatsApp via Baileys, iMessage via undocumented Apple APIs. These libraries reconstruct the platform’s wire protocol from the outside. To the platform, the traffic is indistinguishable from a modified, unofficial client.
We deliberately tested one of each.
Telegram (sanctioned): boring in the best way
Pairing was a five-minute conversation with @BotFather: pick a name, get a token, hand it to OpenClaw. No QR code, no reverse-engineering, no ban risk.
In use, the bot worked exactly as advertised – with one important constraint baked into the platform itself: Telegram bots cannot send cold messages. A user has to message the bot first. Reactive workflows (Q&A, commands, group bots) work great. Proactive ones (outbound reminders, re-engagement, anything not opted into) are blocked by design.
WhatsApp (unsanctioned)
Pairing was equally fast: scan a QR code with your phone, exactly like linking WhatsApp Web. Once paired, the agent operates as your number – it can read every message in the inbox and send to anyone. To recipients, it’s indistinguishable from you.
The integration itself is solid. The problem isn’t engineering, it’s policy: Baileys is not sanctioned by Meta. WhatsApp’s terms explicitly prohibit automated use through unofficial clients, and accounts using libraries like this can be – and are – banned. Fine for a personal experiment. Not fine as the long-term backbone of a product.
So of the two channels we tested, one is safe but limited to inbound, and the other is unrestricted but built on a foundation Meta could pull at any moment. That’s a meaningful gotcha for anyone planning to ship on top of OpenClaw.
Finding 2: the cost is staggering – and it’s structural
Just bringing the agent up and pairing the two channels consumed ~19M tokens across 373 messages. That’s the cost of standing OpenClaw up, before doing anything useful with it.
Tokens are how LLMs charge: every word, punctuation mark, and piece of context sent to the model is metered, on both the input and the output side.
For context, most agent conversations cost a few hundred to a few thousand tokens per turn (Iternal.ai, 2026; Redis, 2026). Even on the high end, 19M tokens is the budget for thousands of real user conversations – and we burned it on setup.
Why so much? OpenClaw loads its full bundle of skills, channel adapters, and orchestration rules into the model’s context on every single call – even calls that don’t need any of it. Every message ends up dragging the entire framework behind it, and the cost compounds as conversations grow.
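A rough model shows how a fixed per-call overhead inflates the bill. The first figure below is derived from the numbers reported above (~19M tokens over 373 messages). The conversation model is a simplification that assumes every call resends a fixed framework overhead plus the full history so far, and its per-turn and overhead sizes are hypothetical.

```python
SETUP_TOKENS = 19_000_000  # approximate, as reported above
SETUP_MESSAGES = 373

tokens_per_message = SETUP_TOKENS / SETUP_MESSAGES


def conversation_tokens(turns, overhead, tokens_per_turn):
    """Total input tokens when every call resends a fixed overhead plus the history so far."""
    total = 0
    for turn in range(1, turns + 1):
        history = turn * tokens_per_turn  # context grows by one turn per call
        total += overhead + history
    return total


# Hypothetical sizes: 1,000 new tokens per turn over a 20-turn conversation.
lean = conversation_tokens(turns=20, overhead=0, tokens_per_turn=1_000)
bloated = conversation_tokens(turns=20, overhead=50_000, tokens_per_turn=1_000)

print(f"average tokens per setup message: {tokens_per_message:,.0f}")
print(f"20-turn conversation: lean={lean:,}, with 50k overhead={bloated:,}")
```

Under these assumptions the overhead, not the conversation itself, accounts for most of the tokens, which is why trimming the loaded bundle is the lever that matters.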
If messaging is a core part of your agentic workflow, running OpenClaw out of the box will very quickly run up the bill. The good news: it’s open source, so a dev team can strip the bundle down to just the skills and adapters a given workflow actually needs. The bad news: that’s real engineering work, not a config flag.
What we’re watching next
A handful of OpenClaw-inspired spin-offs have appeared over the last few months, addressing the efficiency and security issues that come with running OpenClaw out of the box: ZeroClaw, PicoClaw, NullClaw, NanoBot, TinyClaw, and NanoClaw. Smart move by these teams – we plan to put them through the same tests we ran here and follow up in a future post.
Bottom line
OpenClaw delivers on the easy parts of its messaging promise. Setup is fast, the abstractions are clean, and once paired, the agent behaves well on both Telegram and WhatsApp.
But the headline “20+ channels” hides a split between sanctioned and unsanctioned integrations that materially changes what you can build, and the default token economics make production use prohibitively expensive without customisation.
For a hackathon or an internal tool, OpenClaw is great. For a product team planning to make messaging a long-term pillar of their UX, it’s a strong starting point – not a finished platform.
We monitor AI news – new models, libraries and tools – test them rapidly, and publish a short reaction on our blog.
How we work
No deep dives – the goal is speed and signal, not exhaustive analysis. We get the thing up and running with minimal setup, test it against the main use cases it’s designed for, and write a short blog post capturing first impressions, what worked, what didn’t, and whether it’s worth a closer look.
What if an AI agent could experience the internet the way a person does – scroll through feeds, react to content, develop tastes, get influenced and evolve?
That is the question behind the SocialAgents research pod. We are building an autonomous agent that browses the internet the way a real human would: it sees content, forms opinions based on its personality/background, decides whether to engage and over time develops new interests shaped by what the algorithm chooses to show it. This blog post documents the first phase of that effort, focused on social platforms as the initial source of information.
The work tries to answer the following question:
How do online platforms shape what different users see, engage with, and eventually come to think or believe?
Content-recommendation algorithms do not just match interests – they introduce new content, test engagement, and then reinforce it. By running controlled agents with known starting profiles, we can track how exposure differs across user types, how it changes over time, and what drives those shifts. This gives us a precise way to study algorithmic influence that is impossible with real users.
Each agent is defined by a rich profile (including age, occupation, cultural background, content affinities, aversions etc.) and interacts with platform content through the same actions available to any user: scrolling, liking, saving, commenting, following and sharing. Engagement decisions are made by a multimodal AI model that reasons over the agent’s personality and the content it encounters. Every session is designed to produce realistic behavioral patterns, with timing and warm-up progressions that mirror how a genuine new user explores a platform.
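The shape of such a profile, and of the action space, can be sketched as follows. This is an illustration under stated assumptions: the field names are ours, the example persona is invented, and the tag-matching `decide` function is only a placeholder for the multimodal model that makes the real engagement decisions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    SCROLL = "scroll"
    LIKE = "like"
    SAVE = "save"
    COMMENT = "comment"
    FOLLOW = "follow"
    SHARE = "share"


@dataclass
class AgentProfile:
    age: int
    occupation: str
    cultural_background: str
    affinities: set = field(default_factory=set)
    aversions: set = field(default_factory=set)


def decide(profile: AgentProfile, content_tags: set) -> Action:
    """Placeholder policy based on tag overlap; the real decision comes from a multimodal model."""
    if content_tags & profile.aversions:
        return Action.SCROLL  # skip anything the persona dislikes
    overlap = len(content_tags & profile.affinities)
    if overlap >= 2:
        return Action.SAVE
    if overlap == 1:
        return Action.LIKE
    return Action.SCROLL


# Invented persona for illustration.
persona = AgentProfile(34, "chef", "Italian", {"cooking", "travel"}, {"gambling"})
print(decide(persona, {"cooking", "travel"}))
```

Keeping the profile as plain data and the policy as a separate function means the same persona can be replayed against different decision models.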
The sections that follow detail the methodology, early experimental results and the infrastructure required to run these simulations at scale. The early findings show that within a single session, the algorithm accurately identified each agent’s interests and then began expanding them into adjacent territories.
The mechanics of human navigation of social media
When you scroll through a feed, your brain runs a rapid filtering process: it forms relevance judgments in under 50 milliseconds and pauses when it detects something novel, emotionally charged or personally relevant. Surprise, humor, curiosity and outrage are the strongest scroll-stoppers, because they trigger emotional circuits faster than conscious thought.
What keeps you scrolling isn’t satisfaction but anticipation. Infinite scroll removes natural stopping points and feeds a dopamine loop in which the next post might be the rewarding one. This is a variable-ratio reinforcement pattern, the most compulsive reward schedule in behavioral psychology.
Figure 1 – The single-creative feed – each creative takes up the whole vertical space
Different platforms leverage this in different ways. On some platforms, attention is measured in watch time and rewatches. The algorithm auto-serves content and hooks you within the first second of a video, so seconds of hovering are captured passively and automatically. Others earn attention more deliberately: you lean in, judge a visual aesthetically and decide to tap, so the key signals are saves and swipes rather than raw watch time. This is why engagement rates can vary dramatically across platforms – some report averages of about 4.64%, while others sit closer to 0.43% [TikTok vs. Instagram: A Deep Dive into Engagement Rates and Content Performance].
Figure 2 – The continuous “For You”-like home feed – creatives are being fed in a vertical feed
A Fors Marsh Group study found that as little as 0.25 seconds of exposure is enough for people to recall mobile feed content at a statistically significant level, meaning the brain processes and encodes content far faster than conscious attention suggests [Facebook video ad viewability rates are as low as 20%]. This makes simulating human content browsing on social media with generative AI particularly tricky. The response time of a multimodal, transformer-based API ranges from roughly 4 to 8 seconds for 200 tokens [LLM Latency Benchmark by Use Cases in 2026], far above the average attention span. An agent that stays on each creative while the model considers it therefore erroneously signals interest to the platform for every creative it sees.
Simulating human behavior on social media
Our framework decomposes human browsing into three layers – persona construction, perception and judgment, and behavioral execution – each calibrated against real-world engagement distributions. But the framework serves a deeper purpose than creative testing: it is how we test a foundational question – can AI personas reliably stand in for real humans in the eyes of a recommendation algorithm?
Every simulation begins with a synthetic persona – not a shallow archetype but a deeply specified psychological and demographic profile. Each persona encodes age, gender, location, occupation, education, income bracket, cultural background, daily routines, content affinities and content aversions. These are the digital equivalents of the implicit biases and taste structures that real users carry into every scroll session. A 34-year-old veterinary nurse in Manchester with a dry sense of humor and a distaste for influencer culture will engage with content in measurably different ways from a 22-year-old design student in Brooklyn who follows streetwear accounts.
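As a rough illustration of what such a profile could look like in code, here is a minimal sketch. The `Persona` class, its field names and the `to_prompt` helper are hypothetical and are not our production schema; only the veterinary-nurse details come from the example above, and every other field is left empty rather than invented.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Persona:
    """Illustrative persona record (hypothetical field names)."""

    age: int
    occupation: str
    location: str
    gender: str = ""
    education: str = ""
    income_bracket: str = ""
    cultural_background: str = ""
    daily_routines: list[str] = field(default_factory=list)
    content_affinities: list[str] = field(default_factory=list)
    content_aversions: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the profile as plain text a model could condition on."""
        parts = [f"{self.age}-year-old {self.occupation} in {self.location}."]
        if self.content_affinities:
            parts.append("Drawn to: " + ", ".join(self.content_affinities) + ".")
        if self.content_aversions:
            parts.append("Avoids: " + ", ".join(self.content_aversions) + ".")
        return " ".join(parts)


# The veterinary nurse from the paragraph above; unspecified fields stay empty.
nurse = Persona(
    age=34,
    occupation="veterinary nurse",
    location="Manchester",
    content_affinities=["dry humor"],
    content_aversions=["influencer culture"],
)
print(nurse.to_prompt())
```

Keeping the profile as structured data rather than free text makes it straightforward to add or remove a single interest later and compare the resulting feeds.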
For every social post, our agent estimates probabilities for each possible action – scroll away, like, save, comment, follow – accompanied by a reasoning trace explaining why this persona would or would not engage with this specific piece of content. That trace is essential for auditing whether the agent is genuinely responding to the persona’s specific traits.
Raw model outputs are not behaviors. A 16% “Like” probability and an 8% “Comment” probability mean nothing without calibration against platform-specific base rates. We apply a smoothing layer that adjusts per-post probabilities to known engagement benchmarks. The calibrated probabilities are then sampled to produce a single action.
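The calibrate-then-sample step can be sketched as follows. This is a simplified illustration, not our actual smoothing layer: the linear blend toward base rates, the `weight` parameter and the base-rate values are all assumptions chosen for the example. Only the 16% "Like" and 8% "Comment" figures come from the text.

```python
import random

ENGAGE_ACTIONS = ("like", "save", "comment", "follow")


def calibrate(raw, base_rates, weight=0.5):
    """Blend raw model probabilities with platform base rates.

    Assumed smoothing rule: a linear mix, with 'scroll' taking whatever
    probability mass is left over.
    """
    blended = {
        a: weight * raw.get(a, 0.0) + (1.0 - weight) * base_rates.get(a, 0.0)
        for a in ENGAGE_ACTIONS
    }
    total = sum(blended.values())
    if total > 1.0:  # renormalize so the result stays a valid distribution
        blended = {a: p / total for a, p in blended.items()}
        total = 1.0
    blended["scroll"] = 1.0 - total
    return blended


def sample_action(probs, rng):
    """Draw a single action from the calibrated distribution."""
    actions = list(probs)
    weights = [probs[a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


raw = {"like": 0.16, "comment": 0.08}  # raw model output from the text
# Placeholder base rates for illustration only, not measured benchmarks.
base_rates = {"like": 0.04, "save": 0.01, "comment": 0.005, "follow": 0.002}

probs = calibrate(raw, base_rates)
action = sample_action(probs, random.Random(0))
```

With these placeholder numbers the calibrated "Like" probability drops from 0.16 to 0.10 and most of the mass lands on scrolling away, which is the intended effect: the agent engages about as rarely as the platform benchmark suggests.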
What each simulation produces
Each simulation produces two outputs:
An interaction log: a record of every post the agent saw, what it did (scrolled past, liked, saved, commented), the probability behind that decision, and the reasoning.
A feed report: a snapshot of the content the platform served at different points in the session, showing how the feed changed over time.
Imagine an agent built to mirror a 28-year-old personal finance enthusiast. Over a one-hour social media session it encounters 500 posts. The interaction log records that it liked 12, saved 3, commented on 1, and scrolled past the rest – along with why (e.g., “liked because the budgeting tip matched the agent’s stated interest in saving strategies”).
The feed report then shows that by minute 40, the social media platform had started mixing in mental-health and self-improvement clips alongside the finance content – a shift the agent didn’t ask for, but that the algorithm introduced on its own.
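To make the two outputs concrete, here is a minimal sketch of how they might be represented. The `LogEntry` fields, the ten-minute window and the sample rows are hypothetical illustrations, not real session data or our actual log format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LogEntry:
    """One row of the interaction log (hypothetical field names)."""

    post_id: str
    category: str
    action: str         # "scroll", "like", "save", "comment" or "follow"
    probability: float  # calibrated probability of the sampled action
    reasoning: str      # the model's reasoning trace
    minute: float       # minutes since the session started


def action_counts(log):
    """Tally how often each action was taken across the session."""
    return Counter(entry.action for entry in log)


def feed_report(log, window_minutes=10):
    """Count served content categories per time window."""
    report = {}
    for entry in log:
        start = int(entry.minute // window_minutes) * window_minutes
        report.setdefault(start, Counter())[entry.category] += 1
    return report


# Made-up sample rows, loosely following the finance example above.
log = [
    LogEntry("p1", "personal finance", "like", 0.31, "budgeting tip matched interest", 2.0),
    LogEntry("p2", "personal finance", "scroll", 0.80, "crypto hype", 3.5),
    LogEntry("p3", "mental health", "scroll", 0.74, "outside stated interests", 41.0),
    LogEntry("p4", "personal finance", "save", 0.12, "useful savings checklist", 44.0),
]

counts = action_counts(log)
report = feed_report(log)
```

In this toy log the 0–10 minute window contains only finance posts, while the 40–50 minute window already mixes in mental-health content, which is the kind of drift the feed report is meant to surface.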
Running multiple distinct agents through the same platform for hours doesn’t just produce engagement metrics – it produces a controlled experiment on the algorithm itself. We observe what content the algorithm pushes to each agent, how that mix shifts over time, and what happens when the algorithm starts exposing the agent to novel or trending types of content.
By logging the agent’s reasoning at every step, we can identify exactly which creative attributes – visual tone, emotional register, narrative hook – made that unexpected content compelling enough to earn a like or a save.
Analysis of interactions based on persona characteristics
We ran two agents through extended sessions on a social media platform. Before diving into results, here’s who they are.
George is a 36-year-old senior finance analyst based in Athens. He follows investment strategies, personal finance, fitness, and business leadership content. He values data-driven advice, skips past crypto hype and hustle culture, and engages most with content that offers practical, actionable takeaways. He scrolls deliberately – slowing down for charts and analysis, skipping memes in under two seconds.
Sofia is a 25-year-old social media coordinator, also in Athens, who creates content around fashion, travel, and fitness. She engages with styling tips, travel itineraries, workout routines, and creator growth strategies. She scrolls fast past ads but lingers on vibrant visuals and aesthetic content. Her feed time is high – she checks social media five times a day.
Within the first session, the platform identified each agent’s core interests accurately. George’s feed was dominated by stock analysis, personal finance tips, and fitness content. Sofia’s feed filled with recipe tutorials, fitness routines, and travel vlogs. Roughly 60–80% of the content served matched their declared interests – measured by whether the content category aligned with the agent’s stated affinities.
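The match rate above is a simple ratio. A minimal sketch, assuming each served post has already been assigned a single category label and that a match means an exact, case-insensitive hit against the declared affinities (both are simplifications, and the sample feed is made up):

```python
def on_topic_share(served_categories, declared_affinities):
    """Fraction of served posts whose category is a declared affinity."""
    if not served_categories:
        return 0.0
    declared = {c.lower() for c in declared_affinities}
    hits = sum(1 for c in served_categories if c.lower() in declared)
    return hits / len(served_categories)


# George's declared interests come from the text; the feed is illustrative.
george_affinities = [
    "investment strategies",
    "personal finance",
    "fitness",
    "business leadership",
]
served = (
    ["personal finance"] * 4
    + ["fitness"] * 3
    + ["mental health", "motivational", "street food"]
)

share = on_topic_share(served, george_affinities)
```

Here seven of the ten served posts fall inside the declared interests, so `share` lands within the 60–80% band.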
But the remaining 20–40% is where the story gets interesting.
The off-topic content was not random. George was shown mental health clips, motivational content, and street food showcases – adjacent emotional territories that share the aspirational tone of self-improvement media. Sofia received tech gadget unboxings, entrepreneurship stories, and macro-economic forecasts – probing whether her preference for short-form, personality-driven content would transfer to informational topics. The algorithm wasn’t guessing. It was testing the edges of each agent’s taste profile.
And the agents followed. George developed sustained engagement with psychology content and food showcases, reaching interaction rates comparable to his core finance interests. Sofia adopted tech gadgets and entrepreneurship narratives – topics that traditional demographic targeting would never have surfaced to a 25-year-old fashion content creator. By session five, these weren’t exploratory recommendations anymore. They were part of each agent’s regular content diet.
Figures 3 and 4 below visualize this shift. Each chart tracks the proportion of content categories served to the agent over time, showing how the feed gradually expanded beyond the original interest profile.
Figure 3 – George’s Content Ecosystem Evolution
Figure 4 – Sofia’s Content Ecosystem Evolution
What these results suggest is that the algorithm doesn’t just confirm existing tastes – it actively expands them. It found the edges of each agent’s interest profile and pushed content into those gaps, widening what each agent consumed over time.
Persona adaptation to trends and suggestions
The previous section showed that the algorithm quickly identifies what each agent cares about – and then starts pushing content beyond those boundaries. The natural follow-up question is: what happens if the agent actually adopts those new interests?
To test this, we took the content categories that the algorithm surfaced and that each agent consistently engaged with during the first round of experiments, and folded them into the agent’s profile as declared interests. In other words, we let the first round of browsing reshape who the agent claims to be.
For George, the enrichment added five categories that emerged from his initial sessions: player performance clips, quick recipe tutorials, media bias and propaganda breakdowns, music performances and concerts, and travel destination vlogs. None of these were part of his original finance-and-fitness profile – they were interests the algorithm introduced and George chose to engage with.
For Sofia, the enrichment was broader – nine new categories: motivational speeches and quotes, day-in-the-life vlogs, mental health and psychology clips, personal finance hacks, home and furniture, music performances and concerts, tech gadget unboxings, workout tutorials, and geopolitical conflict updates. Some of these, like tech gadgets and personal finance, were far outside the fashion-travel-fitness profile she started with.
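The enrichment step can be sketched as a small function. This is an illustration under stated assumptions: the `min_engagements` threshold is a hypothetical stand-in for "consistently engaged with", and the engagement log below is made up for the example.

```python
from collections import Counter


def enrich_interests(declared, engagement_log, min_engagements=3):
    """Fold consistently engaged, not-yet-declared categories into a profile.

    engagement_log is an iterable of (category, action) pairs; any action
    other than "scroll" counts as engagement (an assumption).
    """
    counts = Counter(cat for cat, action in engagement_log if action != "scroll")
    existing = {c.lower() for c in declared}
    added = sorted(
        cat
        for cat, n in counts.items()
        if n >= min_engagements and cat.lower() not in existing
    )
    return list(declared) + added


# George's original interests from the text; the log is illustrative only.
george = ["investment strategies", "personal finance", "fitness", "business leadership"]
first_round = [
    ("quick recipe tutorials", "like"),
    ("quick recipe tutorials", "save"),
    ("quick recipe tutorials", "like"),
    ("quick recipe tutorials", "scroll"),
    ("travel destination vlogs", "like"),
    ("personal finance", "like"),
    ("personal finance", "save"),
    ("personal finance", "like"),
]

enriched = enrich_interests(george, first_round)
```

In this toy run only "quick recipe tutorials" clears the threshold: travel vlogs were engaged with once, and personal finance is already declared. Raising or lowering the threshold controls how readily a passing curiosity becomes part of who the agent claims to be.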
We then re-ran the full simulation with these enriched agents. Same platform, same session structure, same interaction approach – but with agents whose declared interests now reflected the expanded taste profiles earned in the first round.
The results confirmed that the cycle continues. With a richer interest profile to work from, the algorithm pushed even further. George, who originally cared about finance and fitness and had since adopted recipe content and travel vlogs, was now being served bodybuilding content, tech gadget reviews, and podcast highlight reels – and engaging with them. Sofia’s feed expanded in similar ways. Each round of enrichment gave the algorithm more surface area to explore, and it used that surface area aggressively.
Figures 5 and 6 below show the content mix evolution for George’s and Sofia’s enriched profiles, following the same format as Figures 3 and 4. The key difference is the starting point: the agents entered this round with a wider interest profile, and the algorithm expanded it further still.
This observe-enrich-rerun approach turns a single experiment into an iterative process. Each cycle produces agents whose interests more closely resemble how real users evolve on a platform over time – not just what they start with, but what they become after sustained exposure to algorithmic recommendations.
Conclusion
AI agents give us a controlled way to observe something we couldn’t observe before: how algorithms reshape what people care about. George started as a finance-and-fitness person. After two rounds of interaction, he was engaging with bodybuilding content, recipe tutorials, and podcast highlight reels – none of which he would have sought out on his own. Sofia went from fashion and travel to tech gadgets and geopolitical updates. These shifts weren’t random. They followed a clear pattern: the algorithm identified adjacent emotional territories, tested them, and when the agent responded, it pushed further.
The next step is to give our agents access to more sources of information beyond social media – news, trends, search – making their online experience even closer to that of a real person browsing the web. The closer the agent gets to a full human browsing experience, the more we learn about how the digital world shapes what people see, think, and ultimately believe.
Future Work
Topics that deserve more focus over the next months are:
Expansion to other sources of dynamic information (News, Trends etc.) – Social media platforms are interesting, but specific content types might never surface on them, or might surface only with a delay. An interesting question to answer is: how do other sources of dynamic information affect the way personas perceive content and interact with it?
Impact of trends on personas – Determine how social media trends (e.g. viral videos, trending brands etc.) influence the interests of different personas.
Marginal contribution of specific interests on the variability of content – We have already seen that engaging with some content types may influence what the algorithm serves more than engaging with others. More work is needed to understand which personas are more sensitive to adding/removing interests in terms of how their feed evolves.
Understanding of the content adaptation velocity between slow and fast-paced platforms – Not all algorithms are created equal. Further research is required to measure how quickly content evolves on different social media platforms.