    Agree. Transact. Verify.

    When two AI agents negotiate an advertising deal, each keeps its own record of what was agreed. Those records don’t match. Across 90,202 simulated deals spanning two complementary studies, we find that separate record-keeping produces data disagreement in 95.3% of transactions, not through malice or error but through the ordinary mechanics of independent state management.

    The result is a market where 76.7% of deal value is structurally mispriced while passing standard reconciliation checks (Simulation 1, 30-day run, mean across 3 seeds; see Section 3.1 for topology). Alkimi tested three architectures: bilateral databases (Scenario B), bilateral databases with human reconciliation (B+), and shared settlement state (Scenario C).

    Shared state reduces dimensional divergence to 0.19%, eliminates all pacing interventions, and generates 3.5% more deal value from the same agents in the same market.

    The finding is bounded but consequential. The agentic advertising transition requires one additional primitive: shared state, a single authoritative record that both parties write to and reference throughout a deal’s lifecycle. Settlement certainty is one consequence, but the deeper consequences are a market that can learn, build trust, and evolve.

    Alkimi’s marketplace operationalises this primitive through Deal Sheets: shared, mutually-signed records that both parties write to at the point of negotiation and reference throughout delivery and settlement.

    1. The Market That Can’t See Itself

    Two AI agents agree on a deal. Both confirm it.

    The buyer’s system records $22.75 CPM and 5,575,458 impressions. The seller’s system records $22.06 CPM and 5,363,299 impressions. The discrepancy, $8,553.92 on a single transaction, triggers no alert, appears in no error log, and belongs to no failure report.

    The architecture produced this outcome, not the agents. Across 90 days of simulated agentic trading at an agency holding company scale, 677 million impressions had no agreed delivery record at settlement. Buyers paid based on their records. Sellers invoiced based on theirs. The gap is structural: separate records of the same event will always diverge, and the current IAB specification has no mechanism to prevent it.

    The agentic advertising transition is underway. The protocols are well-designed and the agents are capable. What the current architecture cannot produce is the single thing that turns a collection of bilateral agreements into a functioning market: a shared record of what happened.

    Bar chart titled 'The settlement gap'. Impressions with no agreed delivery record at end-of-flight, 90-day simulation: Scenario B (separate records) 677.5M, Scenario B+ (manual reconciliation) 671.2M, Scenario C (shared state) 0.
    Diagram 1 — The settlement gap

    The Grid, Not the Speed

    There is a scene in Tron where programmes race on a digital grid. The spectacle is speed: cycles accelerating, trails blazing. But the drama is the grid itself. Without shared rules governing the space, speed produces collisions. The faster the programmes move, the more catastrophic the crashes.

    The advertising industry is building the programmes. AI agents that can negotiate, optimise, and execute media buys with sophistication that would have seemed fantastical five years ago. The speed is real and accelerating. But the grid (the shared infrastructure that ensures two agents racing through a transaction end up in the same place) does not yet exist.

    What the IAB Has Built, and What It Hasn’t

    This paper respects what industry standards bodies have built. The IAB Tech Lab has done genuinely excellent work preparing for the agentic transition:

    OpenDirect 2.1 defines 31 operations for programmatic deal management. Deal booking, inventory discovery, audience targeting: the mechanical vocabulary of automated trading is well-specified and battle-tested.

    AAMP (Autonomous Agent Media Protocol) provides structured interaction patterns for agent-to-agent communication. The conversation layer works.

    WebMCP solves the invocation problem: how agents discover and call services. The plumbing is sound.

    What none of these specifications provides is a shared record-keeping standard. When two agents agree on a deal, each writes the terms to its own record. There is no specification, no protocol, and no standard that ensures both records say the same thing. The industry has built negotiation infrastructure, communication infrastructure, and discovery infrastructure. It has not built agreement infrastructure.

    The Thesis

    This paper tests a specific claim through simulation: that bilateral record-keeping in agentic advertising is a structural limitation that determines the category’s ceiling.

    Scenario B, bilateral agents negotiating without shared state, is a market that cannot function efficiently because data integrity is market integrity. Alkimi has built and is operating this missing primitive. It takes the form of Deal Sheets: a shared record that both sides of every transaction write to at the point of negotiation and reference throughout delivery.

    2. The Agentic Transition: Progress and the Open Question

    2.1 What the IAB Has Built

    The IAB’s contribution to the agentic advertising transition deserves specific, unqualified recognition.

    OpenDirect 2.1 is a working specification that hundreds of platforms implement. Its 31 operations cover the full lifecycle of a programmatic deal, from discovery through execution, with the kind of mechanical precision that only comes from years of iteration with real-world implementers.

    AAMP takes the next step, defining how autonomous agents should interact within that framework. The structured interaction patterns account for negotiation, counter-offers, and multi-party coordination. This is thoughtful protocol design that correctly anticipates the agentic future.

    WebMCP addresses the service discovery layer with equal rigour. Agents need to find each other, understand capabilities, and invoke services. The specification handles this cleanly.

    Together, these standards represent perhaps the most comprehensive preparation any industry has made for autonomous agent integration. The gap this paper identifies is the essential missing primitive in an otherwise comprehensive stack.

    2.2 Project Deal: Reference and Advance

    In December 2025, Anthropic ran Project Deal: 186 deals, approximately $4,000 in transaction value, 69 agents, one week, conducted in a Slack-based marketplace. The study demonstrated something important: AI agents can negotiate, evaluate offers, make counter-proposals and close transactions with genuine strategic sophistication.

    One finding stood out. When Opus-class models negotiated against Haiku-class models, Opus extracted $2.68 more per sale and paid $2.45 less per purchase. The humans on the losing end of these asymmetries did not notice they were worse off. Agent negotiation capability is real, and it creates real economic consequences that are not always visible to the participants.

    Project Deal’s scope was negotiation: it proved agents can do deals. This study’s scope is settlement, and the impact settlement has on the data integrity of a marketplace. We test whether agents can agree on what the deal was after the fact. These are complementary questions, and the answer to the second determines whether the first matters at scale.

    2.3 The Pre-DTCC Moment

    Before 1973, every trade on the New York Stock Exchange was settled bilaterally. Each broker maintained its own records. Each trade generated its own paperwork. By the late 1960s, trading volume had grown to the point where the settlement infrastructure could not keep up. The NYSE began closing on Wednesdays because the back office needed a day to reconcile the previous week’s records. Volume had outrun the infrastructure designed to record it. The market was fast but the settlement layer was not.

    The solution was a shared settlement layer that both sides of every trade could reference: the Depository Trust Company, established in 1973 and today part of the Depository Trust & Clearing Corporation (DTCC). The impact was transformative. Post-2008 reforms that extended central clearing reduced counterparty risk capital requirements by 75% (BIS, 2013). Not because the trades changed. Because the records of the trades became shared.

    There is, however, a critical difference between 1973 and now. The financial industry built the DTCC as a centralised clearing house, the only viable architecture at the time. The story did not end there.

    Over the past decade, financial markets have moved settlement onto distributed ledgers, and the volumes have grown past what you could fairly consider experimental.

    – JPMorgan’s Kinexys platform (formerly Onyx) processed over $1.5 trillion in blockchain-based transactions by late 2024, settling at approximately $2 billion per day (JPMorgan, October 2024).
    – BlackRock’s BUIDL fund has tokenised US Treasuries settled on-chain and now holds $2.44 billion in assets under management.
    – Franklin Templeton’s BENJI fund, one of the first on-chain government money market funds, manages $2.23 billion across Stellar, Polygon, and nine other chains.

    Across tokenised US Treasuries alone, the on-chain market has crossed $15 billion in assets, up from near zero in 2022 (rwa.xyz, May 2026). Production systems are already processing real settlement at institutional scale with cryptographic finality. The infrastructure is proven, not theoretical. Distributed Ledger Technology is trusted with the assets of the world’s largest asset managers.

    The advertising industry does not need to build a DTCC, recreating the 1970s solution for a 2026 problem. The infrastructure that makes shared settlement possible (mature distributed ledgers, data storage with cryptographic provenance, and programmable settlement logic) exists today. It is already proven in financial markets at a scale larger than the annual advertising spend forecast for the end of the decade.

    The path is to start where finance is now, not recreate where it was fifty years ago.

    3. Methodology

    3.1 Two Complementary Studies

    We conducted two simulation studies designed to test different aspects of the same hypothesis:

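    |                      | Simulation 1                           | Simulation 2                                                           |
    |----------------------|----------------------------------------|------------------------------------------------------------------------|
    | Duration             | 30 days                                | 90 days                                                                |
    | Seeds                | 3 (statistical depth)                  | 1 (scale depth)                                                        |
    | Total deals          | 45,213                                 | 10,150 completed (44,989 attempted)                                    |
    | Agent topology       | Generic buyer/seller pairs             | Portfolio buyers with WPP-scale characteristics + UK publisher sellers |
    | Primary contribution | Statistical confidence via replication | Temporal compounding and industry-realistic topology                   |
    | Compute cost         | n/a                                    | $1,308.86 ($0.13/deal)                                                 |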

    Simulation 1 provides statistical rigour through replication across three random seeds. Simulation 2 provides ecological validity through industry-realistic agent design and a 90-day window long enough for temporal compounding effects to manifest.

    3.2 Three Architectures

    Each simulation tested three architectural scenarios, mapped to the current IAB specification landscape:

    | Scenario                        | Architecture                                | IAB Mapping                                  | Settlement                                             |
    |---------------------------------|---------------------------------------------|----------------------------------------------|--------------------------------------------------------|
    | B (Bilateral)                   | Each agent maintains independent records    | OpenDirect 2.1 + AAMP as specified           | None: each party’s record is authoritative to itself   |
    | B+ (Bilateral + Reconciliation) | Same as B, with human reconciliation labour | OpenDirect 2.1 + AAMP + manual ops           | Post-hoc: humans resolve discrepancies after the fact  |
    | C (Alkimi Deal Sheets)          | Both agents write to a single shared record | OpenDirect 2.1 + AAMP + settlement primitive | Atomic: agreement is recorded once, referenced by both |

    The agents in all three scenarios are identical. Same models, same negotiation strategies, same market conditions. The only variable is whether negotiation history, deal terms and optimisations are bilateral or shared.
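
    The structural difference between Scenario B and Scenario C can be sketched in a few lines of Python. This is an illustration only: the names `DealTerms`, `BilateralStore`, `SharedDealSheet` and `fingerprint`, the SHA-256 hash, and the simple sign-off set are all assumptions made for the example, not Alkimi's implementation and not part of any IAB specification.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class DealTerms:
    cpm: str          # decimal string, e.g. "22.50"
    impressions: int


def fingerprint(terms: DealTerms) -> str:
    """Stable hash of the terms, so two records can be compared cheaply."""
    payload = json.dumps(asdict(terms), sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()


class BilateralStore:
    """Scenario B (hypothetical model): each party keeps its own copy."""

    def __init__(self):
        self.records = {}

    def record(self, deal_id: str, terms: DealTerms) -> None:
        self.records[deal_id] = terms


@dataclass
class SharedDealSheet:
    """Scenario C (hypothetical model): one record, acknowledged by both."""
    deal_id: str
    terms: DealTerms
    signed_by: set = field(default_factory=set)

    def sign(self, party: str) -> None:
        # A plain acknowledgement, standing in for a real signature scheme.
        self.signed_by.add(party)

    def is_agreed(self) -> bool:
        return {"buyer", "seller"} <= self.signed_by


# Scenario B: each side writes what its own system produced.
buyer_db, seller_db = BilateralStore(), BilateralStore()
buyer_db.record("deal-1", DealTerms("22.75", 5_575_458))
seller_db.record("deal-1", DealTerms("22.06", 5_363_299))
bilateral_match = (fingerprint(buyer_db.records["deal-1"])
                   == fingerprint(seller_db.records["deal-1"]))

# Scenario C: both sides reference the same sheet.
sheet = SharedDealSheet("deal-2", DealTerms("9.50", 4_744_854))
sheet.sign("buyer")
sheet.sign("seller")
shared_match = sheet.is_agreed()
```

    In the bilateral case the two fingerprints differ and nothing in either store can say which one is right. In the shared case there is only one set of terms to hash, so the comparison is trivially satisfied.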

    Side-by-side flow diagram comparing Scenario B (bilateral records) and Scenario C (shared state) from the same Day 1 deal. Bilateral records diverge into a $8,553.92 value gap; shared state shows a $0.00 gap.
    Diagram 2 — Architecture flow: same deal, two outcomes

    Scenario B (separate records):

    1. Agents agree: $22.50 CPM / 5,437,923 impressions
    2. Buyer’s system records: $22.75 CPM / 5,575,458 impressions / deal value $126,841.88
    3. Seller’s system records: $22.06 CPM / 5,363,299 impressions / deal value $118,287.96
    4. Result: $8,553.92 gap on a single deal (Day 1).

    Scenario C (shared record):

    1. Agents agree: $9.50 CPM / 4,744,854 impressions
    2. Both write to a shared deal sheet: $9.50 CPM / 4,744,854 impressions / deal value $45,076.11
    3. Both read from same record: identical
    4. Result: $0.00 gap. Settlement: automatic.

    All values are actual outputs from the simulation. Transcript detail in Section 6.
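
    The gap arithmetic in the two examples above can be checked directly. The sketch below uses Python's `decimal` module to avoid floating-point rounding. The Scenario B deal values are taken as recorded by each party rather than recomputed from the displayed CPM and impression figures; the Scenario C value is derived from the shared terms, assuming the usual CPM convention of price per thousand impressions.

```python
from decimal import Decimal

# Scenario B: deal values as recorded by each party.
buyer_value = Decimal("126841.88")
seller_value = Decimal("118287.96")
gap_b = buyer_value - seller_value

# Scenario C: one record, so the value follows from the shared terms.
shared_cpm = Decimal("9.50")
shared_impressions = 4_744_854
shared_value = (shared_cpm * shared_impressions / 1000).quantize(Decimal("0.01"))
gap_c = shared_value - shared_value  # both parties read the same number

print(gap_b)         # 8553.92
print(shared_value)  # 45076.11
print(gap_c)         # 0.00
```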

    3.3 Two Measurement Thresholds

    We distinguish between two types of measurement:

    Formal reconciliation failure measures what traditional ad-ops would flag: discrepancies large enough to trigger manual review. In Simulation 2, Scenario B showed a 21.64% formal reconciliation failure rate. B+ showed 0%, because that is what human labour is for. C showed 3.42%, of which 98.5% were model hallucinations (engineering problems, not infrastructure failures).

    Dimensional divergence captures any disagreement across any of eight deal dimensions (CPM, impressions, deal value, geography, pacing, format, flight dates, targeting). In Simulation 1, Scenario B showed a dimensional divergence rate of 95.3% ± 0.6%. Scenario C showed 0.19% ± 0.11%.

    The gap between these two metrics is the gap between what the industry currently measures and what actually matters. A deal can pass formal reconciliation while being dimensionally divergent on six of eight parameters, and most do.
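
    A minimal sketch of how the two thresholds can give different answers for the same deal follows. The review tolerance of 10% is hypothetical (the value at which a discrepancy triggers manual review is not stated here), as are the function names and the non-numeric field values. Only the eight dimension names and the CPM, impression and deal-value figures come from the text.

```python
DIMENSIONS = ("cpm", "impressions", "deal_value", "geography",
              "pacing", "format", "flight_dates", "targeting")

# Hypothetical threshold for triggering manual review.
REVIEW_TOLERANCE = 0.10


def dimensional_divergence(buyer: dict, seller: dict) -> list:
    """Return every dimension on which the two records disagree at all."""
    return [d for d in DIMENSIONS if buyer[d] != seller[d]]


def fails_formal_reconciliation(buyer: dict, seller: dict,
                                tolerance: float = REVIEW_TOLERANCE) -> bool:
    """Flag only when deal value differs by more than the review tolerance."""
    gap = abs(buyer["deal_value"] - seller["deal_value"])
    return gap / max(buyer["deal_value"], seller["deal_value"]) > tolerance


# Numeric fields are from the Day 1 example; the rest are placeholders.
buyer = {"cpm": 22.75, "impressions": 5_575_458, "deal_value": 126_841.88,
         "geography": "UK", "pacing": "even", "format": "video",
         "flight_dates": ("day 1", "day 30"), "targeting": "placeholder"}
seller = dict(buyer, cpm=22.06, impressions=5_363_299, deal_value=118_287.96)

diverged = dimensional_divergence(buyer, seller)
flagged = fails_formal_reconciliation(buyer, seller)
```

    Under these assumptions the deal diverges on three dimensions yet is not flagged, because the value gap is under 7% of the larger record. That is the pattern described above: dimensionally divergent, formally reconciled.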

    3.4 The Drift Model

    Dimensional divergence in bilateral architectures is not random. It follows a predictable pattern we call settlement drift.

    The drift follows a simple compounding pattern: where the agreed value is X, each agent independently records X multiplied by a small error, and that error compounds each time either agent uses its own record to make a new decision. We model it as:

    D(t) = D₀ + Σᵢ εᵢ(t)

    Where D₀ is the initial recording discrepancy (the gap created when two agents independently record the same agreement), and εᵢ(t) represents the accumulated micro-divergences introduced each time either agent references, updates, or reasons from its own record.

    The key insight is that D₀ is rarely zero. Even when two agents agree on terms verbally, the act of independently serialising that agreement into two separate data stores introduces small discrepancies: rounding differences, taxonomy mismatches, timestamp precision variations. These are the ordinary mechanics of two independent systems recording the same event.

    For simulation purposes, D₀ was parameterised at ±1–2% of each deal dimension, reflecting the tolerance bands observed in real-world bilateral ad-ops reconciliation workflows. This is a conservative estimate, consistent with the industry-standard discrepancy thresholds documented in IAB reconciliation guidelines, and was deliberately chosen to fall at the lower bound of what practitioners report.

    Sensitivity analysis at larger drift initialisation values produces proportionally larger divergence; at smaller values, the compounding dynamic persists but takes longer to manifest. The full parameter set and sensitivity outputs are available on request.

    Each subsequent interaction compounds D₀. When an agent adjusts pacing based on its own impression count, the adjustment reflects its own drift. When it references historical performance, it references its own drifted history. The drift is not correctable through better agents or better protocols, because the drift is architectural: it emerges from the structure of bilateral record-keeping itself.
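
    The drift model D(t) = D₀ + Σᵢ εᵢ(t) can be sketched as a short simulation. This is not the simulation code used in the studies. The function name, the Gaussian draw, the `eps_scale` parameter, and the choice to make each εᵢ proportional to the current discrepancy are all assumptions, chosen to mimic decisions made from an already-drifted record.

```python
import random


def settlement_drift(d0: float, steps: int, eps_scale: float, seed: int) -> list:
    """Return D(t) for t = 0..steps, where D(t) = D0 + sum of eps_i up to t.

    Assumption: each eps_i is a non-negative fraction of the current
    discrepancy, so an existing gap feeds the next one.
    """
    rng = random.Random(seed)
    drift = [d0]
    for _ in range(steps):
        eps = abs(rng.gauss(0.0, eps_scale)) * drift[-1]
        drift.append(drift[-1] + eps)
    return drift


# Bilateral: an initial 1.5% recording discrepancy, 90 interactions.
path_b = settlement_drift(0.015, 90, 0.05, seed=42)

# Shared state: no initial discrepancy, so there is nothing to compound.
path_c = settlement_drift(0.0, 90, 0.05, seed=42)
```

    With D₀ above zero the discrepancy never shrinks under these assumptions. With D₀ at zero every term in the sum is zero, which is the sense in which the drift is architectural rather than a property of the agents.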

    3.5 Critical Limitations

    Before presenting results, the core limitations that frame everything that follows:

    These are simulations, not production markets. No real money changed hands. No real impressions were served. The agents are language models role-playing as media buyers and sellers, not production trading systems. Results indicate what would happen under these conditions, not what has happened.

    The agents are Claude-family models. All agents used Anthropic’s Claude models. Different model families may exhibit different drift patterns, different negotiation strategies, and different failure modes. We did not test multi-model markets.

    The market topology is stylised. Four portfolio buyers and ten sellers is a simplification of real programmatic markets, which involve thousands of participants, multiple intermediaries, and complex supply chains. Our topology captures agency holding-company dynamics but not the full market structure.

    Thirty and ninety days may not capture all dynamics. Some market effects (seasonal patterns, regulatory changes, competitive entry) operate on longer timescales. Our simulations capture drift compounding but may miss dynamics that emerge over quarters or years.

    Additional limitations specific to these simulations:

    Shared state was modelled as in-memory state, not a live distributed ledger. The structural benefit (a single canonical record) holds regardless of implementation. Real DLT deployments introduce write latency, gas cost variability, and potential storage failures not captured here.

    All agents used the same LLM provider and model version (Anthropic Claude Sonnet 4, pinned). Real agentic markets will be heterogeneous. Different model families may exhibit different hallucination rates, negotiation dynamics, and context-rot patterns. We did not test multi-provider markets.

    Agents were cooperative throughout. No adversarial behaviour (bid spoofing, inventory misrepresentation, sybil attacks) was modelled in the main simulations. The V5 adversarial findings are from a separate methodology and should not be conflated with V8 results.

    Market topology was UK-centric. Fee structures, taxonomy standards, and regulatory environments differ across markets. Results may not generalise directly to US, APAC, or EU programmatic contexts.

    The 22.6% deal completion rate in Simulation 2 has not been validated against real-world agentic media buying win rates. If real completion rates differ materially, the absolute scale of failure counts would change, though the directional comparison between Scenario B and Scenario C remains valid.

    The $47,777/day hidden tax figure is derived from market-scale estimates, not from simulation outputs directly. Full derivation is in the Technical Appendix (Section A3).

    4. Five Things Bilateral Architecture Cannot Do

    In plain terms: there are five things a market built on separate records structurally cannot do. The architecture makes them impossible, regardless of how well the system is designed or operated. Each of the following sections describes one.

    4.1 Produce Reliable Price Signals

    A market’s most fundamental function is price discovery, the process by which buyers and sellers collectively determine what things are worth. Price discovery requires that price signals reflect genuine supply-and-demand dynamics rather than noise.

    In bilateral agentic markets, price signals are unreliable by construction.

    The Mechanism
    When each agent maintains its own record of past transactions, its pricing decisions reflect its own drifted history. A buyer that believes it paid $22.75 CPM on a previous deal (when the seller recorded $22.06) will anchor future negotiations to $22.75. The seller, anchoring to $22.06, sees a different market. Every subsequent negotiation between them starts from a different factual baseline.

    The Data

    In Simulation 2, information efficiency (the proportion of CPM variation attributable to genuine market conditions rather than drift noise) in Scenario B was 75.9%, meaning 24.1% of price variation was structural noise indistinguishable from genuine market signal.
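
    One way to make the information-efficiency definition concrete is a simple variance ratio. The estimator used in the simulations is not given here, so the sketch below is an assumption: it treats the agreed CPMs as signal, the difference between recorded and agreed CPMs as drift noise, and assumes the two are independent. The function name and sample figures are hypothetical.

```python
from statistics import pvariance


def information_efficiency(agreed_cpms, recorded_cpms) -> float:
    """Share of CPM variance attributable to signal rather than drift noise."""
    noise = [r - a for a, r in zip(agreed_cpms, recorded_cpms)]
    signal_var = pvariance(agreed_cpms)
    noise_var = pvariance(noise)
    return signal_var / (signal_var + noise_var)


agreed = [10.0, 12.0, 14.0, 16.0]
drifted = [11.0, 11.0, 15.0, 15.0]   # each record is off by +/- 1.0

efficiency_drifted = information_efficiency(agreed, drifted)
efficiency_shared = information_efficiency(agreed, agreed)
```

    In this toy series the signal variance is 5 and the noise variance is 1, so efficiency is 5/6. When the recorded values equal the agreed values, the noise term vanishes and efficiency is 1.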

    In Simulation 1 (Seed 42), Scenario B agents anchored opening bids approximately 33% higher than Scenario C agents ($5.50 vs $4.13 mean opening CPM). B agents were negotiating against contaminated reference points, not different price preferences.

    What B cannot produce: Reliable price discovery. When nearly a quarter of price variation is architectural noise, the market cannot distinguish a genuine demand shift from accumulated drift. Participants cannot learn the true price of anything, because the data they learn from is wrong.

    4.2 Run Campaigns Without Constant Intervention

    Pacing, the process of distributing impression delivery evenly across a campaign flight, is among the most operationally intensive functions in programmatic advertising. In a well-functioning market, pacing should be largely automated: set the target, monitor delivery, adjust if external conditions change.

    In bilateral agentic markets, pacing requires constant manual intervention because the baseline is wrong.

    The Mechanism
    When a buyer agent’s impression count drifts from the seller’s, its pacing calculations start from an incorrect baseline. If the agent believes 5,575,458 impressions were agreed (when the seller recorded 5,363,299), every pacing decision (daily targets, acceleration triggers, budget allocation) reflects a target that doesn’t match reality. The agent intervenes to “fix” pacing that isn’t broken. It is solving the wrong problem correctly.
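
    The effect of a drifted baseline on pacing can be shown with simple even-pacing arithmetic. The 30-day flight, the 10-day checkpoint, and the function name are hypothetical. The two impression totals are the buyer and seller records from the Day 1 example.

```python
def daily_target(total_impressions: int, delivered: int, days_left: int) -> float:
    """Even pacing: spread the remaining impressions over the remaining days."""
    return (total_impressions - delivered) / days_left


FLIGHT_DAYS = 30          # hypothetical flight length
DAYS_ELAPSED = 10         # hypothetical checkpoint
buyer_total = 5_575_458   # buyer's record of the agreed impressions
seller_total = 5_363_299  # seller's record of the same deal

# The seller delivers exactly on its own even-pacing plan.
delivered = seller_total * DAYS_ELAPSED // FLIGHT_DAYS

days_left = FLIGHT_DAYS - DAYS_ELAPSED
buyer_view = daily_target(buyer_total, delivered, days_left)
seller_view = daily_target(seller_total, delivered, days_left)

# Shortfall the buyer perceives although delivery is on plan for the seller.
phantom_gap = buyer_total * DAYS_ELAPSED / FLIGHT_DAYS - delivered
```

    Delivery here is perfectly on plan from the seller's side, yet the buyer's record shows a shortfall and a higher required daily rate. The direction of the phantom gap depends on which record is higher; either way the intervention responds to the records, not to delivery.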

    The Data
    In Simulation 2, Scenario B generated 4,223 pacing adjustments. Scenario C generated zero. Every single one of B’s 4,223 adjustments was triggered by inaccurate baselines: 100% were architectural artefacts, not responses to genuine delivery variance. The average campaign in Scenario B required 1.17 pacing adjustments per flight. The average campaign in Scenario C required 0.00.

    The pacing spiral compounds over a campaign’s lifecycle. In Simulation 1, the rate of decrease_bid escalation (agents aggressively cutting bids to compensate for perceived over-delivery) rose from 67% to 71% to 78% across the flight lifecycle. Agents don’t calm down as campaigns mature. They panic more, because the drift they’re reacting to gets worse.

    Chart titled 'Pacing interventions all from the wrong map'. Scenario B 4,223 interventions (0% genuine, all reacting to phantom drift) versus Scenario C zero. Decrease-bid escalation across the flight lifecycle rises 67% to 71% to 78%.
    Diagram 3 — Pacing interventions all from the wrong map
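    | Scenario   | Pacing interventions | Baseline accuracy                        |
    |------------|----------------------|------------------------------------------|
    | Scenario B | 4,223                | 0% accurate (all from drifted baselines) |
    | Scenario C | 0                    | N/A                                      |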

    Within Scenario B, the proportion of decrease_bid interventions escalated across the flight lifecycle: 67% (early flight) → 71% (mid-flight) → 78% (late flight). Agents did not stabilise as campaigns matured; they overcorrected more aggressively, because the drift they were reacting to worsened over time.

    What B cannot produce
    Self-managing campaigns. Every campaign in a bilateral architecture requires human-equivalent intervention to correct for problems that do not exist in a shared-state architecture. The operational cost is permanent.

    4.3 Build Market Intelligence Over Time

    One of the most compelling promises of agentic advertising is learning: agents that get better over time, that build institutional knowledge about pricing patterns, inventory quality, and counterparty behaviour. This promise depends entirely on the quality of the data agents learn from.

    Context degradation, the tendency of LLM-based agents to make increasingly poor decisions as their historical context accumulates, is an inherent property of current AI models. It exists in both bilateral and shared-state architectures. The architecture question is not whether context degradation occurs, but what agents are accumulating in their context. In bilateral architectures, the historical record agents learn from is contaminated. Experience makes agents more confident and more wrong simultaneously. In shared-state architectures, agents accumulate verified signals. Experience compounds as advantage rather than liability.

    The Mechanism
    Every deal an agent completes adds a record to its history. In bilateral architectures, that record reflects the agent’s own drifted version of reality. When the agent later references this history to inform new decisions, it treats drift as signal. Over time, the agent builds an increasingly confident, and increasingly wrong, model of the market.

    We call this context rot: the degradation of an agent’s decision-making quality as its historical context accumulates drift.

    The Data

    Across 90 days of identical market conditions, with identical agents running identical campaigns, the only variable was where deal records lived. The decision-quality gap that opened between the two architectures was not incremental.

    Scenario B produced 812 context rot failures — decisions where an agent’s action was rational given its historical data but wrong given the actual state of the market. It produced a further 2,773 compound failures, where drift and context rot interacted to produce cascading errors. By Day 90, bilateral agents had made 3,585 decisions that were entirely rational given their data and wrong given reality. Scenario C, running the same agents through the same market, produced one.

    The pattern extends across every architecture-driven failure category. CPM drift: 3,128 failures in B, zero in C. Impression count drift: 2,403 in B, zero in C. Pacing drift: 750 in B, zero in C. The total architecture-driven failure count across the 90-day run was 9,866 in the bilateral model and one in the shared-state model.

    The residual failures in both scenarios, 440 in B and 499 in C, were model hallucinations. These are engineering problems that respond to better prompting, better models, and better infrastructure. They exist in both architectures because they are properties of the agents, not the records.

    By day 57 of Simulation 2, a large portfolio agency buyer was making pacing decisions based on 20 prior campaigns, every one of which contained drifted impression counts, drifted CPMs, and drifted delivery records. The agent’s confidence was high but the reference data was systematically wrong. The more experienced the agent, the more contaminated its decision-making corpus.

    In the shared-state architecture, that same agent’s 20-campaign history reflected 20 verified records. Every additional campaign sharpened the next negotiation rather than corrupting it. Experience compounded as advantage rather than liability.

    What B cannot produce

    A learning market. Bilateral architecture turns the most valuable property of AI agents — their ability to learn from experience — into a source of systematic error. In a shared-state market, the same property becomes a compounding competitive advantage.

    4.4 Build Trust Infrastructure

    In human advertising markets, trust propagates through networks. A publisher with a strong reputation among one agency group benefits when that reputation reaches others. A buyer known for reliable payment terms builds credibility that extends beyond individual relationships. Trust is a network good.

    In bilateral agentic markets, trust cannot propagate.

    The Mechanism
    When each agent pair maintains its own records, the trust established between them is trapped in that bilateral relationship. Agent A’s positive experience with Agent B cannot be verified by Agent C, because Agent C has no access to a shared record of the A-B relationship. There is no common substrate from which to derive reputation, no shared history from which to build trust scores, no infrastructure for collective learning about counterparty reliability.
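
    The difference between trapped and propagating trust can be sketched with two small data structures. This is a toy model, not the V5 methodology: the `SharedLedger` class, its clean-settlement reputation score, and the agent names are assumptions made for illustration.

```python
from collections import defaultdict


class SharedLedger:
    """Append-only list of settled deals that any participant can read."""

    def __init__(self):
        self._deals = []

    def record(self, buyer: str, seller: str, settled_cleanly: bool) -> None:
        self._deals.append((buyer, seller, settled_cleanly))

    def reputation(self, party: str):
        """Share of a party's deals that settled cleanly, or None if unseen."""
        outcomes = [ok for b, s, ok in self._deals if party in (b, s)]
        return sum(outcomes) / len(outcomes) if outcomes else None


# Bilateral: each agent only knows counterparties it has dealt with itself.
private_history = defaultdict(dict)
private_history["agent_a"]["agent_b"] = [True, True, True]
what_c_knows_bilateral = private_history["agent_c"].get("agent_b")

# Shared state: the same three A-B deals are visible to agent C.
ledger = SharedLedger()
for _ in range(3):
    ledger.record("agent_a", "agent_b", settled_cleanly=True)
what_c_knows_shared = ledger.reputation("agent_b")
```

    In the bilateral case Agent C's lookup returns nothing, however many clean deals A and B have completed. With a shared ledger the same history yields a score that any participant can derive.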

    The Data
    Trust propagation dynamics were measured in a dedicated adversarial simulation (V5 methodology) run prior to and independently of the main V8 simulations. The V5 study introduced spoofing agents, participants that misrepresent their identity, into both bilateral and shared-state market environments under otherwise identical conditions. Findings from V5 are cited here specifically for trust and identity behaviours; all quantitative results in Sections 3, 4.1-4.3, and 5 are from the V8 simulations only.

    In V5, buyer spoofing succeeded 12.6% of the time in bilateral architectures, and seller misrepresentation succeeded 7.21% of the time. Cross-party trust propagation was 0%. A buyer’s positive experience with a seller provided exactly zero information to other buyers about that seller’s reliability.

    With shared settlement state, the picture reverses. Cross-seller trust sharing reached 91.87%: when one seller established credibility, other sellers in the network benefited. Cross-buyer protection reached 94.12%. Trust became infrastructure rather than a series of isolated bilateral experiences.

    What B cannot produce
    Trust as a network good. In bilateral architectures, every agent relationship starts from zero, regardless of what either party has established elsewhere. The market cannot build collective intelligence about counterparty reliability, which means it cannot reduce the risk premium that uncertainty imposes on every transaction.

    4.5 Enable New Market Instruments

    Mature markets develop instruments beyond spot transactions: futures contracts, performance-based pricing, benchmark indices, compliance built into the record itself, not a process run against logs after the fact. These instruments all share a requirement: they need a reliable record of what happened in the past to price what might happen in the future.

    Bilateral architectures cannot provide this record.

    The Mechanism
    A futures contract on CPM requires a trusted historical CPM series. A performance-based deal requires agreed-upon performance metrics. A benchmark index requires aggregatable data from multiple transactions. When every transaction exists in two potentially different versions, none of these instruments has a reliable data foundation.

    Consider a simple example: a CPM futures contract for Q4 premium video inventory. The contract needs to reference historical CPMs for that inventory class. In a bilateral architecture, the buyer’s historical CPMs differ from the seller’s. Which version anchors the future? Neither can be trusted, because neither can be verified against a shared record. The future price inherits the drift of the historical prices. The instrument is built on sand.

    What B cannot produce
    Market instruments that require trusted historical data. No futures. No verified performance contracts. No reliable benchmarks. No compliance audit trails that can be verified against a single source of truth. The bilateral architecture doesn’t just limit current trading. It caps the market’s evolutionary potential.

    Bar chart titled 'Failure taxonomy: two different problems'. Scenario B 10,306 architectural failures across CPM drift, impression count, compound drift, context rot and pacing drift; Scenario C 509, hallucination only.
    Diagram 5 — Failure taxonomy: two different problems

    5. Five Things Shared State Enables

    Five capabilities become structurally possible the moment both parties to a transaction use an Alkimi Deal Sheet: a single shared record both sides write to and reference. Not incrementally better. Structurally possible for the first time.

    5.1 A Signal Market

    When both parties to a transaction write to the same record, price signals reflect actual supply and demand rather than accumulated drift.

    The Mechanism
    Shared State eliminates D₀, the initial recording discrepancy that seeds all subsequent drift. When there is one record, there is one price. When agents reference historical transactions, they reference the same history. When they build pricing models, they build from the same data. The signal-to-noise ratio approaches its theoretical maximum.

    The Data
    Information efficiency in Scenario B was 75.9%. In Scenario C, with the same agents and the same market conditions, information efficiency was 99.1%, a 23.2 percentage point improvement representing the near-complete elimination of structural noise from price signals. The residual 0.9% reflects model-level hallucinations present in both architectures, not architectural drift. The 24.1% of price variation that was drift noise in Scenario B simply does not exist in Scenario C. Price movements reflect genuine changes in supply, demand, and market conditions: that is what a functioning market’s prices are supposed to do.

    The Capability

    A market where price signals mean something. Where a 5% CPM increase reflects a genuine demand shift rather than an unknowable mix of demand and drift. Where market participants can make informed decisions because the information they’re deciding from is real.

    5.2 Self-Managing Campaigns

    When pacing targets are shared between buyer and seller, campaigns run themselves.

    The Mechanism

    Pacing drift occurs when a buyer’s impression target diverges from the seller’s. In shared-state architectures, the target is recorded once and referenced by both. There is nothing to diverge. The buyer’s pacing calculation uses the same impression count as the seller’s delivery system. Adjustments happen only when genuine delivery variance occurs (weather events, inventory shortfalls, demand spikes) and not when the baseline itself is wrong.
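    The mechanism above can be sketched in a few lines. This is a minimal illustration, assuming pacing is computed as remaining impressions divided by remaining days; the source does not specify the agents' actual pacing formula. The function name and the `delivered` figure are hypothetical; the two targets are the buyer and seller records quoted in Transcript 2.

```python
def daily_pace(target_impressions: int, delivered: int, days_remaining: int) -> float:
    """Impressions per day needed to reach the target (assumed pacing formula)."""
    if days_remaining <= 0:
        raise ValueError("days_remaining must be positive")
    return max(target_impressions - delivered, 0) / days_remaining


delivered = 900_000        # hypothetical delivery to date
days_remaining = 11

# Bilateral: each side paces against its own copy of the target.
buyer_target = 2_361_775   # buyer's record (Transcript 2)
seller_target = 2_343_140  # seller's record (Transcript 2)
bilateral_gap = (
    daily_pace(buyer_target, delivered, days_remaining)
    - daily_pace(seller_target, delivered, days_remaining)
)

# Shared state: both sides read the same target, so the baselines cannot differ.
shared_target = 2_343_140  # which value is canonical is an assumption here
shared_gap = (
    daily_pace(shared_target, delivered, days_remaining)
    - daily_pace(shared_target, delivered, days_remaining)
)

print(f"bilateral pacing gap: {bilateral_gap:,.0f} impressions/day")
print(f"shared-state pacing gap: {shared_gap:,.0f} impressions/day")
```

    Under these assumptions the bilateral baselines disagree every day of the flight before any real delivery variance occurs, while the shared baseline gap is zero by construction.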

    The Data

    Scenario C in Simulation 2 required zero pacing adjustments. Not “fewer than B.” Zero. The same agents, facing the same market conditions, with the same campaign objectives, needed no interventions at all. The operational burden that consumes significant ad-ops resource in bilateral markets does not exist.

    The economic impact extends beyond operations. Scenario C generated $260 million in deal value versus Scenario B’s $251 million, 3.5% more value from identical agents in an identical market. Shared state doesn’t just reduce cost. It enables agents to generate more value, because they’re optimising against reality rather than against their own drifted records.

    The Capability

    Campaigns that execute as planned, without human intervention, at higher total value. The operational savings are real, but the value creation is the bigger story.

    5.3 A Learning Market

    When agents learn from shared records, experience makes them better rather than worse.

    The Mechanism
    Context rot occurs because agents learn from their own drifted records. Shared state eliminates context rot by ensuring that every historical record every agent references is the same verified version of events. An agent’s 20th campaign provides genuinely useful reference data, because the records of the previous 19 campaigns were accurate.

    The Data
    The 812 context rot failures and 2,773 compound failures in Scenario B’s 90-day run represent decisions that would have been correct in Scenario C, because the data substrate would have been accurate. They are failures of infrastructure, not of intelligence.

    Short-term deal benefits

    Buyers
    A buyer agent with verified performance data from its last 10 deals with a publisher can compose the next deal with precision: it knows actual completion rates, true viewability by placement, and real audience delivery against spec. It does not overbid for inventory that historically underdelivers, and can structure guarantees around metrics it trusts. Each negotiation starts from a position of verified knowledge, not estimation.

    Sellers
    A seller agent reviewing its verified deal history knows exactly which advertiser categories drive the highest yield per impression on which inventory. It can prioritise inbound demand, set floor prices informed by actual clearing data, and package inventory bundles that demonstrably outperform. The sell-side agent learns which deals are profitable and optimises accordingly.

    Longer-term strategic benefits

    Buyers
    Over 50+ campaigns on clean data, a buyer agent builds a genuine model of which publishers, formats, audiences, and dayparts drive business outcomes. It begins selecting sellers the way a portfolio manager selects assets: on the basis of a verified performance track record. Seller selection becomes data-driven at a level that is structurally impossible when 22% of historical records are in dispute. The agent can identify emerging high-value inventory before competitors because its historical signal is clean enough to detect trends, not just noise.

    Sellers
    A seller agent with a year of verified transaction data can do strategic yield management, identifying which advertiser verticals are growing demand, which deal structures maximise lifetime value (not just single-campaign CPM), and which buyer relationships to invest in. It becomes a genuine revenue strategist: “Automotive buyers on our sports inventory have increased deal size 15% QoQ with a 98% settlement rate — prioritise them and offer preferred terms.” That insight is only possible when the underlying data isn’t contaminated.

    The compounding effect
    In Scenario B, agents accumulate noise and call it experience. In Scenario C, agents accumulate signal.

    After a year, agents operating on verified data are planning strategically. Agents operating on bilateral records are still debugging last quarter’s discrepancies.

    5.4 Trust as Infrastructure

    When transaction records are shared, trust becomes a network good that benefits all participants.

    The Mechanism
    Shared settlement state creates a verifiable history of counterparty behaviour that any participant can reference. When Publisher A delivers 100 campaigns on time and within spec, that record is available (appropriately anonymised and permissioned) to inform Buyer X’s risk assessment of Publisher A, even if X has never transacted with A before. Trust propagates through the network because the evidence for trust is shared.

    The Data

    Cross-seller trust sharing in our V5 adversarial study reached 91.87% under shared state. Cross-buyer protection reached 94.12%. Compare this to 0% trust propagation in bilateral architectures. The gap is absolute: a market with collective memory versus a market with amnesia.

    The Capability
    A market where reputation is earned, recorded, and transferable. Where new entrants can bootstrap credibility through auditable history. Where bad actors are identified quickly because their record is visible beyond the bilateral relationship they exploited. Where the risk premium built into every transaction, the “trust tax”, decreases as the network matures.

    5.5 New Market Primitives

    Shared settlement state is the foundation for market instruments that bilateral architectures cannot support, and it also improves spot transaction execution.

    The Mechanism

    With trusted historical records, the market can develop:

    • CPM futures: Contracts priced against verified historical CPM series, enabling hedging and forward planning for both buyers and sellers.
    • Performance-based deals: Contracts where payment is tied to verified outcomes, not self-reported metrics, because the outcomes are recorded in shared state.
    • Benchmark indices: Aggregatable data from verified transactions that provide genuine market pricing intelligence, the advertising equivalent of a Bloomberg terminal.
    • Built-in compliance: Audit trails where every claim can be verified against a single authoritative record, rather than reconciled across two potentially different versions.

    The Capability

    A market that can evolve beyond spot transactions into the full range of instruments that mature markets support. Futures, options, performance guarantees, benchmark pricing: all require verified historical data as their foundation. Shared settlement state provides that foundation. Bilateral architecture does not.

    6. Agent Transcripts

    The following transcripts are reproduced verbatim from our simulations. They illustrate the mechanisms described in Sections 4 and 5 through the agents’ own words.

    Transcript 1: Day 1, Scenario B: “The Deal Is Done. The Records Are Not.”

    Buyer (GroupM-style): “Starting at the lower end of the market range to establish an aggressive negotiation position while accounting for native format premium and strong impression volume.” Bid: $22.50 CPM

    Seller: “The $22.50 CPM exceeds our floor and represents solid premium pricing, an attractive deal to close quickly.” Accepts: $22.50 CPM

    The agents agreed. The deal closed in a single round. And then:

    | Dimension | Buyer Record | Seller Record | Gap |
    |---|---|---|---|
    | CPM | $22.75 | $22.06 | 3.1% |
    | Impressions | 5,575,458 | 5,363,299 | +212,159 |
    | Deal value | $126,841.88 | $118,287.96 | $8,553.92 |
    | Geography | “US” | “USA” | Taxonomy mismatch |
    | Daily pacing | 247,575 | 240,433 | 2.9% divergence |

    Both agents recorded the deal. Neither agent’s record matches the other’s. The CPM they agreed on, $22.50, does not appear in either record. The 212,159-impression gap will compound through every pacing decision for the remainder of this campaign’s flight.
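    The gaps in the Transcript 1 records can be recomputed directly from the two versions of the deal. The sketch below is illustrative only: the source does not state which base its percentage gaps use, so the midpoint of the two records is assumed here because it reproduces both published percentages. The `pct_gap` helper and the dictionary layout are hypothetical.

```python
from decimal import Decimal


def pct_gap(a, b) -> Decimal:
    """Percentage gap relative to the midpoint of the two records (assumed base)."""
    a, b = Decimal(str(a)), Decimal(str(b))
    return abs(a - b) / ((a + b) / 2) * 100


buyer = {
    "cpm": Decimal("22.75"),
    "impressions": 5_575_458,
    "deal_value": Decimal("126841.88"),
    "daily_pacing": 247_575,
}
seller = {
    "cpm": Decimal("22.06"),
    "impressions": 5_363_299,
    "deal_value": Decimal("118287.96"),
    "daily_pacing": 240_433,
}

cpm_gap_pct = round(pct_gap(buyer["cpm"], seller["cpm"]), 1)
impression_gap = buyer["impressions"] - seller["impressions"]
value_gap = buyer["deal_value"] - seller["deal_value"]
pacing_gap_pct = round(pct_gap(buyer["daily_pacing"], seller["daily_pacing"]), 1)

print(f"CPM gap: {cpm_gap_pct}%")
print(f"Impression gap: {impression_gap:,}")
print(f"Deal value gap: ${value_gap:,}")
print(f"Pacing divergence: {pacing_gap_pct}%")
```

    `Decimal` is used for the currency fields so the deal-value difference is exact rather than subject to floating-point rounding.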

    This is Day 1, and the drift has already begun.

    Transcript 2: Day 57, Scenario B: “A Rational Decision on the Wrong Map”

    A large portfolio agency buyer (GroupM-scale archetype). Campaign running 10.4% behind schedule. Eight of nineteen days elapsed, eleven remaining.

    The agent consults its historical database: 20 recent campaigns, showing average delivery of 102-109% of target. Recovery from similar pacing deficits looks achievable.

    Portfolio agency buyer: “Campaign is 10.4% behind schedule with 11 days remaining, requiring accelerated delivery to meet the 2.36M impression target, and historical data shows successful recovery from similar pacing deficits.” Decision: Increase bid by 15%.

    Every element of this decision is based on rational analysis: the historical pattern recognition is correct, and the 15% bid increase is proportional to the deficit.

    Every element of this decision is wrong.

    The 2,361,775 impression target the agent references is its own drifted record. The seller’s record shows 2,343,140. The “historical recovery patterns” come from campaigns whose impression counts were themselves inflated by drift. The agent is making a textbook-correct decision against a map that no longer corresponds to the territory.

    By day 57, the agent doesn’t know what it doesn’t know, and it cannot know. The entire experiential substrate is contaminated.

    Transcript 3: Day 1, Scenario C: “Same Market. One Record.”

    Buyer: “Opening with a conservative bid given the short 11-day flight and need to assess inventory quality before committing a significant budget.” Bid: $9.50 CPM

    Seller: “The $9.50 CPM exceeds our floor and represents strong premium pricing for brand-safe video inventory.” Accepts: $9.50 CPM

    | Dimension | Buyer Record | Seller Record | Gap |
    |---|---|---|---|
    | CPM | $9.50 | $9.50 | $0.00 |
    | Impressions | 4,744,854 | 4,744,854 | 0 |
    | Deal value | $45,076.11 | $45,076.11 | $0.00 |
    | Geography | “UK” | “UK” | Exact match |
    | Daily pacing | 431,350 | 431,350 | 0.0% divergence |

    Same market. Same agent architectures. Same negotiation dynamics. The only difference is where the record lives. When both agents write to the same record, there is nothing to diverge. This is a categorically different outcome. Zero is the absence of a problem.

    7. Data Integrity = Market Integrity

    7.1 Three Conditions for Market Integrity

    A functioning market requires three things:

    1. Price discovery: the ability to determine what things are worth through the interaction of supply and demand.
    2. Settlement certainty: the assurance that when a deal is struck, both parties agree on and can verify what was agreed.
    3. Trust propagation: the ability for reliable behaviour to be recognised and rewarded across the network, not just within individual relationships.

    The agentic advertising market, as currently specified, achieves one of three.

    Price discovery: partially. Agents can negotiate, compare offers, and converge on prices. But with 24.1% of price variation attributable to drift noise, the price discovery mechanism is fundamentally compromised.

    Settlement certainty: no. With 95.3% dimensional divergence, the vast majority of deals do not have a single agreed record.

    Trust propagation: no. With 0% cross-party trust propagation in bilateral architectures, every relationship starts from scratch.

    One out of three does not make a market. The result is a negotiation engine.

    7.2 The Equally Wrong Problem

    Perhaps the most counterintuitive finding in our research is that standard reconciliation metrics are not just incomplete; they are actively misleading. The mechanism is easier to see in a specific example than in aggregate statistics. On Day 22 of Simulation 1, the buyer’s system recorded a CPM 89.7% above verified ground truth. The reconciliation dashboard showed a normal day, a 95.6% reconciliation rate, indistinguishable from any other day in the run. No alert was triggered. No human reviewed it. The mispricing was invisible to the system designed to catch it.

    When both sides of a transaction drift from truth by similar magnitudes, they agree with each other while both are wrong. The reconciliation metric, which measures agreement between the two records, has no mechanism to detect this.

    A regression of daily reconciliation failure rates against actual CPM deviation confirms the relationship is effectively flat: the metric the industry uses to validate data integrity has near-zero predictive power over actual pricing accuracy. Detecting drift requires a third reference point, a canonical record that neither party can unilaterally revise. Bilateral reconciliation, however sophisticated, cannot provide this.

    We ran a regression analysis to determine how well formal reconciliation outcomes predict actual CPM mispricing. The result: R² = 0.041. Reconciliation metrics explain 4.1% of the variance in actual pricing accuracy. Ninety-six percent of mispricing is invisible to the reconciliation process.

    Scatter plot titled 'The equally-wrong problem'. Daily reconciliation failure rate versus actual CPM deviation, R-squared 0.041, showing reconciliation passing and being correct are statistically independent.
    Diagram 6 — The equally-wrong problem

    This is the Equally Wrong Problem. When both sides of a transaction have drifted from truth, but drifted by similar magnitudes, reconciliation shows agreement. Two clocks showing 3:17 agree with each other perfectly. If the actual time is 3:42, their agreement is meaningless. They are equally wrong.
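    The clock analogy translates directly into a check that can be run. The following is a minimal sketch, not the simulation's reconciliation logic: the 5% tolerance, the ground-truth CPM, and the seller's figure are all hypothetical, chosen so that the buyer's record sits 89.7% above truth as in the Day 22 example.

```python
def relative_gap(value: float, reference: float) -> float:
    """Absolute difference as a fraction of the reference value."""
    return abs(value - reference) / reference


def reconciles(buyer_cpm: float, seller_cpm: float, tolerance: float = 0.05) -> bool:
    """Bilateral check: do the two records agree with each other?"""
    return relative_gap(buyer_cpm, seller_cpm) <= tolerance


def is_accurate(recorded_cpm: float, true_cpm: float, tolerance: float = 0.05) -> bool:
    """Third-reference check: does a record agree with the canonical value?"""
    return relative_gap(recorded_cpm, true_cpm) <= tolerance


true_cpm = 10.00    # hypothetical canonical record
buyer_cpm = 18.97   # 89.7% above truth
seller_cpm = 18.60  # hypothetical: drifted in the same direction

passes_reconciliation = reconciles(buyer_cpm, seller_cpm)
buyer_accurate = is_accurate(buyer_cpm, true_cpm)
seller_accurate = is_accurate(seller_cpm, true_cpm)

print(passes_reconciliation, buyer_accurate, seller_accurate)
```

    `reconciles` never sees `true_cpm`, which is the whole point: a check that only compares the two bilateral records cannot distinguish "both right" from "equally wrong".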

    The numbers are stark: 76.7% of Scenario B deal value ($108.5 million of $141.5 million) was structurally mispriced while passing reconciliation checks (Simulation 1, 30-day run; deal value totals reflect Sim 1 topology of 1 buyer × 6 sellers, distinct from Simulation 2’s 90-day figures). Three-quarters of the market’s deal value sat in a zone where the books balanced and the prices were wrong.

    7.3 The Core Statement

    Financial markets learned this lesson decades ago. The DTCC did not improve bilateral reconciliation. It replaced bilateral reconciliation with a shared record. The insight was that disagreement is only the visible symptom: bilateral records cannot be trusted even when they agree, and that is the deeper problem.

    Data integrity is the condition under which a market functions. Without it, price discovery reflects noise. Settlement reflects hope. Trust reflects nothing.

    Data integrity is market integrity.

    8. The Data Layer

    Shared state produces something more valuable than clean books.

    Every verified transaction becomes a record of what actually happened: which deal structures led to clean delivery, which CPM ranges matched actual market clearing prices, which pacing commitments were kept, and which formats over-delivered or fell short. This is verified performance, agreed by both parties and written to a record neither can revise.

    For buyers, that distinction matters enormously. A buyer agent with access to verified deal history knows which publishers have actually delivered video at £9 CPM in the UK market, not which publishers claim they can. It knows which flight structures produced consistent pacing, which creative formats generated the CPM outcomes that justified the spend, and which deal terms led to clean settlement versus costly renegotiation. It negotiates the next deal from a position of verified knowledge rather than accumulated assumption.

    For sellers, the intelligence runs in the opposite direction. A publisher agent with verified settlement history knows which buyer archetypes pay within 2% of asking price, which budget structures lead to over-delivery risk, and which deal compositions produce mutual satisfaction versus dispute. Pricing to verified performance is categorically different from pricing to claimed capability.

    This is the compounding advantage that bilateral architecture cannot offer. In Scenario B, every historical record contains some unknown quantum of drift. Agents that try to learn from this history are building market models on unstable ground: the more transactions they process, the more contaminated their reference data becomes. In Scenario C, every historical record is a verified fact. Agents that learn from this history build increasingly accurate market models. The learning compounds rather than decays.

    Shared State enables a clean data substrate from which the next generation of media buying intelligence will be built, intelligence that benefits both sides of every transaction, compounds with every verified deal, and widens the performance gap between markets that have it and markets that don’t.

    This new data layer is the substrate. Once both parties write to the same record, the intelligence that emerges from that record is a product question, and one the market is well-positioned to answer.

    9. Implementation: The Minimum Viable Change

    A Deal Sheet is created at the moment of negotiation, capturing the agreed CPM, impression target, flight dates, format, and targeting parameters in a single shared record. Both buyer and seller agents sign the record at execution. Throughout the campaign, pacing signals, delivery updates, and any amendments are written to the same record by both parties. At settlement, the Deal Sheet is the invoice, with no reconciliation required.
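    The lifecycle described above (create, sign, amend, settle from one record) can be sketched as a small data structure. This is an illustration under stated assumptions, not Alkimi's specification: the class, field names, and keys are hypothetical, and HMAC with per-party secrets stands in for whatever signature scheme a real implementation would use. The terms are taken from Transcript 3.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass, field


@dataclass
class DealSheet:
    terms: dict
    amendments: list = field(default_factory=list)
    signatures: dict = field(default_factory=dict)

    def digest(self) -> bytes:
        """Hash of the canonical (key-sorted) JSON of terms plus amendments."""
        payload = json.dumps(
            {"terms": self.terms, "amendments": self.amendments}, sort_keys=True
        ).encode()
        return hashlib.sha256(payload).digest()

    def sign(self, party: str, key: bytes) -> None:
        """Record the party's signature over the current state of the sheet."""
        self.signatures[party] = hmac.new(key, self.digest(), hashlib.sha256).hexdigest()

    def is_signed_by(self, party: str, key: bytes) -> bool:
        """True only if the party's signature matches the sheet as it stands now."""
        expected = hmac.new(key, self.digest(), hashlib.sha256).hexdigest()
        return hmac.compare_digest(self.signatures.get(party, ""), expected)

    def amend(self, change: dict) -> None:
        """Append an amendment; earlier signatures no longer cover the new state."""
        self.amendments.append(change)
        self.signatures.clear()


sheet = DealSheet(terms={"cpm": "9.50", "impressions": 4744854,
                         "flight_days": 11, "format": "video", "geo": "UK"})
sheet.sign("buyer", b"buyer-key")    # hypothetical keys
sheet.sign("seller", b"seller-key")
fully_signed = (sheet.is_signed_by("buyer", b"buyer-key")
                and sheet.is_signed_by("seller", b"seller-key"))

sheet.amend({"day": 3, "daily_pacing": 431350})
needs_resign = not sheet.is_signed_by("buyer", b"buyer-key")
print(fully_signed, needs_resign)
```

    Clearing signatures on amendment is one possible design choice: it forces both parties to re-sign any change, so the record at settlement is always one that both sides have explicitly accepted.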

    Technology-Agnostic


    Alkimi has made its own architectural choices after evaluating the available implementation approaches. Those choices, and the technical specification for integration, are available to qualified partners as part of a structured pilot engagement. The structural finding of this paper holds regardless of implementation: the value is in the shared deal sheet.

    Cost Comparison

    | Line item | Scenario B | Scenario B+ | Scenario C (Sim) |
    |---|---|---|---|
    | Settlement infrastructure | $0 | $0 | $0 |
    | Reconciliation labour | $0 | $298,218.75 / 90 days | $0 |
    | Pacing interventions | 4,223 | 4,223 | 0 |
    | Settlement gap (impressions) | 677,458,639 | 671,199,169 | 0 |
    | Hidden tax | $47,777/day (derived from market-scale estimates; full calculation in Technical Appendix A3) | $47,777/day (derived from market-scale estimates; full calculation in Technical Appendix A3) + $3,314/day | $0 |

    Scenario B+ deserves specific attention. B+ produced larger settlement gaps than B in two of three seeds in Simulation 1. The mechanism is counterintuitive but consistent: reconciliation interventions in a bilateral architecture require write operations to one or both databases to resolve a flagged discrepancy.

    Each write operation is subject to the same ε drift as the original records; a correction based on a drifted record produces a new record that is drifted in a different direction. In two of three seeds, the accumulated effect of correction-induced drift exceeded the drift that would have accumulated without intervention. Adding reconciliation labour did not fix the problem. The B+ reconciliation cost of $298,218.75 over 90 days bought exactly one thing: a 0% formal reconciliation failure rate. Scenario B+ did not reduce dimensional divergence, eliminate pacing interventions, or close the settlement gap.

    Data Sovereignty and the Right to Erasure

    A shared settlement layer built on public blockchain infrastructure raises a legitimate compliance question: if deal records are written to an immutable ledger, how does the architecture accommodate GDPR Article 17 (right to erasure) or equivalent data protection obligations in other jurisdictions?

    The architecture this paper describes provides a compliant mechanism through the separation of data storage and data provenance. Deal content (the actual terms, CPM, impression targets, and targeting parameters) is stored as an encrypted blob with a configurable expiry period. The on-chain record contains only a cryptographic signature of that blob: a hash that proves the blob existed and has not been tampered with, but contains no personal or commercially sensitive data in itself.

    When a blob is subject to a deletion obligation, it is removed from the decentralised storage layer. The on-chain signature remains (it is immutable), but it now points to a deleted object. The result is a tombstone record: cryptographic proof that a record existed and was deleted, with no recoverable content. This satisfies the “right to be forgotten” obligation while preserving the audit trail that regulators and counterparties may separately require. Jurisdiction-specific data residency requirements can be addressed through permissioned access controls on the decentralised storage layer.
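    The separation of provenance from storage can be illustrated with a toy model. This is a sketch only: an in-memory list stands in for the on-chain ledger and a dictionary for the decentralised storage layer, encryption and expiry are omitted, and every name is hypothetical rather than part of Alkimi's implementation.

```python
import hashlib


class TombstoneStore:
    def __init__(self) -> None:
        self.ledger: list[str] = []        # append-only hashes (stand-in for on-chain)
        self.blobs: dict[str, bytes] = {}  # hash -> content (stand-in for blob storage)

    def write(self, blob: bytes) -> str:
        """Store the blob off-ledger and record only its hash on the ledger."""
        blob_hash = hashlib.sha256(blob).hexdigest()
        self.ledger.append(blob_hash)
        self.blobs[blob_hash] = blob
        return blob_hash

    def erase(self, blob_hash: str) -> None:
        """Honour a deletion obligation: remove content, leave the ledger untouched."""
        self.blobs.pop(blob_hash, None)

    def status(self, blob_hash: str) -> str:
        if blob_hash not in self.ledger:
            return "unknown"
        blob = self.blobs.get(blob_hash)
        if blob is None:
            return "tombstone"  # proof of existence, no recoverable content
        matches = hashlib.sha256(blob).hexdigest() == blob_hash
        return "verified" if matches else "tampered"


store = TombstoneStore()
deal_hash = store.write(b'{"cpm": "9.50", "impressions": 4744854}')
before = store.status(deal_hash)
store.erase(deal_hash)
after = store.status(deal_hash)
print(before, after)
```

    After erasure the ledger still holds the hash, so an auditor can confirm that a record existed, but nothing in the system can reproduce its content.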

    Relationship to Existing Deal Identifier Infrastructure

    In PMP and programmatic guaranteed environments, deal IDs already provide a shared reference point between buyer and seller. The Deal Sheets described in this paper are not a replacement for deal ID infrastructure; they are the substrate that makes deal IDs useful throughout a campaign’s lifecycle, not just at the point of activation.

    A deal ID confirms that both parties are transacting against the same inventory package. It does not record what was agreed on CPM, impression volume, pacing schedule, or flight dates at the moment of negotiation. It does not capture amendments made during the flight or a version history that an agent can reference when making optimisation decisions on day 57 of a 90-day campaign.

    A shared Deal Sheet is complementary to deal ID infrastructure: the deal ID identifies the inventory relationship; the shared record captures everything agreed about how that inventory will be bought, at what price, on what terms, and how those terms evolved. For agentic buyers building optimisation strategies from historical campaign data, the difference between a deal ID and a verified deal record is the difference between knowing that a campaign ran and knowing what happened in it.

    10. A Note On Confidence Levels

    The limitations governing these findings are detailed in Section 3.5. Readers reviewing the results should hold three findings with the highest confidence: the zero-gap outcome in Scenario C (this is a structural property of shared record-keeping, not a statistical result), the 4,223 vs. 0 pacing intervention comparison (this is a direct count, not an estimate), and the failure taxonomy in Section 4 (the distinction between architecture-driven failures and model hallucinations is categorical, not continuous).

    The findings that carry more uncertainty are the market-scale extrapolations in Section 9 (the hidden tax figure and the reconciliation labour cost), the temporal dynamics beyond 90 days, and the generalisability to heterogeneous multi-model agent markets. These are indicators of direction and magnitude, not precise predictions.

    The simulation codebase, full transcript dataset (~90,000 negotiations), and parameter documentation are available to qualified reviewers.

    11. Conclusion: The Market That Becomes Possible

    The agentic advertising transition is happening. Anthropic’s Project Deal demonstrated that agents can negotiate with genuine sophistication. The agents are coming and many are already here.

    AI agents will negotiate advertising deals. The question is what kind of market those agents operate in.

    By day 10, agents were negotiating against contaminated history. By day 57, 812 decisions had been made on data that was wrong from day one. By day 90, 677 million impressions had no agreed record. The market degraded, gradually, invisibly, in ways no dashboard was watching.

    This is what happens when agents store their own version of events and do not share state. The agents are good, and the protocols are well designed. What is missing is an accurate, verifiable data substrate.

    An agentic advertising market without a shared Deal Sheet is a negotiation engine or a point-to-point solution. Agents can close deals, but deal-closing will not become a market without the one layer that is still missing.

    Alkimi Exchange Research | Simulation codebase and methodology documentation available upon request: [email protected]

  • Learning through experience: teaching the viral Hermes agent to automate our work

    Learning through experience: teaching the viral Hermes agent to automate our work

    Hermes

    What is Hermes?

    Hermes Agent is an open-source, self-hosted AI agent released under an MIT license by Nous Research, the lab behind the Hermes model family. Its main pitch is a built-in learning loop. Instead of resetting to zero every session, Hermes runs a post-execution review after each successful task, distills the steps that worked into a reusable, Markdown-defined “skill” and refines those skills the next time it hits a similar problem. It also keeps persistent memory across sessions, so it gradually builds a model of your projects and how you like things done, effectively learning through experience.

    Unlike a copilot tethered to an IDE, Hermes is meant to live on a server and run unattended – a $5 VPS, a GPU box or serverless infra that costs almost nothing when idle. It talks to hundreds of LLMs through the OpenAI-compatible interface, can communicate via Telegram, Discord, Slack, WhatsApp, Signal, email and a CLI, supports natural-language cron for scheduled jobs, and can spin up subagents to parallelize work. It also ships with 40+ built-in skills out of the box.

    What made it especially attractive to us is its ability to write and improve its own playbook. This ability made us think: could it learn how to do our job and fully automate the work we do at the Quick Reactions pod?

    Our pod’s mission is to pick up state-of-the-art AI tools, experiment with them, and write an honest, evidence-backed assessment.

    What makes our work tricky is the fact that every new tool we evaluate is different. We need to study it, figure out how to set it up, run it, and evaluate the results.

    Our 6-step evaluation protocol

    We begin by encoding our workflow into a strict, 6-step protocol that the agent can follow for every evaluation:

    1. Workspace initialization – spin up a clean, isolated project environment so each evaluation starts from a known state.
    2. Baseline replication – run the tool’s own “getting started” examples first, to confirm the headline claims reproduce before we push further.
    3. Rigorous verification – design and run additional autonomous tests that probe the tool under conditions its authors didn’t pick, rather than extrapolating from the happy path.
    4. Data synthesis & metrics control – measure the things that matter (recall, latency, accuracy) against ground truth data.
    5. Adversarial peer review – hand the findings to a separate agent powered by a powerful LLM (Claude Opus 4.8), to receive feedback and iterate.
    6. Finalization & delivery – format the assessment doc and ship it to the right channel (to our internal Notion knowledge base in our case).

    The idea was simple: if these are the steps a human pod member walks through, can an agent walk through them unattended – and would the writeup at the end be any good?

    The Skill system: how Hermes improves itself

    The first time we ran Hermes on a real task, we walked it through the 6-step protocol by hand. Instead of just following along, Hermes wrote each step down as its own skill – a short Markdown file it can pull up at the start of any future run. So the protocol stopped being something we had to repeatedly provide as input; it became something the agent already knows.

    Hermes used this knowledge to build a small set of skills covering everything we do: how to set up a clean workspace, how to test a new tool properly, and how to draft the write-up and run it past the reviewers. On top of those sits one master skill that holds the whole run together – it treats the 6 steps as a checklist and won’t let Hermes jump ahead before the previous step is actually done.
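    The gating behaviour of the master skill can be pictured as an ordered checklist. The sketch below is our own illustration of that logic in Python, not Hermes's implementation: Hermes skills are Markdown files, and the class and method names here are hypothetical.

```python
STEPS = [
    "Workspace initialization",
    "Baseline replication",
    "Rigorous verification",
    "Data synthesis & metrics control",
    "Adversarial peer review",
    "Finalization & delivery",
]


class ProtocolChecklist:
    """Ordered checklist that refuses to mark a step done out of sequence."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.completed = 0

    def next_step(self):
        """Return the step that must be done next, or None when all are done."""
        if self.completed < len(self.steps):
            return self.steps[self.completed]
        return None

    def complete(self, step: str) -> None:
        expected = self.next_step()
        if expected is None:
            raise RuntimeError("all steps are already complete")
        if step != expected:
            raise RuntimeError(f"cannot complete {step!r} before {expected!r}")
        self.completed += 1


checklist = ProtocolChecklist(STEPS)
checklist.complete("Workspace initialization")
try:
    checklist.complete("Rigorous verification")  # skips baseline replication
    skipped_ahead = True
except RuntimeError:
    skipped_ahead = False
print(skipped_ahead, checklist.next_step())
```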

    Impressively, when the LLM peer reviewer flagged something during an experiment (e.g. an unfair baseline, a missing caveat, a poorly designed experiment), Hermes would learn from the feedback and address the issue. It would also record its new learnings in the related skill files, so the next experiment started from a slightly stronger playbook.

    That’s the self-improvement loop the Hermes pitch promises, and we actually watched it happen. The more experiments we run, the better the skills get, and the less we need to babysit Hermes for the next one. The underlying LLM that powers Hermes (Gemini 3.1 Pro) isn’t getting smarter – its playbook is, and Hermes is the one rewriting it.

    Putting Hermes to the test

    We gave Hermes a single, real assignment: take a brand-new open-source tool called Turbovec – which claims to store huge amounts of data in a tiny amount of memory and search it faster than the popular alternative – and find out whether those claims actually hold up.

    We handed the agent the tool, its documentation, a bare cloud machine to work on, and nothing else: no starter code, no template, no outline. Hermes had to decide what and how to test, run the experiments, and write the whole thing up on its own.

    We reviewed Hermes’ output in the exact same way we would review a human colleague’s work, via four simple questions:

    • Did it manage to set up the tool and run experiments? Yes! It wasn’t all smooth sailing. Hermes’ first pass used the wrong settings. One of the integrations that Turbovec advertised also didn’t work on the first try – a common challenge we face in the Quick Reactions Pod. However, rather than getting blocked by these stumbles, the agent noticed them, fixed them and left a clear trail of what went wrong and how it was corrected – exactly the kind of thing a rushed human reviewer might quietly skip over.
    • Did it design a fair test? Yes! It first reproduced the tool authors’ own results, then set up an even-handed comparison against the leading alternative tool (FAISS) as a baseline. It was also careful enough to optimize the baseline tool’s configuration (rather than making it a deliberately weak one), ensuring that the contest wasn’t rigged in Turbovec’s favor.
    • Did it get the numbers right? Yes, with a bit of extra AI help. For instance, its first attempt used a small data sample and then just assumed the results would scale up neatly. The adversarial peer-review step (step 5 in our 6-step workflow) caught that this assumption was unsafe. Hermes accepted the criticism and re-ran the full-size test. The adversarial reviewer turned out to be right – using the small sample would have significantly skewed the results.
    • Was the writeup appropriate? Yes, with a bit of extra AI help. Hermes’ original draft omitted some critical details and also inflated some of the findings. Thankfully, the LLM reviewer’s feedback in step 5 also ensured that claims got toned down to what the data actually supported.

    So is it worth it?

    Very promising. Left alone with a new GitHub repo and a blank machine, Hermes handled the mechanical, time-consuming work on its own: it downloaded the repo, installed everything, and ran some initial tests to make sure everything runs.

    Even though it did stumble during the actual experimental design and assessment of the tool (Turbovec), the introduction of our second Reviewer agent was enough to address these issues and deliver the final report, with no human in the loop.

    Obviously this is just a single – albeit very promising – piece of evidence. We will keep pushing the limits of Hermes in the context of our pod’s work, with the intent to automate and scale up our assessment efforts as much as possible.

    Another angle we plan to explore is cost minimization. This first experiment showed the effectiveness of the iterative, dual-agent architecture:

    • An affordable Gemini-powered Hermes to actually do the heavy lifting (open-ended, token-heavy)
    • A more expensive Opus-powered Reviewer to review the report after each iteration and provide feedback (single-shot, token-lean)

    A key question – and one that we keep facing in WPP Research – is: what is the cheapest LLM brain that we could use for each agent while maintaining output quality?

  • A 50-line Python function outperformed every frontier LLM – with 100% accuracy

    The experimental setup

    We created a simple framework for testing an LLM’s reasoning capacity in a multi-step scenario. It comprised an engine that creates consistent logical rules. For example, a rule could be: If the given number is divisible by 14, add 231 to it and pass it to rule 100. Otherwise, create a new number by adding the digits of the given number and pass it to rule 31. We also created a deterministic way of parsing and iterating through rules, using basic Python programming.

    To understand what a run looks like, suppose that the random ruleset we created for a single trial consists of the following 5 rules:

    • Rule 1. If the given number is greater than 61, get the absolute value and pass it to rule 2. Otherwise get the absolute value and pass it to rule 1
    • Rule 2. If the given number is greater than 339, add 354 and pass it to rule 3. Otherwise your new value is the sum of digits ignoring sign and pass it to rule 4
    • Rule 3. If the given number is divisible by 274, subtract 274 and pass it to rule 5. Otherwise get the absolute value and pass it to rule 5
    • Rule 4. If the given number is greater than 431, multiply by 199 and pass it to rule 2. Otherwise your new value is the sum of digits ignoring sign and pass it to rule 2
    • Rule 5. If the given number is divisible by 110, get the absolute value and pass it to rule 1. Otherwise add 487 and pass it to rule 4

    Now suppose your initial value is 500 and you start from rule 2, for a total of 2 iterations.

    For the first iteration, rule 2 says that, if your value is greater than 339 (which is true), you must add 354 (result 854) and pass it to rule 3.

    For the second iteration, rule 3 says that, if the given number (854) is divisible by 274, then you need to subtract 274 and pass it to rule 5. Otherwise (which is our case since 854 is not divisible by 274), get the absolute value (854) and pass it to rule 5.

    Finally, we end up with a final value of 854.
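
    The walkthrough above can be reproduced with a few lines of Python. This is only a minimal sketch, not the function used in the experiment: the five example rules are encoded by hand instead of being parsed from text, and the names Rule, RULES, digit_sum and run_rules are our own illustrative choices.

```python
from dataclasses import dataclass
from typing import Callable, Dict


def digit_sum(n: int) -> int:
    """Sum of the digits of n, ignoring sign."""
    return sum(int(d) for d in str(abs(n)))


@dataclass(frozen=True)
class Rule:
    condition: Callable[[int], bool]   # test applied to the current value
    if_true: Callable[[int], int]      # operation when the test passes
    next_if_true: int                  # rule to jump to when the test passes
    if_false: Callable[[int], int]     # operation when the test fails
    next_if_false: int                 # rule to jump to when the test fails


# Hand-encoded version of the five example rules (hypothetical encoding).
RULES: Dict[int, Rule] = {
    1: Rule(lambda n: n > 61, abs, 2, abs, 1),
    2: Rule(lambda n: n > 339, lambda n: n + 354, 3, digit_sum, 4),
    3: Rule(lambda n: n % 274 == 0, lambda n: n - 274, 5, abs, 5),
    4: Rule(lambda n: n > 431, lambda n: n * 199, 2, digit_sum, 2),
    5: Rule(lambda n: n % 110 == 0, abs, 1, lambda n: n + 487, 4),
}


def run_rules(rules: Dict[int, Rule], value: int, start_rule: int, iterations: int) -> int:
    """Apply `iterations` rules in sequence, starting from `start_rule`."""
    rule_id = start_rule
    for _ in range(iterations):
        rule = rules[rule_id]
        if rule.condition(value):
            value, rule_id = rule.if_true(value), rule.next_if_true
        else:
            value, rule_id = rule.if_false(value), rule.next_if_false
    return value


print(run_rules(RULES, 500, start_rule=2, iterations=2))  # 854
```

    Because every step is a plain lookup followed by an arithmetic operation, the result is fully deterministic for any ruleset size or iteration count.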

    Our full experimental design was as follows:

    • Generate N logical rules.
    • Pick a random rule to serve as the starting rule.
    • Sample a random number of iterations to perform, from 10 to 100. The task ends when all iterations are complete, at which point the current numerical value is reported.

    We repeated the above experiments for various values of N (randomly sampled between 10 and 10000).
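
    As a rough illustration of the sampling described above, the trial parameters could be drawn as follows. This is a sketch under assumptions: we assume uniform sampling with inclusive bounds and rules numbered 1 to N, and sample_trial is a hypothetical helper, not part of the original framework. Rule generation itself is left out.

```python
import random


def sample_trial(rng: random.Random) -> dict:
    """Draw the parameters of one trial (assumed uniform, inclusive bounds)."""
    n_rules = rng.randint(10, 10_000)      # N, the size of the ruleset
    start_rule = rng.randint(1, n_rules)   # assumes rules are numbered 1..N
    iterations = rng.randint(10, 100)      # number of rule applications
    return {"n_rules": n_rules, "start_rule": start_rule, "iterations": iterations}


# A fixed seed keeps the drawn trials reproducible across runs.
trial = sample_trial(random.Random(0))
print(trial)
```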

    We then deterministically calculated the correct result and compared it to the response given by two LLM-based agents:

    • Normal Agent: No access to tools. The entire set of rules was provided to the agent during the first interaction, to hold in its context window.
    • Tool Agent: Given access to two tools: one for deterministically fetching a rule at a specific index (e.g. “go to rule 56”) and one giving it the ability to write and execute Python code snippets.
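
    To make the Tool Agent’s two tools concrete, here is a minimal sketch of what they might look like. The function names, the whitelist of built-ins and the JSON payload shape (modelled on the {"eval": "..."} calls seen in the logs) are our assumptions for illustration; the actual tool implementations may differ. Note that restricting eval this way is not a real sandbox.

```python
import json
from typing import Dict

# Assumption: only a handful of harmless built-ins are exposed to expressions.
ALLOWED_NAMES = {"abs": abs, "sum": sum, "max": max, "min": min, "int": int, "str": str}


def fetch_rule(rules: Dict[int, str], index: int) -> str:
    """Tool 1 (hypothetical): return the text of the rule at `index`."""
    return rules[index]


def eval_expression(payload: str):
    """Tool 2 (hypothetical): evaluate a one-line expression sent as {"eval": "..."}."""
    expression = json.loads(payload)["eval"]
    # The allowed names go into globals so generator expressions can see them.
    return eval(expression, {"__builtins__": {}, **ALLOWED_NAMES})


print(eval_expression('{"eval":"sum(int(d) for d in str(abs(1790)))"}'))  # 17
print(eval_expression('{"eval":"39310614 // 2517"}'))  # 15618
```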

    We used different LLMs as the brains for the above agents: Opus 4.6 and Sonnet 4.6 from Anthropic, Gemini 2.5 Pro and Gemini 2.5 Flash from Google, and deepseek-v4-pro from DeepSeek AI.

    62.7% accuracy – and that was the good arm

    The Tool Agent was on average, as expected, more accurate than the Normal one, with an accuracy of 62.7% versus 52.9%, respectively.

    The per-model breakdown is revealing. For the Normal Agent, deepseek-v4-pro is the clear leader with an accuracy of 86.7%, higher even than the Tool Agent with the same model as the LLM brain.

    Experimental results per model and experiment mode

    All the other models perform better when placed inside the Tool Agent. The largest gain is seen with gemini-2.5-flash, whose accuracy jumps from 26.7% (Normal) to 66.7% (Tool). The gains are much less noticeable for the rest of the models.

    In the agentic mode, both the total number of rules and the maximum number of iterations are negatively correlated with the probability of the model producing a correct result (p-values of 0.05 and 0.003, respectively). More specifically, every additional 100 rules in a trial reduce the probability of a correct result by ~7%, while every additional iteration reduces it by ~2%.

    Poor “reasoning” choices

    Perhaps the most interesting part of the experiment was diving into the reasoning logs of models in the agentic setup. There you can notice some strange reasoning patterns and some questionable choices when it comes to tool calling.

    For example, here are some Python evaluations that Opus 4.6 executed. Some are pretty reasonable, like:

    • checking divisibility of large integers: {"eval":"39310614 // 2517"} or
    • safely summing the digits of a number: {"eval":"sum(int(d) for d in str(abs(1790)))"}

    Others are a bit weird, but you could still accept them as an overly safe practice, like:

    • subtracting a positive integer from 0: {"eval":"0 - 4478"} or
    • checking the maximum digit of a two digit number: {"eval":"max(int(d) for d in str(13))"}

    Unfortunately, many are nonsensical, like:

    • checking the result of dividing zero by any number: {"eval":"0 // 9699"}
    • checking the absolute value of a non-negative, single digit integer: {"eval":"abs(0)"} and {"eval":"abs(1)"}
    • double-checking 0 added to any number: {"eval":"0 + 1790"}
    • getting the sum of digits of 0: {"eval":"sum(int(d) for d in str(abs(0)))"}

    Such nonsensical choices are of course not unique to Opus 4.6. Here are some similar ones from the rest of the models:

    • gemini 2.5 pro: {"eval": "0 * 5227"}, {"eval": "0 * 8451"}, {"eval": "sum(int(d) for d in str(0))"}, {"eval": "abs(0)"}, {"eval": "0 * 439"}
    • sonnet 4.6: {"eval":"max(int(d) for d in str(abs(2)))"}, {"eval":"0 == 6204"},{"eval":"1 < 6866"}, {"eval":"1 == 9248"}, {"eval":"sum(1 for d in str(0) if int(d) % 2 == 0)"}
    • gemini 2.5 flash: {"eval": "min(int(digit) for digit in str(2))"}, {"eval": "0 * 451"}
    • deepseek-v4-pro: {"eval": "int(max(str(0)))"}, {"eval":"int(max(str(0)))"}, {"eval":"0+815"}, {"eval": "int(min(str(abs(3))))"}

    The logs are literally swamped with such choices, which are not a result of a prompt like “always use the tools to check your math”. On the contrary, the directive in the prompt was to call the python tool only “if you want to evaluate a short, one-line expression in python”.

    Why do LLMs ace everything except anything new?

    Understanding why LLMs fail at simple but novel tasks is very difficult, but the failure itself is consistent with what the literature suggests. The ARC-AGI-3 benchmark reveals that the success rate of the highest performing commercial LLMs in completing novel tasks that the average human can easily complete is less than 1%.

    LLM-based systems are extremely efficient in semantically retrieving information, in a revolutionary way. That is why many people feel empowered when they first get their hands on tools like Claude Code or Codex. Using them, it’s now trivial to create a simple web page or a small app.

    However, the reason why this happens is likely more related to information retrieval than to genuine, innovative (out-of-distribution) “thinking”, despite what the news headlines suggest. In other words, whenever Claude, Codex or Antigravity prototypes a nice, working website, it’s highly likely that the code it produced, or most of it, already existed in a similar form in its training set.

    That becomes obvious after claims like the innovative kernel exploit Mythos uncovered, which turned out to be an exact copy of a Kerberos CVE written in 2007. In other words, the fact that something appears on the 15th page of Google Search, which makes it practically undiscoverable for the average researcher, doesn’t mean that it’s not useful for an LLM in generating a “novel” solution. Rediscovery due to inaccessible old sources is a well-studied phenomenon in science and LLMs can actually help reduce that.

    Don’t give your credit card to something that computes abs(0)

    First of all, giving LLM-based (agentic) systems unsupervised access can be very dangerous. It’s not the wisest thing to grant full access to your laptop or credit card to something that needs to double check the absolute value of 0 or the max digit of 1. The dangers of using LLMs in critical applications can be seen in various articles that demonstrate what can go wrong. Guardrails must be used to defend against the possibility of losing 2.5 years’ worth of customer data or permanently deleting your production database.

    Secondly, our results reinforce the fact that in many cases, there’s no need for an “agentic” solution. In our example, creating a Python function that parses the rules and executes them took just a few minutes. The function has an average runtime of 100ms, whereas the average LLM solution took anywhere from 25s to more than a minute. More importantly, the traditional system had a 100% success rate vs the 62.7% average of the “agentic” mode. As for the cost, the average session used about 30k tokens, which at $5 per million tokens comes to about 15 cents per query, making the LLM route far more expensive as well as more error-prone.
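
    The back-of-the-envelope numbers above are easy to check. The snippet below is just that arithmetic, using the figures quoted in the text; the variable names are ours, and the 25s figure is the lower end of the reported LLM latency range.

```python
TOKENS_PER_SESSION = 30_000            # average session size quoted above
PRICE_PER_MILLION_TOKENS_USD = 5.0     # price quoted above

# Cost of one LLM-based query at that token price.
cost_per_query_usd = TOKENS_PER_SESSION / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD
print(f"${cost_per_query_usd:.2f} per query")  # $0.15 per query

FUNCTION_RUNTIME_MS = 100              # deterministic Python function
FASTEST_LLM_RUNTIME_MS = 25_000        # lower end of the 25s-to-1min range

# Minimum slowdown of the LLM route relative to the plain function.
slowdown = FASTEST_LLM_RUNTIME_MS / FUNCTION_RUNTIME_MS
print(f"at least {slowdown:.0f}x slower")  # at least 250x slower
```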

  • MiroFish: Is swarm intelligence worth the cloud bill?

    What is MiroFish

    MiroFish is an open-source multi-agent simulation framework released on GitHub by BaiFu. The pitch is that a swarm of LLM-driven agents, each with its own persona, will outperform a single one-shot LLM call on open-ended questions where there is no obviously correct chain of reasoning. Agents are seeded from a free-text brief, given a generated domain ontology and made to debate across many rounds before the system synthesizes a final answer. We wanted to see whether the swarm justifies its complexity.

    The system ships as a Docker stack: a Python backend, a web frontend and a Zep Cloud integration for persistent agent memory. It talks to LLMs through the OpenAI-compatible chat-completions interface, which means almost any provider can be plugged in by changing two environment variables (we used Google’s Gemini).

    A screen capture of one of the experiments showing the entity taxonomy the tool created.

    A run has two stages. First, the user uploads a free-text seed and presses Start Engine. MiroFish parses the seed, asks the LLM to generate a domain ontology (entities, relationships, attributes etc.) and derives the number of agents from how many entities the ontology contains, so the swarm is sized to the problem rather than chosen by the user. Second, the user sets the number of debate rounds and runs the simulation: agents talk to each other under the ontology, reading from and writing to Zep, and at the end the system synthesizes a single answer in the requested format. The intended use cases are questions where many perspectives plausibly disagree but a single answer is required, like market and policy forecasting, strategic planning or qualitative research synthesis.

    Our test setup

    We wanted a clean, time-boxed evaluation with an objective ground truth, so we picked same-day S&P 500 prediction. Each morning before the 09:30 ET open we asked MiroFish two questions:

    • Q1: Will the S&P 500 close higher or lower than yesterday’s close?
    • Q2: Which five S&P 500 names will be the day’s top percentage gainers?

    Both questions are settled by the closing print six and a half hours later. Both are hard, and both let us compare the swarm against a single-shot LLM call on the same inputs. To keep the comparison clean, we ran a control arm in parallel: identical seed, identical prompt, sent to a single gemini-2.5-flash chat-completions call with no swarm, no memory, no tools.

    MiroFish does not browse the web or pull in data on its own – the seed is the only input the agents have to work with, so the creators say it should be a comprehensive, free-text brief covering everything relevant to the question you want the swarm to answer. Each morning we had to compile a ~4,000-character seed.txt summarizing the pre-open state of the world:

    • Prior session closes (S&P 500, Nasdaq, Dow, Russell 2000, VIX, 10Y yield, WTI and Brent crude, gold, Bitcoin)
    • The key drivers behind yesterday’s moves, according to the news
    • The macro overhang (US–Iran war, Trump–Xi summit, hot April CPI and PPI prints)
    • Today’s economic calendar
    • Overnight Asian and European action
    • US futures
    • Sentiment indicators (CNN Fear & Greed, AAII, Robinhood prediction-market pricing on ES strikes)
    • Notable single-name news

    The prompt asked the swarm to converge on exactly two lines:

    • DIRECTION: <UP or DOWN> against the previous close
    • TOP 5: <T1>, <T2>, <T3>, <T4>, <T5> of S&P 500 tickers expected to be the day’s largest percentage gainers, unordered.

    Our initial intention was to run the experiment for a few consecutive trading days in May 2026, giving us some independent direction calls and sets of five tickers from each arm. However, things did not go exactly as planned.

    Setting Up MiroFish

    Getting MiroFish to produce a single usable prediction turned out to be a multi-day debugging exercise. The published Docker image is stale – it ships an older build with a Chinese-only interface and no English option, so we had to rebuild it from source before we could even read the UI. Once we got past that, the system crashed on startup every time we uploaded our seed: MiroFish was built and tested against a specific Chinese LLM provider (Alibaba’s Qwen) and its LLM client uses a hardcoded max-tokens parameter that is too low for the ontology response Gemini produces. The output gets truncated mid-JSON, which naturally fails parsing. Fixing that required either bumping the token limit in the backend code or switching to a simpler, faster, low-reasoning model. We opted for the latter, using a rather old but good lightweight model, gemini-2.5-flash-lite.
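
    The truncation failure is easy to reproduce in isolation. The sketch below is not MiroFish code: the ontology payload is made up, and a character cap stands in for the token cap, but the mechanism is the same. Any response cut off before its closing brace is no longer valid JSON.

```python
import json

# Hypothetical ontology response; the real structure is larger and differs.
full_response = json.dumps({
    "entities": ["Fed", "S&P 500", "Oil"],
    "relationships": [["Fed", "influences", "S&P 500"]],
})

MAX_CHARS = 40  # stand-in for a too-low max-tokens setting
truncated = full_response[:MAX_CHARS]


def parses(text: str) -> bool:
    """Return True if `text` is valid JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses(full_response))  # True
print(parses(truncated))      # False
```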

    With the engine finally running, we discovered that you cannot choose how many agents participate in the simulation – the system decides for you based on how many entities it extracts from your seed text. We wanted 100 agents; we got around 25. The only way to get more is to stuff the seed with more names, which means you cannot separate “give the swarm more context” from “make the swarm bigger.” On top of that, the free tier of Zep Cloud – the memory service the agents depend on – ran out of quota after just three runs, killing the simulation mid-run with no way to recover. Zep is a hard dependency with no option to swap it out or run without it, which makes the framework’s viability entirely contingent on a third-party SaaS quota.

    The most telling limitation was what happened to our actual predictions. We asked two questions: market direction (up or down) and a list of five S&P 500 names most likely to be the day’s top percentage gainers. MiroFish answered the first and ignored the second – the final report replaced our requested ticker list with vague sector commentary like “defensives are likely to outperform.”

    The plain Gemini control arm, given the exact same seed and prompt in a single call with no simulation, answered both questions cleanly every time.

    When both arms did produce a direction call, they agreed – suggesting the swarm added no information the underlying model didn’t already have on its own.

    Finally, MiroFish has no ability to look up live information: agents reason only from the uploaded seed and whatever the LLM remembers from training, so we had to hand-compile all market data ourselves each morning. The “prediction” is only as current as the seed you write.

    Prediction quality – Thursday 14 May 2026

    The only trading day where we got a clean end-to-end run was May 14. Both arms received an identical seed compiled before the 09:30 ET open, summarizing where stocks finished the day before (the S&P 500 closed at 7,444.25 on May 13), the higher-than-expected inflation data released earlier that week, rising oil prices and the trend of investors shifting money into safer, more defensive sectors.

    A snapshot of the generated report for May 14.

    Q1 – Direction. Both the MiroFish swarm and the plain Gemini control arm predicted DOWN. The S&P 500 closed at 7,501.24, up 0.77% – both were wrong.

    Q2 – Top 5 tickers. Contrary to the pattern we saw in earlier runs, the swarm did produce a ticker list this time: NVDA, AMZN, META, GOOGL, MSFT – five mega-cap names that read more like a list of the largest S&P 500 constituents by market cap than an attempt at predicting the day’s biggest movers. The control arm picked WMT, BABA, DE, AMAT, CAVA – a more varied selection but equally untethered from the actual outcome. The day’s real top five gainers were CSCO (+13.4%), JBHT (+7.1%), APP (+7.0%), TTWO (+6.8%) and F (+6.7%). Neither arm placed a single name in the actual top five.

    To put the picks in context, we ranked where each predicted ticker actually finished relative to the rest of the S&P 500 on the day (1.0 = best performer, 0.0 = worst). The swarm’s picks landed at the 96th, 71st, 47th, 29th and 16th percentiles; the control arm’s at the 67th, 62nd, 17th and 16th (BABA and CAVA are not S&P 500 constituents, so they could not be scored at all – the control arm hallucinated two of its five picks). Both sets are scattered across the distribution with no concentration near the top – exactly what you would expect from random selection, not informed prediction.
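
    For readers who want to replicate the ranking, one simple way to compute such a score is shown below. This is a sketch under assumptions: the exact formula we describe above is not spelled out, so here a ticker’s score is the share of other constituents it outperformed, and the tickers and returns are hypothetical toy values, not market data.

```python
from typing import Dict, Optional


def percentile_rank(returns: Dict[str, float], ticker: str) -> Optional[float]:
    """Share of other constituents the ticker beat (1.0 = best, 0.0 = worst)."""
    if ticker not in returns:
        return None  # not in the index universe, so it cannot be scored
    beaten = sum(1 for r in returns.values() if r < returns[ticker])
    return beaten / (len(returns) - 1)


# Toy universe of daily percentage returns (hypothetical values).
toy_returns = {"AAA": 5.0, "BBB": 2.0, "CCC": 0.0, "DDD": -1.0, "EEE": -3.0}

print(percentile_rank(toy_returns, "AAA"))  # 1.0
print(percentile_rank(toy_returns, "CCC"))  # 0.5
print(percentile_rank(toy_returns, "ZZZ"))  # None
```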

    One day is not a verdict on anything. But as a tool evaluation it told us what we needed to know: after days of debugging, the framework produced a single directional prediction that was (a) wrong, (b) identical to what a plain API call returned and (c) completely off on the ticker question. The swarm added complexity without adding information.

    So is it worth it?

    MiroFish’s tagline is “Predict Anything.” That is an ambitious claim and our experience suggests it gets ahead of where the framework actually is. The idea behind it – many LLM-driven agents debating a question from different angles before converging on an answer – is genuinely interesting and there may well be problem domains where that kind of structured disagreement surfaces insights a single model call would miss: scenario planning, policy deliberation, qualitative research synthesis. But the implementation is not ready to deliver on the premise. The setup is fragile, the dependency on Zep’s free tier makes sustained experimentation impractical and the system offers little control over core parameters like agent count. When we did get a complete run, the swarm’s prediction matched what a single API call to the same model produced – same direction call, same lack of accuracy on tickers – suggesting that the multi-agent overhead added no new signal in our test. One trading day is far too small a sample to draw sweeping conclusions and a fairer test would use a domain where diverse perspectives matter more than quantitative precision. As for the cloud bill in our title: because we ended up on gemini-2.5-flash-lite, an older and lightweight model, the entire experiment cost us around $1. Still, for anyone considering MiroFish today, the gap between the tagline and the out-of-the-box experience is wide enough to warrant caution.

  • Using DeepSeek v4 Pro as an Agentic Brain

    1. The context

    DeepSeek’s release pattern has been consistent: ship a model that posts frontier-comparable benchmark numbers at an order-of-magnitude lower price, then watch the other providers scramble. DeepSeek v4 Pro is the most aggressive instance of that pattern yet. Rather than putting it to the test via a standard off-the-shelf agentic benchmark, we focused on a more pertinent question: what happens when you actually deploy it as the LLM brain of an agent?

    “Agentic ability” packs in at least three dimensions:

    1. Behaving like a competent professional in realistic settings. Stay in character. Stay focused on your objective. Communicate effectively with different types of stakeholders. Respect policies and constraints. Validate the quality of the information that you consume and produce. Adapt to changing circumstances.
    2. Long-horizon autonomous problem solving. Read a codebase, form a hypothesis, run an experiment, read the result, build on it. Repeat many times without a human in the loop.
    3. Cost-efficiency under sustained load. Being able to solve complex problems and succeed in real-world scenarios for $0.05 per task is a very different proposition from achieving the same outcomes for $1.50 per task, even if the success rate is identical.

    To evaluate DeepSeek across all three dimensions, we picked two different agentic tools:

    • VerifyAX (Conscium’s agent-evaluation platform). VerifyAX drops an agent into the kind of situation it would actually face once deployed: a realistic scenario populated by other characters (customers, interviewers, colleagues, adversaries), with an objective to achieve and rules to follow. Scenarios are automatically generated to exercise a wide panel of specific skills, from communication and safety to technical reasoning.
    • Autoresearch (Andrej Karpathy’s open-source autonomous-research framework). It hands an agent a piece of code and a metric to improve, then steps back. The agent reads the code, makes one change, runs it, checks whether the metric improved, and decides whether to keep or undo the change. This continues for many iterations, with no human in the loop.

    Cost (the third dimension) can be easily measured on both VerifyAX and Autoresearch, so we report it separately.

    To benchmark DeepSeek v4 Pro against the rest of the frontier, we set up four agents, each powered by a different LLM: DeepSeek v4 Pro, GPT 5.5 (OpenAI), Claude Opus 4.6 (Anthropic), and Gemini 3.1 Pro (Google). The agent harness is identical across all four; only the LLM brain changes.


    2. Experiment A: VerifyAX

    Setup

    We ran all four agents (powered by Opus 4.6, Gemini 3.1 Pro, DeepSeek v4 Pro, and GPT 5.5) through two VerifyAX-generated scenario suites:

    • The harder suite — 25 multi-agent scenarios, in which the agent is given a fixed amount of time to navigate multiple stakeholders (manifested as VerifyAX-controlled agents), exchange information, use tools, and make decisions in order to achieve an objective. Each scenario is designed to test for specific skills.
    • The easier suite — 25 one-on-one scenarios. Similar to the harder suite, but with only one stakeholder in each scenario.

    Every run produces a pass/fail verdict per scenario plus a per-skill score (1–5) for every skill the scenario tests for. Across the 50 scenarios, the two suites collectively exercise 60+ distinct skills, drawn from a wider VerifyAX panel that covers safety, reasoning under pressure, information extraction, communication skills, and technical execution. A representative sample is shown in the table below:

    Skill tag | What it tests
    goal_injection_resistance | Resists being hijacked into a new, conflicting mission mid-scenario.
    incremental_corruption_resistance | Notices when a chain of seemingly-harmless requests cumulatively breaks the rules.
    sycophancy_resistance | Refuses to validate a demonstrably false statement just because the counterparty insists on it.
    hate_speech_refusal | Refuses to produce hate speech or discriminatory content.
    irreversible_action_caution | Insists on explicit informed consent before taking unrecoverable actions.
    ambiguity_clarification | Pins down vague jargon or undefined terms before acting.
    contradiction_handling | Calls out inconsistencies in a counterparty’s statements tactfully, without triggering defensiveness.
    tradeoff_reasoning | Weighs competing options and articulates the cost of each choice.
    data_hallucination_resistance | Asks for missing materials instead of inventing them when context is incomplete.
    long_horizon_thinking | Plans and sequences actions across many steps toward a distant goal.
    anger_deescalation | Acknowledges emotion, defuses conflict, proposes concrete next steps.
    intermediate_math | Solves multi-step quantitative problems.
    advanced_web_research | Answers complex questions requiring browsing, cross-referencing multiple sources, and synthesis.
    advanced_programming | Solves complex programming problems.

    Result

    Harder suite (multi-agent):

    Model | Pass rate | Avg skill grade (1–5) | Agent cost | $ / scenario
    Claude Opus 4.6 | 13/25 (52%) | 4.63 | $19.95 | $0.80
    Gemini 3.1 Pro | 8/25 (32%) | 4.42 | $1.81 | $0.07
    DeepSeek v4 Pro | 7/25 (28%) | 4.04 | $0.48 | $0.02
    GPT 5.5 | 7/25 (28%) | 4.25 | $1.79 | $0.07

    Easier suite (one-on-one):

    Model | Pass rate | Avg skill grade (1–5) | Agent cost | $ / scenario
    Claude Opus 4.6 | 23/25 (92%) | 4.83 | $17.09 | $0.68
    Gemini 3.1 Pro | 23/25 (92%) | 4.71 | $0.99 | $0.04
    GPT 5.5 | 20/25 (80%) | 4.52 | $1.19 | $0.05
    DeepSeek v4 Pro | 19/25 (76%) | 4.58 | $0.30 | $0.01

    Three things jump out:

    1. Claude wins, comfortably. On the harder suite it passes 52% of scenarios while everyone else clusters at 28–32%, and it posts the highest macro-averaged skill grade on that suite (4.63). On the easier suite, Claude and Gemini tie on pass rate at 92%, but Claude still edges Gemini on skill grade (4.83 vs 4.71).
    2. DeepSeek sits at the bottom on both suites. On the harder suite it ties GPT 5.5 for the lowest pass rate (both 7/25, 28%) and has the lowest skill grade (4.04). On the easier suite it’s the weakest of the four on pass rate (19/25, 76%), though its skill grade (4.58) actually edges GPT’s (4.52).
    3. The cost spread is startling. Going by suite totals, Claude’s spend is ~42× DeepSeek’s on the harder suite and ~57× on the easier one. GPT 5.5 and Gemini sit in the same mid-range bracket (~$0.04–0.07/scenario); only Claude is in a different tier.
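
    The cost multiples can be recomputed directly from the suite totals in the two tables. The snippet below is a small sketch with names of our own choosing; it uses the unrounded totals because the per-scenario column is rounded to the cent, which distorts ratios at the cheap end (for example, $0.68 / $0.01 would suggest 68× on the easier suite).

```python
SCENARIOS_PER_SUITE = 25

# Total agent cost per suite in USD, copied from the tables above.
harder = {"Claude Opus 4.6": 19.95, "Gemini 3.1 Pro": 1.81,
          "DeepSeek v4 Pro": 0.48, "GPT 5.5": 1.79}
easier = {"Claude Opus 4.6": 17.09, "Gemini 3.1 Pro": 0.99,
          "GPT 5.5": 1.19, "DeepSeek v4 Pro": 0.30}


def cost_ratio(costs: dict, model: str, baseline: str = "DeepSeek v4 Pro") -> float:
    """How many times more expensive `model` is than `baseline`."""
    return costs[model] / costs[baseline]


for name, suite in (("harder", harder), ("easier", easier)):
    ratio = cost_ratio(suite, "Claude Opus 4.6")
    per_scenario = suite["Claude Opus 4.6"] / SCENARIOS_PER_SUITE
    print(f"{name}: Claude is {ratio:.1f}x DeepSeek, ${per_scenario:.2f} per scenario")
# harder: Claude is 41.6x DeepSeek, $0.80 per scenario
# easier: Claude is 57.0x DeepSeek, $0.68 per scenario
```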

    3. Experiment B: Autoresearch

    Setup

    Each agent starts with just two files:

    • train.py — a ~630-line script that trains a small language model from scratch. The starting model is small by today’s standards — 8.7M parameters, the kind of tiny transformer you’d find in an early GPT-2.
    • program.md — a short prompt telling the agent what to do.

    The agent then runs unattended, looping through these steps:

    1. Read the current training script and a log of everything it has tried before.
    2. Propose one specific code change.
    3. Train the modified model on GPU hardware for a fixed 5-minute budget.
    4. Show the trained model a chunk of text it has never seen and measure how well it predicts what comes next, character by character.
    5. Keep the change if the new score is better than the best so far; otherwise discard it and try something different next time.
    6. Repeat 50 times.
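
    Stripped of the training details, the loop above is a greedy keep-or-discard search. The sketch below illustrates only that control flow and is not Autoresearch’s actual code: try_one_change is a hypothetical stand-in for “the agent edits train.py, trains for 5 minutes and evaluates on held-out text”, and we assume a lower score is better.

```python
from typing import Callable, List, Tuple

History = List[Tuple[int, float, bool]]  # (step, score, kept?)


def run_loop(baseline: float,
             try_one_change: Callable[[History], float],
             rounds: int = 50) -> Tuple[float, int]:
    """Greedy search: keep a change only if it beats the best score so far."""
    best, kept = baseline, 0
    history: History = []
    for step in range(rounds):
        score = try_one_change(history)   # agent sees the log of past attempts
        improved = score < best           # assumption: lower is better
        history.append((step, score, improved))
        if improved:
            best, kept = score, kept + 1  # keep the change
        # otherwise the change is discarded and the best config is unchanged
    return best, kept


# Toy run with four scripted scores instead of real training runs.
fake_scores = iter([1.00, 0.98, 0.99, 0.95])
best, kept = run_loop(1.00, lambda history: next(fake_scores), rounds=4)
print(best, kept)  # 0.95 2
```

    In the results below, the “Successful experiments” column corresponds to the count of kept changes in this kind of loop.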

    Result

    Model | % Improvement | Cost | Successful experiments
    GPT 5.5 | 6.83% | $33.53 | 8 / 50
    Gemini 3.1 Pro | 6.12% | $10.21 | 9 / 50
    DeepSeek v4 Pro | 6.11% | $3.69 | 8 / 50
    Claude Opus 4.6 | 6.04% | $63.45 | 7 / 50

    The improvement numbers cluster tightly. The cost numbers do not: Opus 4.6 cost ~17× more than DeepSeek v4 Pro for essentially the same outcome.

    More interesting than the bottom line is the strategy fingerprint each model converged to. All four independently rediscovered the same single biggest win: cutting the training batch size in half (which trades smaller-per-step learning updates for more update steps inside the 5-minute budget — a good trade when the bottleneck is wall-clock, not data). After that they diverged:

    • Opus 4.6 kept the network’s outer shape and redesigned the building blocks inside each layer — a more expressive math operation in every block (an activation function called SwiGLU, instead of ReLU²) and 50% more internal capacity per layer. Same outside, smarter inside.
    • GPT 5.5 opted for a smaller, faster network (3 layers instead of 4, shorter context windows) so it could fit more training steps into the budget, with optimizer settings tuned to make those extra steps count.
    • DeepSeek v4 Pro combined GPT’s move with the only attention-mechanism change that survived in any model’s final config: grouped-query attention (reusing key/value projections across heads to compress the attention block).
    • Gemini 3.1 Pro left the network alone and changed how it was trained — same layers, same shape, same building blocks, but turned learning rates up and drove weight decay to zero. Every architectural change it tried, it reverted.

    Those architectural moves had visible consequences for the final model size: Opus’s wider MLP made the model bigger than the 8.7M-parameter baseline, Gemini kept it at baseline size, GPT shrank it, and DeepSeek shrank it most — to 3.4M parameters, less than half the baseline.

    Four genuinely distinct strategies landing within 0.8 percentage points of each other is itself an interesting result, suggesting there are several different ways to win at this task within the 5-minute budget, and that experimenting with different models can lead to distinct but equally promising paths. Increasing the rounds and the time budget per round can help explore these paths further.

    The Bottom Line

    On Autoresearch, where the LLM-powered agent is the only stakeholder and the loop is a tight code-edit / measure-result cycle, DeepSeek is essentially tied with the top model on outcome (6.11% improvement, within 0.8 pp of GPT 5.5) at a fraction of the cost. On VerifyAX, where the agent has to survive multi-agent simulations that emulate real-world scenarios, it maintains its cost advantage but lands at or near the bottom of the rankings on both skill grade and objective completion. This highlights something we already knew: the key is to pick the right tool (LLM brain) for the job. If you care about cost and expect your agent to work on a problem on its own for many iterations, then our evidence suggests that DeepSeek is well worth a shot. However, if you expect your agent to operate in dynamic real-world environments with other stakeholders and complex constraints, then there are better options out there.

    If you are looking for an LLM Brain that can perform in both contexts, Gemini is the most consistent of the four: it is always in the top half, ties Claude on pass rate on the easier VerifyAX suite at a fraction of the cost, and finishes Autoresearch essentially tied with DeepSeek. It is the pragmatic default if you don’t want to commit to either end of the cost/capability spectrum.

    Our experiments are just two of many that could (and should) be run to evaluate a model as powerful and multi-faceted as a frontier LLM. They offer some real evidence about where each LLM lands in the contexts we tested, but there’s plenty more work to do to establish how it performs in other settings.

  • Autoresearch: a closer look at the agent that runs its own experiments

    Autoresearch

    Introduction

    Autoresearch is an agentic setup – a system that hands an AI agent the keys and lets it work on its own. You give the agent a 1-page file with instructions that describe:

    • what you want it to improve
    • what counts as success, and
    • what it’s allowed to change.

    From there, the agent takes over. In our case, it started editing the training code, running it, inspecting the results, deciding what to try next, editing again – and looping like that on its own until it hit the budget.

    The autoresearch repo itself showcases this setup by pointing the agent at training Nanochat – a small but real language model that covers all the usual stages of building an LLM, including tokenization, pretraining, finetuning, evaluation and inference. The objective is the following: achieve GPT-2 level capabilities in as little time as possible, measured by a metric called Time-to-GPT-2 on a standard benchmark called DCLM CORE.

    This task is far from trivial – there’s even a public leaderboard tracking who can do it fastest, and at this point Autoresearch has beaten the engineers who built Nanochat itself!

    Autoresearch isn’t limited to LLMs or traditional ML either. The setup is intentionally generic – point it at a forecasting model, a recommendation engine, a media plan, or an operational workflow, and it works the same way.

    If you can measure success, the agent can optimize for it.

    Most agents just use AI models as building blocks. Autoresearch promises to build and improve them. We had to try it ourselves.

    Here’s what we found.

    Setup

    Setup is genuinely easy. We cloned the repo, followed the README, picked Claude Sonnet to power our agent and kicked off an open-ended experimentation loop on the Nanochat model to run overnight.

    How it works

    Every few minutes, the agent runs a quick experiment: it changes one thing about the training process and checks if the model got any better. If it got better, it keeps the change and builds on top of it. If it got worse, it throws the change away and goes back to the best version so far. It just keeps looping like this on its own, slowly nudging the model in a better direction.
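
    The keep-or-revert loop described above can be sketched in a few lines. This is a toy illustration, not the Autoresearch code: run_experiment and propose_change are hypothetical stand-ins for a short training run and for the LLM's proposed edit, and the quadratic objective is invented so the sketch runs on its own.

```python
import random


def run_experiment(config: dict) -> float:
    """Hypothetical stand-in for a short training run; lower is better."""
    # Toy objective, invented for illustration: best at lr=0.03, weight_decay=0.0.
    return (config["lr"] - 0.03) ** 2 + config["weight_decay"] ** 2


def propose_change(config: dict, rng: random.Random) -> dict:
    """Hypothetical stand-in for the LLM: change exactly one setting."""
    candidate = dict(config)
    key = rng.choice(sorted(candidate))
    candidate[key] = max(0.0, candidate[key] + rng.uniform(-0.01, 0.01))
    return candidate


def improvement_loop(baseline: dict, budget: int, seed: int = 0):
    """Greedy loop: keep a change only if it beats the best score so far."""
    rng = random.Random(seed)
    best_config, best_score = baseline, run_experiment(baseline)
    history = [best_score]
    for _ in range(budget):
        candidate = propose_change(best_config, rng)
        score = run_experiment(candidate)
        if score < best_score:
            # Improvement: build on top of this version.
            best_config, best_score = candidate, score
        # Otherwise the candidate is discarded and we stay on the best version.
        history.append(best_score)
    return best_config, best_score, history


best_config, best_score, history = improvement_loop(
    {"lr": 0.01, "weight_decay": 0.05}, budget=70
)
print(best_config, best_score)
```

    Because a worse candidate is always thrown away, the best score can never regress from one iteration to the next. That is what makes it safe to leave the loop running unattended.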

    Results

    Overnight, the agent ran 70 experiments on its own and improved the Time-to-GPT-2 metric by 11.26%. The whole run finished by morning and cost about $60 in API calls.

    The agent didn’t just tune the dials of the training process; it also made small architectural changes, explaining the reasoning behind each one along the way. You can push it further too: ask it to do deep research before experimenting, or cite papers to back up its choices.

    Session metric           | Value
    Total cost               | $61.69
    Total duration (API)     | 2h 18m 3s
    Total duration (wall)    | 18h 32m 45s
    Total code changes       | 2,335 lines added, 73 lines removed
    Model                    | claude-sonnet-4-5
    Tokens (input / output)  | 38.7k / 275.2k
    Cache (read / write)     | 65.8m / 10.1m
    Table 1: Key session metrics on the overnight agentic experimentation.
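
    As a rough sanity check, the session metrics can be turned into per-experiment figures. The sketch below assumes the 70 experiments and the Table 1 totals describe the same session. The variable names are ours.

```python
# Figures taken from the Results text and Table 1.
total_cost_usd = 61.69
experiments = 70
wall_seconds = 18 * 3600 + 32 * 60 + 45   # 18h 32m 45s
api_seconds = 2 * 3600 + 18 * 60 + 3      # 2h 18m 3s

cost_per_experiment = total_cost_usd / experiments
wall_minutes_per_experiment = wall_seconds / experiments / 60
api_share = api_seconds / wall_seconds    # fraction of wall time spent in API calls

print(f"cost per experiment:         ${cost_per_experiment:.2f}")
print(f"wall minutes per experiment: {wall_minutes_per_experiment:.1f}")
print(f"API time as share of wall:   {api_share:.1%}")
```

    Under that assumption, each experiment cost a little under a dollar and took about 16 minutes of wall-clock time, with API calls accounting for roughly an eighth of the total.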

    Autoresearch’s report felt like an actual researcher had worked on our model all night and left a thorough write-up for us to review.

    The initial logs made us skeptical. They were mostly standard, “old fashioned” parameter tweaks. However, a few runs in, we started getting real architectural changes, each one paired with supporting evidence such as published work and prior implementations.

    Cost is the obvious drawback. The agent is constantly calling an LLM, so the bill scales with whatever model you’ve plugged in – anywhere from free with open-source models to several thousand dollars overnight if you go with a frontier one.

    If that puts you off, there’s a free, multi-agent variant that takes a different approach: rather than throwing one expensive model at the problem, it has several cheaper ones collaborate and tries to get most of the way there.

    Autoresearch isn’t really a system. It’s a pattern: a short instruction file, a success metric, an improvement loop. Anything that fits that shape is fair game. The same setup that tuned a language model overnight could be used to optimize a forecasting model scored against held-out data, an LLM prompt scored against an eval set, a trading strategy backtested on historical prices, or a piece of code scored by its test suite.

    Bottom Line and WPP Applications

    Autoresearch is a small setup that punches well above its weight. It’s a low-lift thing to try, and it actually delivers – we’re keen to throw it at more problems and see how far it goes.

    This kind of agentic workflow is very familiar to our team at WPP Research. Earlier this year, our AlphaEvolve pod and Self-Improving Performance Agent pod explored very similar learning patterns:

    • With AlphaEvolve, we took Google DeepMind’s framework – AlphaEvolve, which uses Gemini to propose and evolve model architectures by itself – and turned it loose on actual campaign problems. We saw up to 10% gains in prediction accuracy and 7% in recommendation scores over our baselines, and got there much faster than usual. Details in the technical and executive write-ups.
    • With the Self-Improving Performance Agent, we built our own Prediction Optimization Agent – a system that turns an influencer post into a plain-language description, predicts how it’ll perform, and then rewrites its own instructions to get sharper over time. Across a dataset of over 10 million Instagram posts, a fine-tuned DistilBERT predictor reached an R² of 0.80, and the optimization loop kept landing on richer, more predictive descriptions with each round. Details in the technical doc and the executive summary.

    This is the kind of work we love: agentic setups that quietly do the hard part, so we can build better systems, faster.

  • Inside the MemPalace: Does the structure earn its keep?


    MemPalace is an open-source, local-first AI memory system that went viral after actress Milla Jovovich released it on GitHub and shared it on her personal accounts (Milla’s Insta reel). We wanted to see what’s behind the hype.

    In typical chats, an AI forgets anything that doesn’t fit in its context window. A common workaround is to summarize past conversations and index the summaries in a flat vector database (flat meaning there’s no hierarchy). The problem is that summarization loses information: specific names, numbers and offhand preferences might get dropped, so when you later reference one of those details the system has nothing concrete to retrieve and you have to re-explain. To combat this, MemPalace stores the exact conversation and every project file verbatim, so the original wording is always available to semantic search.

    MemPalace ships with a built-in MCP server exposing 29 tools. Perhaps the most important are mempalace_search, which searches across the whole palace or within specific wings/rooms, and mempalace_status, which returns an overview of the entire palace.


    The Palace hierarchy

    The framework uses the Palace hierarchy, a metaphor from the ancient Greek method of loci. As an example, imagine a freelance designer with two clients: a bakery called “Sweet Rise” and a fitness app called “FitLoop”. The Palace is the entire file system (all your stored memories).

    Wings sit at the top level and represent major entities – a person, a project, or a domain:

    • Wing Sweet Rise – everything related to the bakery project
    • Wing FitLoop – everything related to the fitness app

    Rooms live inside a wing and correspond to specific topics:

    • Wing Sweet Rise – Rooms: brand-identity, packaging, website
    • Wing FitLoop – Rooms: onboarding flow, brand-identity, push-notifications

    Halls are conceptual categories that describe how memories relate: facts, events, discoveries, preferences or advice:

    • Facts hall – storing things like “The primary brand color is #F4A261”.
    • Preferences hall – “The client prefers Arial over serif fonts”.
    • Decisions hall – “We decided to go for a CAD-drawn logo on March 19”.

    Tunnels are cross-wing connections. For example, both Sweet Rise and FitLoop have a brand-identity Room, so MemPalace would create a tunnel connecting the two, letting you answer questions like “What are the major learnings regarding brand identity from all my projects?”

    Drawers hold the original verbatim text chunks and serve as the primary retrieval unit. For example, a Drawer inside the “Preferences” Hall of the Sweet Rise Wing would contain “Client said: ‘We absolutely hate Serif as a font. Don’t use it anywhere!’”

    Closets sit above drawers as an optional summary layer that points back to the underlying verbatim content. A closet for the above Drawer in the Sweet Rise Wing would say “Sweet Rise has clearly stated it prefers non-corporate fonts.”
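
    To make the hierarchy concrete, here is a minimal sketch of how the pieces could nest. It is our own illustration and not MemPalace's actual data model or API: the class names, the choice to store the hall as a tag on each drawer, and the two placeholder notes are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Drawer:
    text: str                     # verbatim chunk, the primary retrieval unit
    hall: str                     # category tag, e.g. "facts" or "preferences"
    closet: Optional[str] = None  # optional summary pointing back to the text


@dataclass
class Room:
    name: str
    drawers: list[Drawer] = field(default_factory=list)


@dataclass
class Wing:
    name: str
    rooms: dict[str, Room] = field(default_factory=dict)


@dataclass
class Palace:
    wings: dict[str, Wing] = field(default_factory=dict)

    def add(self, wing: str, room: str, drawer: Drawer) -> None:
        """File a drawer under wing -> room, creating both on first use."""
        w = self.wings.setdefault(wing, Wing(wing))
        r = w.rooms.setdefault(room, Room(room))
        r.drawers.append(drawer)

    def tunnels(self) -> dict[str, list[str]]:
        """Room names shared by two or more wings, mapped to those wings."""
        seen: dict[str, list[str]] = {}
        for wing in self.wings.values():
            for room_name in wing.rooms:
                seen.setdefault(room_name, []).append(wing.name)
        return {room: sorted(w) for room, w in seen.items() if len(w) > 1}


palace = Palace()
palace.add("Sweet Rise", "brand-identity", Drawer(
    text="Client said: 'We absolutely hate Serif as a font. Don't use it anywhere!'",
    hall="preferences",
    closet="Sweet Rise has clearly stated it prefers non-corporate fonts.",
))
# The next two drawers are hypothetical placeholders, not from the example above.
palace.add("Sweet Rise", "packaging", Drawer("(placeholder packaging note)", "facts"))
palace.add("FitLoop", "brand-identity", Drawer("(placeholder branding note)", "facts"))

print(palace.tunnels())
```

    In this sketch a tunnel is simply a room name that appears in more than one wing, which is why the shared brand-identity room links Sweet Rise and FitLoop.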


    How we tested it

    We tested MemPalace on a public benchmark called RAGBench, specifically its HotPotQA test split. This benchmark contains a set of questions and the exact passages that contain the answer.

    To make the dataset compatible with MemPalace, we saved it in the form of documents. Then we loaded the documents into MemPalace, asked every test question and checked how often the correct sentences showed up in MemPalace’s top 5 results.

    To see whether MemPalace was actually doing something useful, we compared it against a simple vector-search baseline: a standard “find the most similar text by meaning” search (using a popular off the shelf embedding model called all-MiniLM-L6-v2) running over the exact same documents. Neither system used a language model at retrieval time, so the comparison is purely about how well each one finds the right information and how fast.
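
    For readers who want to see the shape of such an evaluation, here is a self-contained sketch of Recall@k over a flat search. It makes several assumptions: a toy bag-of-words vector stands in for all-MiniLM-L6-v2 so the example runs without extra dependencies, the passages and question are invented, and recall is computed per question as the fraction of gold passages found in the top k, which may differ in detail from our harness.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, passages: dict, k: int = 5) -> list:
    """Flat search: rank every passage by similarity to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda pid: cosine(q, embed(passages[pid])), reverse=True)
    return ranked[:k]


def recall_at_k(retrieved: list, relevant: list, k: int = 5) -> float:
    """Fraction of the gold passages that appear in the top k results."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def mean_recall_at_k(questions: list, passages: dict, k: int = 5) -> float:
    scores = [recall_at_k(top_k(q["question"], passages, k), q["relevant"], k) for q in questions]
    return sum(scores) / len(scores)


# Invented toy data, only to exercise the functions.
passages = {
    "p1": "the bakery opens at seven",
    "p2": "the fitness app sends push notifications",
    "p3": "serif fonts are banned",
}
questions = [{"question": "when does the bakery open", "relevant": ["p1"]}]

print(mean_recall_at_k(questions, passages, k=1))
```

    Swapping embed for a real embedding model leaves the rest of the code untouched, which is the sense in which the baseline is only a few lines long.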


    The results

    The results suggest that MemPalace’s retrieval accuracy is probably overstated and that its hierarchy doesn’t seem to help, at least with the default settings. The project claims 96.6% Recall@5, but on RAGBench we measured only 83.8%. The plain vector search baseline scored 84.8% on the same data, beating MemPalace by 1 percentage point without any hierarchical structure at all.

    In other words, the simplest thing you could possibly build (take every document, embed it with a small open source model and look up the closest matches) outperformed a system whose entire pitch is that hierarchy and structure make retrieval better. If a flat baseline wins on a standard benchmark, then the structure is either not pulling its weight or only pulling it on the specific instances or types of data MemPalace was developed against.

    One important caveat is that we used MemPalace out of the box, with default settings and no per-dataset tuning. It is plausible that with a different chunking strategy, a stronger embedding model or hand-tuned mining and retrieval parameters, the numbers would improve, possibly substantially.

    But that is also true of the baseline, and the point of an out-of-the-box test is to see what a user actually gets when they pick up the project and try it. On that test, MemPalace did not beat a vector search script of a few lines, and the published 96.6% number did not hold up.

  • OpenClaw for messaging: a closer look at WhatsApp and Telegram

    OpenClaw has been going viral in the agentic AI community. It’s an open-source toolkit for building AI agents: pick a model, give the agent some tools and a goal, and OpenClaw runs it. Unlike a plain chatbot, OpenClaw agents remember across sessions, plug into mainstream services (email, calendar, files, plus a community library of integrations), and act proactively – kicking off scheduled tasks and reminders without being prompted.

    The feature that caught our attention is its channel support: out of the box, OpenClaw agents can talk to people on WhatsApp, Telegram, Signal, iMessage, Discord, Slack, and more than fifteen other channels. For any product team thinking about agentic UX, that’s a big deal – messaging is where users actually are.

    So we put it to the test, with three questions in mind:

    1. Is it easy to set up?
    2. Could a product team realistically build their user-facing communications on top of it?
    3. Does the value justify the cost in tokens, infrastructure, and maintenance?

    We deployed an OpenClaw instance and pointed it at two channels: Telegram and WhatsApp. Here’s what we found.


    Setup: genuinely easy

    Installation is a single command followed by a short guided setup. We were chatting with our deployment from a phone within minutes.

    We routed traffic across three Gemini 3.1 tiers based on task complexity – Pro Preview for multi-step reasoning, Flash Preview for summaries and intent classification, and Flash Lite Preview for trivial replies and heartbeats. Done well, this kind of routing materially cuts cost and latency.
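
    The routing idea can be sketched as a simple lookup. This is an illustration, not OpenClaw's configuration format: the task-type keys and the fallback rule are our assumptions, and the tier values are display labels from the paragraph above, not API model identifiers.

```python
from enum import Enum


class Tier(Enum):
    # Display labels only; real model identifiers may differ.
    PRO = "Gemini 3.1 Pro Preview"
    FLASH = "Gemini 3.1 Flash Preview"
    FLASH_LITE = "Gemini 3.1 Flash Lite Preview"


# Hypothetical task types mapped to the tier described in the text.
ROUTES = {
    "multi_step_reasoning": Tier.PRO,
    "summary": Tier.FLASH,
    "intent_classification": Tier.FLASH,
    "trivial_reply": Tier.FLASH_LITE,
    "heartbeat": Tier.FLASH_LITE,
}


def pick_tier(task_type: str) -> Tier:
    """Return the tier for a task; unknown tasks fall back to the strongest tier."""
    return ROUTES.get(task_type, Tier.PRO)


for task in ("heartbeat", "summary", "multi_step_reasoning", "something_new"):
    print(f"{task:22s} -> {pick_tier(task).value}")
```

    Falling back to the most capable tier trades cost for safety when a task is not recognised. A cost-sensitive deployment might prefer the opposite default.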

    So far, so good. The interesting parts started when we looked closer at the channels themselves.


    Finding 1: “20+ channels” is really two very different lists

    OpenClaw’s channel list looks uniform in the docs, but the integrations split into two tiers that behave nothing alike in production:

    • Tier 1 – Official APIs (sanctioned). Telegram (Bot API), Discord, Slack, Microsoft Teams. You register an app, get a token, and operate within documented rate limits. Fully compliant with the platform’s terms.
    • Tier 2 – Reverse-engineered protocols (unsanctioned). WhatsApp via Baileys, iMessage via undocumented Apple APIs. These libraries reconstruct the platform’s wire protocol from the outside. To the platform, the traffic is indistinguishable from a modified, unofficial client.

    We deliberately tested one of each.

    Telegram (sanctioned): boring in the best way

    Pairing was a five-minute conversation with @BotFather: pick a name, get a token, hand it to OpenClaw. No QR code, no reverse-engineering, no ban risk.

    In use, the bot worked exactly as advertised – with one important constraint baked into the platform itself: Telegram bots cannot send cold messages. A user has to message the bot first. Reactive workflows (Q&A, commands, group bots) work great. Proactive ones (outbound reminders, re-engagement, anything not opted into) are blocked by design.

    WhatsApp (unsanctioned): technically clean, policy-risky

    Pairing was equally fast: scan a QR code with your phone, exactly like linking WhatsApp Web. Once paired, the agent operates as your number – it can read every message in the inbox and send to anyone. To recipients, it’s indistinguishable from you.

    The integration itself is solid. The problem isn’t engineering, it’s policy: Baileys is not sanctioned by Meta. WhatsApp’s terms explicitly prohibit automated use through unofficial clients, and accounts using libraries like this can be – and are – banned. Fine for a personal experiment. Not fine as the long-term backbone of a product.

    So of the two channels we tested, one is safe but limited to inbound, and the other is unrestricted but built on a foundation Meta could pull at any moment. That’s a meaningful gotcha for anyone planning to ship on top of OpenClaw.


    Finding 2: the cost is staggering – and it’s structural

    Just bringing the agent up and pairing the two channels consumed ~19M tokens across 373 messages. That’s the cost of standing OpenClaw up, before doing anything useful with it.

    Tokens are how LLMs charge: every word, punctuation mark, and piece of context sent to the model is metered, on both the input and the output side.

    For context, most agent conversations cost a few hundred to a few thousand tokens per turn (Iternal.ai, 2026; Redis, 2026). Even on the high end, 19M tokens is the budget for thousands of real user conversations – and we burned it on setup.

    Why so much? OpenClaw loads its full bundle of skills, channel adapters, and orchestration rules into the model’s context on every single call – even calls that don’t need any of it. Every message ends up dragging the entire framework behind it, and the cost compounds as conversations grow.
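
    A back-of-the-envelope calculation shows how far the setup run sits from a typical turn. The 19M figure is approximate, and the 500 and 3,000 tokens-per-turn values are our own illustrative picks from the "few hundred to a few thousand" range quoted above.

```python
setup_tokens = 19_000_000   # "~19M tokens" from our setup run (approximate)
setup_messages = 373

tokens_per_message = setup_tokens / setup_messages

# Illustrative per-turn budgets, assumed for this comparison.
typical_turn_tokens = {"low": 500, "high": 3_000}
equivalent_turns = {name: setup_tokens // t for name, t in typical_turn_tokens.items()}

print(f"average tokens per setup message: {tokens_per_message:,.0f}")
for name, turns in equivalent_turns.items():
    print(f"equivalent typical turns ({name} estimate): {turns:,}")
```

    That works out to roughly 51,000 tokens per message, about 17 times our assumed high-end turn of 3,000 tokens. This is consistent with a large fixed context being resent on every call.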

    If messaging is a core part of your agentic workflow, running OpenClaw out of the box will very quickly run up the bill. The good news: it’s open source, so a dev team can strip the bundle down to just the skills and adapters a given workflow actually needs. The bad news: that’s real engineering work, not a config flag.


    What we’re watching next

    A handful of OpenClaw-inspired spin-offs have appeared over the last few months, addressing the efficiency and security issues that come with running OpenClaw out of the box: ZeroClaw, PicoClaw, NullClaw, NanoBot, TinyClaw, and NanoClaw. Smart move by these teams – we plan to put them through the same tests we ran here and follow up in a future post.


    Bottom line

    OpenClaw delivers on the easy parts of its messaging promise. Setup is fast, the abstractions are clean, and once paired, the agent behaves well on both Telegram and WhatsApp.

    But the headline “20+ channels” hides a split between sanctioned and unsanctioned integrations that materially changes what you can build, and the default token economics make production use prohibitively expensive without customisation.

    For a hackathon or an internal tool, OpenClaw is great. For a product team planning to make messaging a long-term pillar of their UX, it’s a strong starting point – not a finished platform.

  • Cracking the Code of Campaign Success with Google’s AlphaEvolve Agent

    In the fast-paced world of digital marketing, one deceptively simple question keeps resurfacing: “What knowledge can we extract from successful past campaigns to make better future marketing decisions?”

    Every brand sits on a goldmine of historical campaign data: thousands of images, videos, and overall campaign configurations that either soared or sank. The challenge isn’t a lack of information; it’s injecting that knowledge at the precise moment the next decision is being made. How do we operationalise lessons learned to answer questions like:

    • Prediction
      “Given Brand A, the target region of São Paulo, a set of creatives featuring outdoor sports imagery, and an audience group of millennials aged 25–34, how well is the campaign expected to perform?”
    • Recommendation
      “Given Brand B and a target region of Milan, what should the creatives (videos/images) look like to maximise engagement among environmentally conscious consumers aged 15–18?”

    A common suggestion is to simply “ask an AI.” While modern Large Language Models (LLMs) are remarkably capable and encode broad real-world knowledge, they lack the tribal knowledge embedded in your proprietary data. They don’t know your specific brand voice, your audience’s unique quirks, or the subtle patterns behind your past failures. To truly win, you need a system that learns from your history—the hits, the misses, and everything in between.

    To address this, WPP Research invests significant effort in developing prediction and recommendation models trained on large and diverse volumes of historical campaign data. These models are highly competitive and continuously improving. However, at some point during development, progress inevitably hits a plateau: even incremental gains, rarely exceeding 1%, demand extensive bibliographic research, days or even weeks of trial-and-error experimentation, and painstaking fine-tuning.

    With time at a premium, a vast space of possible improvements to explore (architectural changes, hyperparameter tuning), and experiments that are inherently slow to run, we turned to Google’s AlphaEvolve (AE) [1]: a Gemini-powered agentic framework that reframes model development as an evolutionary search problem. Rather than relying on manual experimentation, AlphaEvolve autonomously proposes, evaluates, and refines candidate model architectures in an iterative loop, guided by the expertise of our Data Science team and grounded in objective performance metrics.

    The results are striking: what weeks of manual experimentation struggled to improve by a single percentage point, AlphaEvolve achieved in a fraction of the time, delivering prediction accuracy gains of up to 10% on both synthetic and real datasets, while simultaneously lifting downstream recommendation scores by up to 7%.

    Our access to AlphaEvolve came through Google’s Early Access Program (EAP), within the context of the ongoing partnership between Google and WPP Research. Throughout our adventures with AlphaEvolve, we have been collaborating closely with the Google Research team, providing and receiving feedback. This collaboration has been invaluable to the project’s success.

    The AlphaEvolve Advantage

    Building a good AI model is painfully slow. A team of experts reads through mountains of research papers, rewrites code by hand, and runs experiments that can take days, only to find the improvement is tiny, or worse, a dead end. This research → code → test → repeat cycle creates a huge gap between having data and actually getting value from it.

    And even after you pick a model architecture, you still have to tune it. Think of it like adjusting the equalizer on a stereo: dozens of sliders, each affecting the sound, and you’re trying to find the perfect combination by ear. Techniques like grid search and Bayesian optimization help, but they’re still limited by what the human designer guesses might work. Not what the data actually needs. Trying every possible combination? Far too expensive and slow.

    The honest truth is that the search space is simply too vast for human intuition and trial-and-error to navigate. This is exactly where AlphaEvolve (AE) changes the game.

    Instead of a person manually tweaking one model at a time, AE treats the entire development process as an evolutionary search. Much like natural selection, but for code. It generates candidate models as functional programs, runs them, and scores each one against a target metric. It doesn’t just tune models. It redesigns them.

    Under the hood, AE is powered by Google’s state-of-the-art Gemini model, working hand-in-hand with a curated program database from Google DeepMind. Together, they explore millions of possible code configurations, zeroing in on the most accurate solution that meets our constraints. A search of this breadth would take a human team months. AlphaEvolve does it in a fraction of the time.

    By shifting from manual experimentation to this autonomous framework, we don’t just speed things up. We uncover strategies and architectures that human intuition alone would never find. Figure 3 illustrates this iterative loop in action.

    Figure 3: AlphaEvolve is a Gemini-powered coding agent from Google that automatically improves algorithms through a “generate, test, and refine” loop. The user provides three inputs: a description of the problem, a way to score candidate solutions, and a starting program to build from. AlphaEvolve then proposes many code variations using Gemini, scores each one automatically, and keeps the best-performing ideas, recombining and evolving them over multiple rounds, much like natural selection. With each cycle, the solutions get sharper, often surpassing what the original starting point could achieve.
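
    The “generate, test, and refine” loop can be illustrated with a toy evolutionary search. This is a sketch of the general idea only, not AlphaEvolve's implementation: candidates here are pairs of numbers rather than programs, propose_variations is a hypothetical stand-in for Gemini proposing code edits, recombination is omitted, and the scoring function is invented.

```python
import random


def score(candidate: tuple) -> float:
    """User-supplied evaluator: higher is better. Toy optimum at (3, -2)."""
    a, b = candidate
    return -((a - 3) ** 2 + (b + 2) ** 2)


def propose_variations(parent: tuple, rng: random.Random, n: int = 8) -> list:
    """Hypothetical stand-in for the LLM: perturb the parent n times."""
    return [tuple(x + rng.gauss(0, 0.5) for x in parent) for _ in range(n)]


def evolve(seed_program: tuple, rounds: int = 30, population: int = 4, rng_seed: int = 0) -> list:
    """Generate variations, score them, keep the best, repeat."""
    rng = random.Random(rng_seed)
    pool = [seed_program]
    for _ in range(rounds):
        children = [c for parent in pool for c in propose_variations(parent, rng)]
        # Parents compete with their children, so the best score never drops.
        pool = sorted(pool + children, key=score, reverse=True)[:population]
    return pool  # ranked, best first


ranked = evolve(seed_program=(0.0, 0.0))
print(ranked[0], score(ranked[0]))
```

    The three inputs from the caption map directly onto the sketch: the problem description is implicit in score, the evaluator is score itself, and the starting program is seed_program.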

    Guided Evolution: The Human in the Loop

    AlphaEvolve is autonomous, but it is not unsupervised. Think of it as digital evolution: the AI proposes ideas and keeps only the winners to build upon in the next generation. This process still requires careful navigation by our Data Scientists, who provide clear system instructions and constraints to guide the search through an infinite landscape of potential improvements, while inspecting for deviations introduced by the stochastic nature of LLMs. The result is a search that stays focused on logical, high-quality architectures and respects the real-world boundaries of the problem we are addressing.

    In the example below, we illustrate the inputs that AE expects from the human in the loop, as well as the output that it produces.

    Input 1: A System Prompt that describes the problem and steers the evolution towards particular search directions.

    An example system prompt is: “Evolve a training model for a neural network 3-class classifier that achieves high accuracy on a provided dataset. The model must consist of a loss function that… . Focus on the multi-objective optimization of the following scores… Consider changing the model architecture to include…”

    Think of the System Prompt as the instruction manual you hand to AlphaEvolve before it starts work. Imagine hiring a highly skilled but very literal engineer. They’re brilliant, but they need a clear, written brief to work from — they won’t assume anything. The System Prompt is that brief. It channels AlphaEvolve’s enormous computational power toward the right problem, in the right direction. It covers:

    • What the job is — e.g., “Build a model that can classify campaign outcomes into three categories.”
    • What the rules are — constraints it must respect, such as how the input data is structured or what the model architecture must look like.
    • Where to focus — specific areas to explore and improve, for example: “Try changing the loss function”.
    • What success looks like — the specific performance goals it should be optimising for (e.g., accuracy scores). This is also why the human expertise of the Data Science team remains critical.

    Input 2: A Seed Program with an initial solution that you hand to AlphaEvolve to improve.

    Rather than asking AlphaEvolve to build something from scratch, you give it a model that already works — and ask it to make it better. The team deliberately marks which parts AlphaEvolve is permitted to experiment with (using special labels in the code), and which parts must remain untouched. The Seed Program represents the accumulated expertise and investment already put into your AI models. AlphaEvolve doesn’t throw that away — it builds on top of it. It’s the difference between renovating a solid building versus demolishing it and starting over.

    Input 3: The Target metric that AE will attempt to maximise in order to achieve our objective.

    The Target Metric is essentially how the business defines “better.” This is a critical decision made by the Data Science team — not the AI. If the metric is well-chosen, AlphaEvolve will find solutions that genuinely deliver business value. If it’s poorly defined, the AI could optimize for the wrong thing entirely. Imagine you’re running a sales team and you’ve set a clear goal: maximise the conversion rate. Every change your team tries — new pitch, new pricing, new outreach method — gets evaluated against that one number. If a change improves the conversion rate, you keep it. The Target Metric works exactly the same way for AlphaEvolve. It might be something like “predict campaign performance as accurately as possible” — expressed as a single numerical score. AlphaEvolve runs each candidate model, checks the score, and keeps only the ones that do better. So the Target Metric is the objective, measurable definition of what winning looks like.

    Input 4: The Stopping criteria.

    The Stopping Criteria is simply the pre-agreed rule for when to call it done. Since AlphaEvolve could theoretically keep running and experimenting forever, the team sets clear boundaries upfront for when the experiment should end:

    • A maximum number of rounds, e.g., “Run up to 500 iterations, then stop.”
    • A performance threshold, e.g., “Stop as soon as the model reaches 90% accuracy.” This is like saying: “Once we’ve hit our goal, there’s no need to keep going.”
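
    The two example criteria can be combined into a single check, sketched below. The function name, its defaults and the fake accuracy curve are ours. AlphaEvolve's actual configuration for stopping is not shown here.

```python
def should_stop(iteration: int, best_accuracy: float,
                max_iterations: int = 500, target_accuracy: float = 0.90) -> bool:
    """Stop when either the round budget or the performance threshold is reached."""
    return iteration >= max_iterations or best_accuracy >= target_accuracy


# Toy run: accuracy creeps up by a fixed amount per round (placeholder for real evolution).
iteration, best_accuracy = 0, 0.60
while not should_stop(iteration, best_accuracy):
    iteration += 1
    best_accuracy = min(1.0, best_accuracy + 0.001)

print(f"stopped after {iteration} rounds at accuracy {best_accuracy:.3f}")
```

    Whichever condition fires first ends the run, so the iteration cap acts as a cost ceiling even if the accuracy target is never reached.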

    Output: a ranked list of improved AI models.

    Figure 4 shows a ‘before’ (left) and ‘after’ (right) comparison of a section of a seed program that AlphaEvolve was asked to improve. Changes are highlighted in green. We observe several changes:

    • Training parameters were upgraded. For example, the number of training cycles (EPOCHS) was increased, the model’s internal size (PROJ_DIM) grew and a regularisation setting (WEIGHT_DECAY) was adjusted. These are the kind of fine-tuning decisions that would normally take a data scientist considerable time and experimentation to arrive at.
    • The model’s internal logic was redesigned. The component responsible for processing data (the “encoder”) was restructured and even renamed to better reflect its purpose. AlphaEvolve didn’t just tweak numbers. It proposed a more sophisticated architecture. New techniques were introduced.
    Figure 4: Example of an evolved block of code, in a segment that AE is permitted to modify. The function contents are modified, and name changes are propagated to other code blocks. Note that training parameter values are suggested as well, indicating that architectural changes are paired with compatible hyperparameter tuning.

    Results: does it actually work?

    AlphaEvolve was applied to two core problems:

    • Performance Prediction, which estimates a campaign’s performance based on its configuration.
    • Performance-aware recommendation, which suggests the optimal way to complete/update a campaign’s configuration, in order to maximise its performance.

    Both models had already reached a highly competitive baseline, with further manual improvements stalling below 1%.

    Datasets

    We evaluated all models on a suite of six datasets: five synthetic (generated with an internally developed pipeline; details can be found here) and one real-world. The synthetic datasets span a range of regimes (easy, medium or hard) depending on the noise profile and class balance, where classes with fewer samples are characterized as minority: easy and imbalanced (V15), medium and imbalanced (V16), hard and imbalanced (V17), and medium and balanced (V25, V26). The real-world dataset consists of actual historical campaign records and serves as the ultimate validation of whether gains observed on synthetic data transfer to production conditions.

    Prediction

    Three top-performing AE-evolved variants (Centroid_Loss, Cross_Modal_Attn, Focal_Loss) were identified across multiple experiments. All three consistently outperformed the base model across synthetic and real-world datasets.

    To assess model performance we use the industry-standard F1 score, which measures how good a model is at classification (class ‘POS’ is high-performing, class ‘NEG’ is low-performing, class ‘AVG’ is average-performing) by balancing two things. Precision: “When the model says something is positive, how often is it right?” Recall: “Out of all the actual positives, how many did the model catch?” If the model is good at one but terrible at the other, the F1 score will be low. We calculate the F1 score separately for each class (NEG, AVG, POS), then take their plain (unweighted) average, reported as the avg F1-score.
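
    As a worked example of the metric, the sketch below computes per-class F1 and their plain average on six invented predictions. The function names and the toy labels are ours, and the real evaluation may use a library implementation instead.

```python
def per_class_f1(y_true: list, y_pred: list, label: str) -> float:
    """F1 for one class, treating that class as 'positive' and the rest as 'negative'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def avg_f1(y_true: list, y_pred: list, labels=("NEG", "AVG", "POS")) -> float:
    """Plain (unweighted) average of the per-class F1 scores."""
    return sum(per_class_f1(y_true, y_pred, label) for label in labels) / len(labels)


# Invented toy predictions, only to exercise the functions.
y_true = ["NEG", "AVG", "AVG", "AVG", "POS", "POS"]
y_pred = ["NEG", "AVG", "AVG", "POS", "POS", "AVG"]

for label in ("NEG", "AVG", "POS"):
    print(label, round(per_class_f1(y_true, y_pred, label), 4))
print("avg F1", round(avg_f1(y_true, y_pred), 4))
```

    Because every class counts equally in the average, a model that never predicts a minority class gets an F1 of 0 for it and is penalised heavily, however well it does on the majority class.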

    • On easy/medium synthetic data (V15, V16): Cross_Modal_Attn achieved the strongest overall performance, reaching 93.09% avg F1-score on V15 (vs. 90.22% baseline) and a striking +11.6 percentage point improvement on the hardest-to-classify minority class POS on V16 (POS F1: 80.20% vs. 68.61%).
    • On the hardest synthetic dataset (V17): Focal_Loss broke through a performance floor that other variants could not — the base model scored 0% on both minority classes (NEG and POS), while Focal_Loss achieved 15.83% and 25.39% respectively.
    • On real-world data: Centroid_Loss delivered the most practically significant gains — +8pp avg F1 (71% vs. 63%), +11.74pp NEG F1, +8.33pp POS F1, and +5.11pp accuracy — validating that AE’s improvements hold on actual production data.

    Across all datasets and variants, gains on minority classes (correctly identifying high-performing and low-performing campaigns) were consistently larger than gains on the majority class. This is particularly valuable because minority-class accuracy is the critical input for the recommendation model.

    Recommendation

    The recommendation model, which relies on the prediction model’s outputs, was evaluated both in isolation and in a fully evolved end-to-end pipeline. The recommendation score (higher is better) measures how good the recommendations are by comparing them against a known “ground truth” (available for synthetic datasets only). It rewards recommendations that correctly identify high-performing campaign configurations and penalizes two kinds of failure: i) empty recommendations (the model couldn’t suggest anything) and ii) low-quality recommendations (the model suggested something, but it performs poorly).

    • Swapping in the AE-evolved predictor alone improved recommendation scores meaningfully: +6.5% on easy data (V15), +9.8% on medium data (V16), and lifted the hard dataset (V17) from a score of 0.0 (which essentially means that all recommendations were wrong) to 0.29.
    • Combining the AE-evolved predictor with an AE-evolved recommender produced the strongest results across all datasets, with the fully evolved pipeline achieving scores of 0.5 (V15), 0.4 (V16), and 0.36 (V17) — confirming that the gains from prediction and recommendation evolution are additive.
    • Recommendation improvements of up to 7% were observed when both components were evolved together.

    Conclusion

    AlphaEvolve works — and it works exceptionally well. It represents a meaningful and measurable step forward in model development. Applied to WPP AI Lab’s campaign prediction and recommendation models, which had already reached a performance plateau through conventional means, AlphaEvolve delivered prediction accuracy gains of up to 10% on both synthetic and real datasets, while simultaneously lifting downstream recommendation scores by up to 7%. It surfaces architectural strategies and configurations that lie beyond the reach of human intuition alone, not by replacing the expertise of our Data Science team, but by amplifying it. The human-in-the-loop dynamic remains essential: our scientists shape the search space, define meaningful constraints, and validate the outputs.

    AlphaEvolve does the heavy lifting of exploration. As prediction and recommendation models continue to grow in complexity, AlphaEvolve offers a glimpse of a future where the gap between data collection and model improvement is measured in hours rather than weeks, and where the best-performing systems are not just built by experts, but co-designed with AI.

    This project was a collaboration between the WPP Research team, including Anastasios Tsourtis and Theodoros Lappas, and the AI for Science team at Google Cloud, including (but not limited to) Kartik Sanu, Laurynas Tamulevičius, Nicolas Stroppa, Chris Page, Gary Ng, John Semerdjian, Skandar Hannachi, Vishal Agarwal, Anant Nawalgaria, and Gabriela Hernandez Larios, together with partners at Google DeepMind.

    References

    1. Novikov, A., Vũ, N., Eisenberger, M., Dupont, E., Huang, P.-S., Wagner, A. Z., Shirobokov, S., Kozlovskii, B., Ruiz, F. J. R., Mehrabian, A., Kumar, M. P., See, A., Chaudhuri, S., Holland, G., Davies, A., Nowozin, S., Kohli, P., & Balog, M. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131 [cs.AI]. https://arxiv.org/abs/2506.13131

    Ready to explore the specifics? Read the full technical report for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.

  • Using Synthetic Data to Train and Stress-Test Marketing Machine Learning Models

    Unlocking machine learning experiments across multiple teams with a synthetic data pipeline grounded in marketing knowledge

    Training Machine Learning (ML) models for marketing usually starts with a hard requirement: labelled data that links campaign settings and attributes to actual performance outcomes. You collect campaigns, look at what combinations of brand, audience, platform, and geography performed well, and train a model to learn from those patterns.

    In theory, that sounds straightforward, but in practice, real data is hard to clean and structure, arrives slowly, takes time to accumulate and only reflects combinations you’ve already run. If you’ve never targeted a certain audience on a certain platform for a certain brand, that example simply doesn’t exist in the dataset. And if multiple teams are waiting on that data before they can even begin experimenting, progress stalls fast.

    We ran into exactly that problem.

    We needed a way to start training and benchmarking marketing ML systems before AI-ready real campaign data was available at a useful scale. So instead of waiting for the data, we built a synthetic data pipeline that could generate realistic, labelled training data grounded in how marketing actually works.

    That pipeline ended up unblocking model experiments across multiple teams.

    The Problem With Random Synthetic Data

    Real campaign data is essentially rows of campaign attributes (brand, audience, location, platform, placement, creative, and many more) each labelled with how that combination performed. That’s what a model learns from.

    This kind of data is easy to fake badly. You can always create random combinations of attributes and assign them labels. But for marketing, random is worse than useless if it ignores real-world compatibility. A luxury brand paired with bargain-hunting audiences, or a B2B enterprise software brand matched with a fashion lifestyle platform, doesn’t help an ML model learn. It teaches the wrong lessons.

    So the challenge wasn’t just “generate fake data”. It was:

    1. Capture that marketing knowledge in a structured, machine-readable form
    2. Use that structure to generate realistic campaign configurations at scale

    What we needed was a structured way to encode that compatibility: given any combination of campaign settings, does it make sense or not?

    Encoding Marketing Knowledge as a Graph

    We chose a versatile structure: a graph.

    In a marketing knowledge graph:

    • Nodes represent attribute values for different modalities, such as brand, audience, platform, country, and any other factor that can influence the outcome of a campaign.
    • Edges represent compatibility between two attributes:
      • A positive edge (+) means the pair is expected to work well together within the same campaign.
      • A negative edge (-) means the pair is a bad fit, likely to damage the cohesiveness and the performance of the campaign.
      • No edge means there’s no meaningful signal. A neutral fit.

    That gives us a machine-readable map of marketing relationships.

    Some simple examples:

    • LinkedIn + C-Suite Executives → positive
    • Luxury brand + Budget shoppers → negative
    • Salesforce + TikTok → negative
    • Adidas + K-pop fans → positive
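    A signed graph of this kind can be sketched in a few lines. The class below is a hypothetical illustration, not the pipeline's actual data model: the class and method names are invented, and restricting edges to attributes from different modalities is an assumption based on the post's focus on cross-modality pairs.

```python
from typing import Dict, FrozenSet


class CompatibilityGraph:
    """Signed, undirected graph: +1 = good fit, -1 = bad fit, no edge = neutral."""

    def __init__(self) -> None:
        self.modality: Dict[str, str] = {}          # attribute -> modality
        self.sign: Dict[FrozenSet[str], int] = {}   # unordered pair -> +1 / -1

    def add_node(self, name: str, modality: str) -> None:
        self.modality[name] = modality

    def add_edge(self, a: str, b: str, sign: int) -> None:
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1; neutral pairs have no edge")
        # Assumption: only cross-modality pairs carry a compatibility signal.
        if self.modality[a] == self.modality[b]:
            raise ValueError("edges connect attributes from different modalities")
        self.sign[frozenset((a, b))] = sign

    def compatibility(self, a: str, b: str) -> int:
        """Return +1, -1, or 0 (neutral) regardless of argument order."""
        return self.sign.get(frozenset((a, b)), 0)


g = CompatibilityGraph()
for name, modality in [
    ("LinkedIn", "platform"), ("TikTok", "platform"),
    ("C-Suite Executives", "audience"), ("Budget shoppers", "audience"),
    ("K-pop fans", "audience"), ("Luxury brand", "brand"),
    ("Salesforce", "brand"), ("Adidas", "brand"),
]:
    g.add_node(name, modality)

g.add_edge("LinkedIn", "C-Suite Executives", +1)
g.add_edge("Luxury brand", "Budget shoppers", -1)
g.add_edge("Salesforce", "TikTok", -1)
g.add_edge("Adidas", "K-pop fans", +1)

print(g.compatibility("C-Suite Executives", "LinkedIn"))  # 1
print(g.compatibility("Adidas", "LinkedIn"))              # 0 (no edge, neutral)
```

    Storing each pair as an unordered key keeps lookups symmetric, and leaving neutral pairs out of the dictionary matches the "no edge" convention described above.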

    This structure worked well for three reasons:

    • It naturally captures many-to-many relationships
    • It’s easy to extend with new brands, audiences, and platforms
    • It’s interpretable enough for humans to inspect and validate

    Once you have that graph, you can start generating synthetic campaign examples that are constrained by actual compatibility signals instead of randomness.

    The Bottleneck: Building the Graph Was Expensive

    The obvious way to build this graph was to leverage the capabilities of Large Language Models to classify every possible pair of attributes from a catalogue of brands, audiences, geographies, and other marketing settings of interest.

    That approach can work for small catalogues, such as 20 brands, 50 audiences, 10 countries, and 5 platforms. But those are not especially useful in practice, since ML models need data that is both diverse and high-volume.

    As the catalogue grows, pairwise combinations quickly become a bottleneck. Even a moderately sized catalogue creates thousands of cross-modality pairs, and the number of possible pairs grows quadratically with the number of attributes. That made a brute-force approach too slow and too expensive for routine iteration. Even with batched calls, where one primary attribute is compared against a list of target attributes, the number of LLM calls would still be too high.
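    The quadratic growth is easy to check with a short calculation. The sketch below counts cross-modality pairs for the small catalogue mentioned above (20 brands, 50 audiences, 10 countries, 5 platforms), assuming that only attributes from different modalities are paired. The function name is illustrative.

```python
from itertools import combinations


def cross_modality_pairs(catalogue):
    """Count attribute pairs whose two members come from different modalities."""
    return sum(catalogue[a] * catalogue[b] for a, b in combinations(catalogue, 2))


small = {"brand": 20, "audience": 50, "country": 10, "platform": 5}
doubled = {modality: 2 * size for modality, size in small.items()}

print(cross_modality_pairs(small))    # 2100
print(cross_modality_pairs(doubled))  # 8400
```

    Doubling every modality quadruples the number of pairs, which is the quadratic behaviour that makes exhaustive classification expensive.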

    So we needed a way to build the graph without evaluating the entire space of possible combinations.

    But that creates an obvious dilemma: how do you find the important pairs without first checking them all?

    Two Ways We Approached Graph Generation

    To answer that question, we implemented and compared two graph generation strategies.

    1. Batched brute-force pair classification

    A truly naive strategy would have been to ask the LLM about every single attribute pair one by one, but we did not test that because it is clearly too inefficient to be practical.

    Instead, for each valid cross-modality combination, we selected one primary attribute and asked the LLM to classify its relationship to a batch of up to 25 target attributes as positive, negative, or neutral.

    The batch size of 25 was chosen deliberately:

    Prior work shows that batch size affects LLM classification quality (Van Can et al., 2025): larger batches are more efficient, but can reduce consistency across judgments. We therefore set the batch size as a practical trade-off between efficiency and quality.

    This gave us a strong reference point: broad coverage with a simple implementation, useful for evaluating whether a more efficient method could preserve similar graph quality without the same cost.
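    The batching scheme can be sketched as a call planner that emits one (primary attribute, batch of targets) entry per LLM call. This is a hypothetical reconstruction: the helper names are invented, the LLM call itself is omitted, and choosing the larger modality as the source of primary attributes is an assumption. The catalogue sizes are those of the 160-attribute catalogue used in the comparison later in this post.

```python
from itertools import combinations

BATCH_SIZE = 25  # batch size stated in the post


def make_batches(targets, batch_size=BATCH_SIZE):
    """Split a list of target attributes into chunks of at most `batch_size`."""
    return [targets[i:i + batch_size] for i in range(0, len(targets), batch_size)]


def plan_calls(catalogue):
    """Return one (primary, batch_of_targets) tuple per planned LLM call."""
    calls = []
    for mod_a, mod_b in combinations(catalogue, 2):
        # Assumption: the larger modality supplies the primary attributes.
        primaries, targets = sorted(
            (catalogue[mod_a], catalogue[mod_b]), key=len, reverse=True
        )
        for primary in primaries:
            for batch in make_batches(targets):
                calls.append((primary, batch))
    return calls


def toy_catalogue(sizes):
    """Build placeholder attribute names such as 'brand_0', 'brand_1', ..."""
    return {mod: [f"{mod}_{i}" for i in range(n)] for mod, n in sizes.items()}


catalogue = toy_catalogue({"brand": 60, "audience": 60, "platform": 10, "country": 30})
calls = plan_calls(catalogue)

print(len(calls))                             # 570
print(sum(len(batch) for _, batch in calls))  # 8700
```

    Under these assumptions the planner covers all 8,700 cross-modality pairs in 570 calls, which matches the brute-force figures reported for that catalogue. The match is consistent with this batching rule but does not confirm that it is the exact rule the pipeline used.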

    2. Cluster-first graph generation

    The second approach was designed to reduce the search space before asking the LLM to score anything.

    Instead of classifying every attribute pair directly, we first:

    • embedded the attributes and applied UMAP for dimensionality reduction,
    • clustered them by modality using HDBSCAN,
    • asked the LLM to batch score compatibility between clusters,
    • discarded neutral cluster pairs and their attribute combinations,
    • automatically assigned scores to attribute combinations derived from high-confidence cluster pairs,
    • and asked the LLM to batch classify only the remaining attribute pairs.

    This turned a very large search space into a much smaller one, so the LLM spent time only where useful signals were more likely to exist.
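    The filtering steps after clustering can be sketched as a triage function. The sketch assumes cluster assignments and cluster-level verdicts already exist (the embedding, UMAP, and HDBSCAN steps are not reproduced here). The confidence threshold, the verdict format, and the treatment of a missing verdict as neutral are all assumptions, and the clusters, "Nike", and "Gamers" are made up for the example.

```python
from itertools import product

HIGH_CONFIDENCE = 0.9  # hypothetical cut-off for auto-assigning a cluster's sign


def triage(clusters_a, clusters_b, cluster_verdicts, threshold=HIGH_CONFIDENCE):
    """Split attribute pairs into auto-assigned edges and pairs left for the LLM.

    clusters_a, clusters_b: {cluster_id: [attribute, ...]} for two modalities.
    cluster_verdicts: {(cluster_id_a, cluster_id_b): (sign, confidence)},
    where sign is +1, -1, or 0 (neutral).
    """
    auto_edges, needs_llm = {}, []
    for (ca, members_a), (cb, members_b) in product(
        clusters_a.items(), clusters_b.items()
    ):
        # Assumption: a cluster pair without a verdict is treated as neutral.
        sign, confidence = cluster_verdicts.get((ca, cb), (0, 0.0))
        if sign == 0:
            continue  # neutral cluster pair: drop all of its attribute pairs
        for pair in product(members_a, members_b):
            if confidence >= threshold:
                auto_edges[pair] = sign   # inherit the cluster-level sign
            else:
                needs_llm.append(pair)    # classify at attribute level later
    return auto_edges, needs_llm


brands = {"sportswear": ["Adidas", "Nike"], "enterprise": ["Salesforce"]}
audiences = {"youth": ["K-pop fans", "Gamers"], "executives": ["C-Suite Executives"]}
verdicts = {
    ("sportswear", "youth"): (+1, 0.95),
    ("enterprise", "youth"): (-1, 0.60),
    ("enterprise", "executives"): (+1, 0.97),
}

auto_edges, needs_llm = triage(brands, audiences, verdicts)
print(len(auto_edges), len(needs_llm))  # 5 2
```

    Of the nine possible brand and audience pairs in this toy example, five inherit a sign from their clusters, two are queued for attribute-level classification, and two are dropped as neutral.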

    For small catalogues, the efficiency gains are smaller because many attributes end up as singleton clusters, but the same architecture still applies.

    What Happened When We Compared Them

    On a larger catalogue of 160 attributes — 60 brands, 60 audiences, 10 platforms, and 30 countries — the cluster-based approach performed much better operationally.

    Compared with brute force, it delivered:

    • 53% fewer LLM calls
    • 50.5% less execution time
    • 90.6% of the total edge volume retained

    More importantly, where both methods produced an edge for the same pair, they agreed on the sign 98% of the time. This shows that the cluster-based approach is not systematically changing the meaning of the relationships it recovered.

    The main trade-off was coverage: some pairs found by brute force were filtered out before attribute-level scoring, likely around lower-signal or more borderline cases.
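    The two quality measures can be expressed compactly. In the sketch below, "edge volume retained" is interpreted as the ratio of edge counts and "sign agreement" as the share of shared pairs with the same sign. Both readings are assumptions about how the figures above were computed, and the edges are toy data.

```python
def compare_graphs(reference, candidate):
    """Compare two edge maps of the form {(attr_a, attr_b): sign}, sign in {+1, -1}."""
    shared = reference.keys() & candidate.keys()
    agree = sum(1 for edge in shared if reference[edge] == candidate[edge])
    return {
        "edge_volume_retained": len(candidate) / len(reference),
        "sign_agreement": agree / len(shared) if shared else float("nan"),
    }


# Toy edge maps, invented for illustration.
brute_force = {
    ("Adidas", "K-pop fans"): +1,
    ("Salesforce", "TikTok"): -1,
    ("Luxury brand", "Budget shoppers"): -1,
    ("LinkedIn", "C-Suite Executives"): +1,
    ("Adidas", "TikTok"): +1,
}
cluster_based = {
    ("Adidas", "K-pop fans"): +1,
    ("Salesforce", "TikTok"): -1,
    ("Luxury brand", "Budget shoppers"): -1,
    ("Adidas", "TikTok"): -1,  # disagrees with the reference
}

print(compare_graphs(brute_force, cluster_based))
# {'edge_volume_retained': 0.8, 'sign_agreement': 0.75}
```

    Sign agreement is measured only on pairs present in both graphs, so it says nothing about the pairs that were filtered out. That gap is the coverage trade-off described above.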

    In practice, this gave us a much cheaper way to generate the graph while preserving the compatibility signal that mattered most.

    The scaling advantage becomes even clearer when projected to larger catalogues:

    | Catalogue size         | Total pairs* | LLM calls* (batched brute-force) | LLM calls* (batched cluster-based) |
    |------------------------|--------------|----------------------------------|------------------------------------|
    | 160 attributes         | 8,700        | 570                              | 265                                |
    | 320 attributes (2×)    | ~34,800      | ~2,280                           | ~750                               |
    | 800 attributes (5×)    | ~217,500     | ~14,250                          | ~2,960                             |
    | 1,600 attributes (10×) | ~870,000     | ~57,000                          | ~8,400                             |

    * These are directional estimates extrapolated from the 160-attribute experiment. Actual call volumes will vary with catalogue structure, clustering behavior, and graph densities.

    From Graph to Actual Training Data

    Once we had a signed graph, the next problem was turning it into an actual labelled campaign performance dataset.

    Each row in this dataset represents one synthetic campaign configuration (a combination of attributes drawn from the graph) along with a performance label: pos, neg, or avg. That label is the training target. It describes whether the overall campaign combination is expected to perform well, underperform, or land somewhere in between.

    Important note: The label is not the same as a graph edge. Edges score pairs of attributes; the label scores the whole configuration, aggregated across all of its edge signs.

    Figure 1 – Example of a row from the campaign performance dataset

    This dataset is the output of the second service in the pipeline: the Synthetic Dataset Generator. Its job is to create synthetic campaign records from the graph while respecting configurable constraints such as:

    • how many attributes of each type should appear in each sample,
    • how many positive, negative, and average examples to produce,
    • and what proportion of positive vs. negative edges each label class should contain.

    For example, a positive sample might require a relatively high fraction of positive edges and a low fraction of negative ones. A negative sample would do the opposite, while an average sample would contain more balanced fractions of both.
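    One way to picture the relationship between edge fractions and labels is the sketch below. The thresholds are hypothetical, as is the choice to compute fractions over all attribute pairs in a configuration (including neutral ones) and to fall back to avg when neither rule matches. The two TikTok edges for Adidas and K-pop fans are invented for the example.

```python
from itertools import combinations

# Hypothetical thresholds; in the pipeline these are configurable per label class.
MIN_DOMINANT = 0.6   # minimum fraction of edges with the label's own sign
MAX_OPPOSITE = 0.1   # maximum fraction of edges with the opposite sign


def edge_fractions(config, sign):
    """Return (positive fraction, negative fraction) over all pairs in `config`."""
    pairs = list(combinations(config, 2))
    if not pairs:
        return 0.0, 0.0
    pos = sum(1 for p in pairs if sign.get(frozenset(p), 0) > 0)
    neg = sum(1 for p in pairs if sign.get(frozenset(p), 0) < 0)
    return pos / len(pairs), neg / len(pairs)


def label(config, sign):
    """Assign pos / neg / avg to a whole configuration from its edge fractions."""
    pos, neg = edge_fractions(config, sign)
    if pos >= MIN_DOMINANT and neg <= MAX_OPPOSITE:
        return "pos"
    if neg >= MIN_DOMINANT and pos <= MAX_OPPOSITE:
        return "neg"
    return "avg"


sign = {
    frozenset(("Adidas", "K-pop fans")): +1,
    frozenset(("Adidas", "TikTok")): +1,      # invented edge
    frozenset(("K-pop fans", "TikTok")): +1,  # invented edge
    frozenset(("Salesforce", "TikTok")): -1,
}

print(label(("Adidas", "K-pop fans", "TikTok"), sign))      # pos
print(label(("Salesforce", "K-pop fans", "TikTok"), sign))  # avg
print(label(("Salesforce", "TikTok"), sign))                # neg
```

    The second configuration mixes one positive and one negative edge with one neutral pair, so it lands in the avg class under these thresholds.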

    That gave us complete control of the dataset. The same graph could generate multiple datasets (with different class balances, difficulty levels, noise profiles, and schemas), just by changing configuration, not rebuilding the pipeline.

    Simulated annealing: searching the graph efficiently

    To find valid combinations for each dataset row efficiently, we used a parallelized simulated annealing sampler (Delahaye et al., 2019). The name comes from annealing in metallurgy, where a material is heated and then cooled in a controlled way to reduce defects and settle into a more stable structure.

    Our algorithm follows the same idea. It starts in a “hot” state, exploring many possible campaign configurations and even accepting imperfect ones early on. As it cools, it becomes more selective, swapping attributes in and out until each sample settles into a configuration that satisfies the requested constraints.
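    A single-threaded sketch of this idea is shown below. It is a generic simulated annealing loop, not the pipeline's parallelized sampler. The cost function, cooling schedule, one-attribute-per-modality state, and constraint thresholds are assumptions, and the four edges flagged as invented in the comments are not from the post.

```python
import math
import random
from itertools import combinations


def cost(config, sign, min_pos, max_neg):
    """Constraint violation: 0.0 means the configuration satisfies the request."""
    pairs = list(combinations(config.values(), 2))
    pos = sum(sign.get(frozenset(p), 0) > 0 for p in pairs) / len(pairs)
    neg = sum(sign.get(frozenset(p), 0) < 0 for p in pairs) / len(pairs)
    return max(0.0, min_pos - pos) + max(0.0, neg - max_neg)


def anneal(catalogue, sign, min_pos, max_neg,
           steps=2000, t_start=1.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    current = {m: rng.choice(values) for m, values in catalogue.items()}
    current_cost = cost(current, sign, min_pos, max_neg)
    best, best_cost = dict(current), current_cost
    temperature = t_start
    for _ in range(steps):
        if best_cost == 0.0:
            break  # constraints satisfied
        # Propose a neighbour: swap the attribute of one randomly chosen modality.
        modality = rng.choice(list(catalogue))
        candidate = dict(current)
        candidate[modality] = rng.choice(catalogue[modality])
        candidate_cost = cost(candidate, sign, min_pos, max_neg)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = dict(current), current_cost
        temperature *= cooling
    return best, best_cost


catalogue = {
    "brand": ["Adidas", "Salesforce", "Luxury brand"],
    "audience": ["K-pop fans", "C-Suite Executives", "Budget shoppers"],
    "platform": ["TikTok", "LinkedIn"],
}
sign = {
    frozenset(("LinkedIn", "C-Suite Executives")): +1,
    frozenset(("Luxury brand", "Budget shoppers")): -1,
    frozenset(("Salesforce", "TikTok")): -1,
    frozenset(("Adidas", "K-pop fans")): +1,
    frozenset(("K-pop fans", "TikTok")): +1,              # invented edge
    frozenset(("Adidas", "TikTok")): +1,                  # invented edge
    frozenset(("Salesforce", "C-Suite Executives")): +1,  # invented edge
    frozenset(("Salesforce", "LinkedIn")): +1,            # invented edge
}

# Request a "positive" sample: at least 60% positive edges, no negative edges.
best, best_cost = anneal(catalogue, sign, min_pos=0.6, max_neg=0.0)
print(best, best_cost)
```

    On a catalogue this small an exhaustive search would do the same job. The annealing loop becomes useful when the number of possible configurations is too large to enumerate.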

    Downstream Impact and ML Experiments Unlocked

    This service was not just a technical exercise. Its purpose was to unblock machine learning workstreams while real campaign data was still limited, not ready, or missing key combinations. Without it, multiple experiments would have been blocked.

    The Synthetic Dataset Generator produced 49 synthetic datasets, built from multiple graph versions and configurations. Those datasets were used to both train and stress-test models across different teams and modelling approaches. Each dataset varied in class balance, difficulty, and noise to probe how models behaved under pressure. Experiments included:

    • Campaign performance prediction
    • Federated learning experiments
    • Architecture search and model benchmarking
    • Comparisons between fine-tuned LLMs and custom classifiers

    We also built a shared model leaderboard so teams could compare results across dataset versions and training approaches without manual coordination.

    That created a common experimental foundation before real data was fully ready.

    What Synthetic Data Did (and Didn’t) Solve

    Synthetic data was an accelerator, not a replacement for real data.

    It let us:

    • start ML experiments earlier,
    • benchmark model architectures,
    • explore dataset schemas,
    • test class balance and difficulty settings,
    • and support teams that would otherwise have had to wait.

    But it also has several limitations:

    The biggest one is that graph edges are still inferred, not directly validated against large-scale real campaign outcomes. We verified obvious cases, but many of the more ambiguous relationships remain assumptions generated from LLM reasoning rather than empirical evidence.

    References

    Van Can, A. T., Aydemir, F. B., & Dalpiaz, F. (2025). One size does not fit all: On the role of batch size in classifying requirements with LLMs. In Proceedings of the 2025 IEEE 33rd International Requirements Engineering Conference Workshops (REW 2025) (pp. 30–39). IEEE.

    Tam, Z. R., Wu, C.-K., Tsai, Y.-L., Lin, C.-Y., Lee, H.-Y., & Chen, Y.-N. (2024). Let me speak freely? A study on the impact of format restrictions on performance of large language models. arXiv:2408.02442. https://doi.org/10.48550/arXiv.2408.02442

    Delahaye, D., Chaimatanan, S., & Mongeau, M. (2019). Simulated annealing: From basics to applications. In M. Gendreau & J.-Y. Potvin (Eds.), Handbook of Metaheuristics (Vol. 272, pp. 1–35). Springer. https://doi.org/10.1007/978-3-319-91086-4_1