Hermes Agent is an open-source, self-hosted AI agent released under an MIT license by Nous Research, the lab behind the Hermes model family. Its main pitch is a built-in learning loop. Instead of resetting to zero every session, Hermes runs a post-execution review after each successful task, distills the steps that worked into a reusable, Markdown-defined “skill”, and refines those skills the next time it hits a similar problem. It also keeps persistent memory across sessions, so it gradually builds a model of your projects and how you like things done, effectively learning through experience.
Unlike a copilot tethered to an IDE, Hermes is meant to live on a server and run unattended – a $5 VPS, a GPU box or serverless infra that costs almost nothing when idle. It talks to hundreds of LLMs through the OpenAI-compatible interface, can communicate via Telegram, Discord, Slack, WhatsApp, Signal, email and a CLI, supports natural-language cron for scheduled jobs, and can spin up subagents to parallelize work. It also ships with 40+ built-in skills out of the box.
What made it especially attractive to us is its ability to write and improve its own playbook. This ability made us think: could it learn how to do our job and fully automate the work we do at the Quick Reactions pod?
Our pod’s mission is to pick up state-of-the-art AI tools, experiment with them, and write an honest, evidence-backed assessment.
What makes our work tricky is the fact that every new tool we evaluate is different. We need to study it, figure out how to set it up, run it, and evaluate the results.
Our 6-step evaluation protocol
We began by encoding our workflow into a strict, 6-step protocol that the agent can follow for every evaluation:
1. Workspace initialization – spin up a clean, isolated project environment so each evaluation starts from a known state.
2. Baseline replication – run the tool’s own “getting started” examples first, to confirm the headline claims reproduce before we push further.
3. Rigorous verification – design and run additional autonomous tests that probe the tool under conditions its authors didn’t pick, rather than extrapolating from the happy path.
4. Data synthesis & metrics control – measure the things that matter (recall, latency, accuracy) against ground truth data.
5. Adversarial peer review – hand the findings to a separate agent powered by a powerful LLM (Claude Opus 4.8), to receive feedback and iterate.
6. Finalization & delivery – format the assessment doc and ship it to the right channel (to our internal Notion knowledge base in our case).
The idea was simple: if these are the steps a human pod member walks through, can an agent walk through them unattended – and would the writeup at the end be any good?
The Skill system: how Hermes improves itself
The first time we ran Hermes on a real task, we walked it through the 6-step protocol by hand. Instead of just following along, Hermes wrote each step down as its own skill – a short Markdown file it can pull up at the start of any future run. So the protocol stopped being something we had to repeatedly provide as input; it became something the agent already knows.
Hermes used this knowledge to build a small set of skills covering everything we do: how to set up a clean workspace, how to test a new tool properly, and how to draft the write-up and run it past the reviewers. On top of those sits one master skill that holds the whole run together – it treats the 6 steps as a checklist and won’t let Hermes jump ahead before the previous step is actually done.
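To make the "checklist that won't let Hermes jump ahead" idea concrete, here is a minimal sketch of such a gate. It is not Hermes' actual skill format (those are Markdown files); the class name, method names and step identifiers are all hypothetical, chosen only to mirror the six steps listed earlier.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Step names mirror the 6-step protocol; the identifiers themselves are hypothetical.
PROTOCOL_STEPS = [
    "workspace_initialization",
    "baseline_replication",
    "rigorous_verification",
    "data_synthesis_and_metrics",
    "adversarial_peer_review",
    "finalization_and_delivery",
]


@dataclass
class ProtocolChecklist:
    """Tracks which protocol steps are done and refuses out-of-order work."""

    completed: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the first step that is not yet done, or None when finished."""
        if len(self.completed) == len(PROTOCOL_STEPS):
            return None
        return PROTOCOL_STEPS[len(self.completed)]

    def complete(self, step: str) -> None:
        """Mark a step as done, but only if it is the one currently due."""
        expected = self.next_step()
        if step != expected:
            raise RuntimeError(f"cannot complete {step!r}; next step is {expected!r}")
        self.completed.append(step)


checklist = ProtocolChecklist()
checklist.complete("workspace_initialization")
try:
    checklist.complete("adversarial_peer_review")  # tries to skip three steps
except RuntimeError as err:
    print(err)
print(checklist.next_step())
```

The point of the gate is that ordering is enforced by structure rather than by the agent remembering the rules on every run.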
Impressively, when the LLM peer reviewer flagged something during an experiment (e.g. an unfair baseline, a missing caveat, a poorly designed experiment), Hermes would learn from the feedback and address the issue. It would also record its new learnings in the related skill files, so the next experiment started from a slightly stronger playbook.
That’s the self-improvement loop the Hermes pitch promises, and we actually watched it happen. The more experiments we run, the better the skills get, and the less we need to babysit Hermes for the next one. The underlying LLM that powers Hermes (Gemini 3.1 Pro) isn’t getting smarter – its playbook is, and Hermes is the one rewriting it.
Putting Hermes to the test
We gave Hermes a single, real assignment: take a brand-new open-source tool called Turbovec – which claims to store huge amounts of data in a tiny amount of memory and search it faster than the popular alternative – and find out whether those claims actually hold up.
We handed the agent the tool, its documentation, a bare cloud machine to work on, and nothing else: no starter code, no template, no outline. Hermes had to decide what and how to test, run the experiments, and write the whole thing up on its own.
We reviewed Hermes’ output in the exact same way we would review a human colleague’s work, via four simple questions:
Did it manage to set up the tool and run experiments? Yes! It wasn’t all smooth sailing. Hermes’ first pass used the wrong settings. One of the integrations that Turbovec advertised also didn’t work on the first try – a common challenge we face in the Quick Reactions pod. However, rather than getting blocked by these stumbles, the agent noticed them, fixed them and left a clear trail of what went wrong and how it was corrected – exactly the kind of thing a rushed human reviewer might quietly skip over.
Did it design a fair test? Yes! It first reproduced the tool authors’ own results, then set up an even-handed comparison against the leading alternative tool (FAISS) as a baseline. It was also careful enough to optimize the baseline tool’s configuration (rather than making it a deliberately weak one), ensuring that the contest wasn’t rigged in Turbovec’s favor.
Did it get the numbers right? Yes, with a bit of extra AI help. For instance, its first attempt used a small data sample and then just assumed the results would scale up neatly. The adversarial peer-review step (step 5 in our 6-step workflow) caught that this assumption was unsafe. Hermes accepted the criticism and re-ran the full-size test. The adversarial reviewer turned out to be right – using the small sample would have significantly skewed the results.
Was the writeup appropriate? Yes, with a bit of extra AI help. Hermes’ original draft omitted some critical details and also inflated some of the findings. Thankfully, the LLM reviewer’s feedback in step 5 also ensured that claims got toned down to what the data actually supported.
So is it worth it?
Very promising. Left alone with a new GitHub repo and a blank machine, Hermes handled the mechanical, time-consuming work on its own: it downloaded the repo, installed everything, and ran some initial tests to make sure everything runs.
Even though it did stumble during the actual experimental design and assessment of the tool (Turbovec), the introduction of our second Reviewer agent was enough to address these issues and deliver, with no human in the loop.
Obviously this is just a single – albeit very promising – piece of evidence. We will keep pushing the limits of Hermes in the context of our pod’s work, with the intent to automate and scale up our assessment efforts as much as possible.
Another angle we plan to explore is cost minimization. This first experiment showed the effectiveness of the iterative, dual-agent architecture:
An affordable Gemini-powered Hermes to actually do the heavy lifting (open-ended, token-heavy)
A more expensive Opus-powered Reviewer to review the report after each iteration and provide feedback (single-shot, token-lean)
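The iterative, dual-agent pattern can be sketched as a simple loop: the worker drafts, the reviewer returns a list of issues, and the worker redrafts until the reviewer has nothing left to flag or a round limit is hit. The function names, the round limit and the toy worker/reviewer below are all hypothetical stand-ins for the real LLM-backed agents.

```python
from typing import Callable, List, Tuple


def review_loop(
    draft_fn: Callable[[List[str]], str],
    review_fn: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Alternate between a worker that drafts and a reviewer that critiques."""
    feedback: List[str] = []
    report = draft_fn(feedback)          # first draft, no feedback yet
    for _ in range(max_rounds):
        issues = review_fn(report)       # single-shot review of the current draft
        if not issues:
            break                        # reviewer is satisfied
        feedback.extend(issues)
        report = draft_fn(feedback)      # worker redrafts with all feedback so far
    return report, feedback


# Toy stand-ins for the two agents (assumptions, for illustration only).
def toy_worker(feedback: List[str]) -> str:
    if "use the full-size dataset" in feedback:
        return "report: measured on the full dataset"
    return "report: extrapolated from a small sample"


def toy_reviewer(report: str) -> List[str]:
    return ["use the full-size dataset"] if "extrapolated" in report else []


final_report, all_feedback = review_loop(toy_worker, toy_reviewer)
print(final_report)
print(all_feedback)
```

Keeping the reviewer single-shot per round is what keeps the expensive model's token usage low relative to the worker's.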
A key question – and one that we keep facing in WPP Research – is: what is the cheapest LLM brain that we could use for each agent while maintaining output quality?
DeepSeek’s release pattern has been consistent: ship a model that posts frontier-comparable benchmark numbers at an order-of-magnitude lower price, then watch the other providers scramble. DeepSeek v4 Pro is the most aggressive instance of that pattern yet. Rather than putting it to the test via a standard off-the-shelf agentic benchmark, we focused on a more pertinent question: what happens when you actually deploy it as the LLM brain of an agent?
“Agentic ability” packs in at least three dimensions:
Behaving like a competent professional in realistic settings. Stay in character. Stay focused on your objective. Communicate effectively with different types of stakeholders. Respect policies and constraints. Validate the quality of the information that you consume and produce. Adapt to changing circumstances.
Long-horizon autonomous problem solving. Read a codebase, form a hypothesis, run an experiment, read the result, build on it. Repeat many times without a human in the loop.
Cost-efficiency under sustained load. Being able to solve complex problems and succeed in real-world scenarios for $0.05 per task is a very different proposition from achieving the same outcomes for $1.50 per task, even if the success rate is identical.
To evaluate DeepSeek across all three dimensions, we picked two different agentic tools:
VerifyAX (Conscium’s agent-evaluation platform). VerifyAX drops an agent into the kind of situation it would actually face once deployed: a realistic scenario populated by other characters (customers, interviewers, colleagues, adversaries), with an objective to achieve and rules to follow. Scenarios are automatically generated to exercise a wide panel of specific skills, from communication and safety to technical reasoning.
Autoresearch (Andrej Karpathy’s open-source autonomous-research framework). It hands an agent a piece of code and a metric to improve, then steps back. The agent reads the code, makes one change, runs it, checks whether the metric improved, and decides whether to keep or undo the change. This continues for many iterations, with no human in the loop.
Cost (the third dimension) can be easily measured on both VerifyAX and Autoresearch, so we report it separately.
To benchmark DeepSeek v4 Pro against the rest of the frontier, we set up four agents, each powered by a different LLM: DeepSeek v4 Pro, GPT 5.5 (OpenAI), Claude Opus 4.6 (Anthropic), and Gemini 3.1 Pro (Google). The agent harness is identical across all four; only the LLM brain changes.
2. Experiment A: VerifyAX
Setup
We ran all four agents (powered by Opus 4.6, Gemini 3.1 Pro, DeepSeek v4 Pro, and GPT 5.5) through two VerifyAX-generated scenario suites:
The harder suite — 25 multi-agent scenarios, in which the agent is given a fixed amount of time to navigate multiple stakeholders (manifested as VerifyAX-controlled agents), exchange information, use tools, and make decisions in order to achieve an objective. Each scenario is designed to test for specific skills.
The easier suite — 25 one-on-one scenarios. Similar to the harder suite, but with only one stakeholder in each scenario.
Every run produces a pass/fail verdict per scenario plus a per-skill score (1–5) for every skill the scenario tests for. Across the 50 scenarios, the two suites collectively exercise 60+ distinct skills, drawn from a wider VerifyAX panel that covers safety, reasoning under pressure, information extraction, communication skills, and technical execution. A representative sample is shown in the table below:
| Skill tag | What it tests |
| --- | --- |
| goal_injection_resistance | Resists being hijacked into a new, conflicting mission mid-scenario. |
| incremental_corruption_resistance | Notices when a chain of seemingly-harmless requests cumulatively breaks the rules. |
| sycophancy_resistance | Refuses to validate a demonstrably false statement just because the counterparty insists on it. |
| hate_speech_refusal | Refuses to produce hate speech or discriminatory content. |
| irreversible_action_caution | Insists on explicit informed consent before taking unrecoverable actions. |
| ambiguity_clarification | Pins down vague jargon or undefined terms before acting. |
| contradiction_handling | Calls out inconsistencies in a counterparty’s statements tactfully, without triggering defensiveness. |
| tradeoff_reasoning | Weighs competing options and articulates the cost of each choice. |
| data_hallucination_resistance | Asks for missing materials instead of inventing them when context is incomplete. |
| long_horizon_thinking | Plans and sequences actions across many steps toward a distant goal. |
| anger_deescalation | Acknowledges emotion, defuses conflict, proposes concrete next steps. |
| intermediate_math | Solves multi-step quantitative problems. |
| advanced_web_research | Answers complex questions requiring browsing, cross-referencing multiple sources, and synthesis. |
| advanced_programming | Solves complex programming problems. |
Result
Harder suite (multi-agent):
| Model | Pass rate | Avg skill grade (1–5) | Agent cost | $ / scenario |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 13/25 (52%) | 4.63 | $19.95 | $0.80 |
| Gemini 3.1 Pro | 8/25 (32%) | 4.42 | $1.81 | $0.07 |
| DeepSeek v4 Pro | 7/25 (28%) | 4.04 | $0.48 | $0.02 |
| GPT 5.5 | 7/25 (28%) | 4.25 | $1.79 | $0.07 |
Easier suite (one-on-one):
| Model | Pass rate | Avg skill grade (1–5) | Agent cost | $ / scenario |
| --- | --- | --- | --- | --- |
| Claude Opus 4.6 | 23/25 (92%) | 4.83 | $17.09 | $0.68 |
| Gemini 3.1 Pro | 23/25 (92%) | 4.71 | $0.99 | $0.04 |
| GPT 5.5 | 20/25 (80%) | 4.52 | $1.19 | $0.05 |
| DeepSeek v4 Pro | 19/25 (76%) | 4.58 | $0.30 | $0.01 |
Three things jump out:
Claude wins, comfortably. On the harder suite it passes 52% of scenarios while everyone else clusters at 28–32%, and it posts the highest macro-averaged skill grade on that suite (4.63). On the easier suite, Claude and Gemini tie on pass rate at 92%, but Claude still edges Gemini on skill grade (4.83 vs 4.71).
DeepSeek sits at the bottom on both suites. On the harder suite it ties GPT 5.5 at the bottom of the table (both 7/25, 28%). On the easier suite it’s the weakest of the four on pass rate (19/25, 76%), though its skill grade (4.58) actually edges GPT’s (4.52).
The cost spread is startling. Claude’s per-scenario spend is ~40× DeepSeek’s on the harder suite and ~55× on the easier one. GPT 5.5 and Gemini sit in the same mid-range bracket (~$0.04–0.07/scenario); only Claude is in a different tier.
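The per-scenario figures and cost multiples can be re-derived from the "Agent cost" column of the two result tables. The sketch below does only that arithmetic; the dictionary layout and the helper name `cost_ratio` are our own. Ratios are taken on total suite cost, so they can differ slightly from ratios of the rounded per-scenario figures.

```python
SCENARIOS_PER_SUITE = 25

# Total agent cost in USD per suite, copied from the result tables.
agent_cost_usd = {
    "harder": {
        "Claude Opus 4.6": 19.95,
        "Gemini 3.1 Pro": 1.81,
        "DeepSeek v4 Pro": 0.48,
        "GPT 5.5": 1.79,
    },
    "easier": {
        "Claude Opus 4.6": 17.09,
        "Gemini 3.1 Pro": 0.99,
        "GPT 5.5": 1.19,
        "DeepSeek v4 Pro": 0.30,
    },
}


def cost_ratio(suite: str, model: str, baseline: str = "DeepSeek v4 Pro") -> float:
    """How many times more a model cost than the baseline on one suite."""
    costs = agent_cost_usd[suite]
    return costs[model] / costs[baseline]


for suite, costs in agent_cost_usd.items():
    for model, total in costs.items():
        per_scenario = total / SCENARIOS_PER_SUITE
        ratio = cost_ratio(suite, model)
        print(f"{suite:7s} {model:16s} ${per_scenario:.2f}/scenario  {ratio:5.1f}x DeepSeek")
```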
3. Experiment B: Autoresearch
Setup
Each agent starts with just two files:
train.py — a ~630-line script that trains a small language model from scratch. The starting model is small by today’s standards — 8.7M parameters, the kind of tiny transformer you’d find in an early GPT-2.
program.md — a short prompt telling the agent what to do.
The agent then runs unattended, looping through these steps:
Read the current training script and a log of everything it has tried before.
Propose one specific code change.
Train the modified model on GPU hardware for a fixed 5-minute budget.
Show the trained model a chunk of text it has never seen and measure how well it predicts what comes next, character by character.
Keep the change if the new score is better than the best so far; otherwise discard it and try something different next time.
Repeat 50 times.
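Stripped of the LLM and the GPU, the loop above is a greedy keep-if-better search. The sketch below shows only that control flow; it is not the Autoresearch code. `propose` and `evaluate` are hypothetical callables standing in for "ask the model for one change" and "train for 5 minutes and score", and we assume a lower score is better.

```python
import random


def run_search(propose, evaluate, baseline, rounds=50, lower_is_better=True):
    """Greedy loop: try one change per round, keep it only if it beats the best so far."""
    best, best_score = baseline, evaluate(baseline)
    history = []  # log of (candidate, score, kept) that the proposer can read
    for _ in range(rounds):
        candidate = propose(best, history)
        score = evaluate(candidate)
        kept = score < best_score if lower_is_better else score > best_score
        history.append((candidate, score, kept))
        if kept:
            best, best_score = candidate, score  # otherwise the change is discarded
    return best, best_score, history


# Toy problem (assumption): the "model" is one number and the score is its
# squared distance from 3.0, so lower is better.
rng = random.Random(0)


def toy_evaluate(x):
    return (x - 3.0) ** 2


def toy_propose(best, history):
    return best + rng.uniform(-1.0, 1.0)


best, best_score, history = run_search(toy_propose, toy_evaluate, baseline=0.0)
kept = sum(1 for _, _, was_kept in history if was_kept)
print(f"best score {best_score:.4f} after {len(history)} rounds, {kept} changes kept")
```

Because every candidate is built on the current best, a discarded change costs one round but never degrades the result.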
Result
| Model | % Improvement | Cost | Successful experiments |
| --- | --- | --- | --- |
| GPT 5.5 | 6.83% | $33.53 | 8 / 50 |
| Gemini 3.1 Pro | 6.12% | $10.21 | 9 / 50 |
| DeepSeek v4 Pro | 6.11% | $3.69 | 8 / 50 |
| Claude Opus 4.6 | 6.04% | $63.45 | 7 / 50 |
The improvement numbers cluster tightly. The cost numbers do not: Opus 4.6 cost ~17× more than DeepSeek v4 Pro for essentially the same outcome.
More interesting than the bottom line is the strategy fingerprint each model converged to. All four independently rediscovered the same single biggest win: cutting the training batch size in half. This trades updates computed from fewer examples per step (and hence noisier) for more update steps inside the 5-minute budget, which is a good trade when the bottleneck is wall-clock time, not data. After that they diverged:
Opus 4.6 kept the network’s outer shape and redesigned the building blocks inside each layer — a more expressive math operation in every block (an activation function called SwiGLU, instead of ReLU²) and 50% more internal capacity per layer. Same outside, smarter inside.
GPT 5.5 opted for a smaller, faster network (3 layers instead of 4, shorter context windows) so it could fit more training steps into the budget, with optimizer settings tuned to make those extra steps count.
DeepSeek v4 Pro combined GPT’s move with the only attention-mechanism change that survived in any model’s final config: grouped-query attention (reusing key/value projections across heads to compress the attention block).
Gemini 3.1 Pro left the network alone and changed how it was trained — same layers, same shape, same building blocks, but turned learning rates up and drove weight decay to zero. Every architectural change it tried, it reverted.
Those architectural moves had visible consequences for the final model size: Opus’s wider MLP made the model bigger than the 8.7M-parameter baseline, Gemini kept it at baseline size, GPT shrank it, and DeepSeek shrank it most — to 3.4M parameters, less than half the baseline.
Four genuinely distinct strategies landing within 0.8 percentage points of each other is itself an interesting result, suggesting there are several different ways to win at this task within the 5-minute budget, and that experimenting with different models can lead to distinct but equally promising paths. Increasing the rounds and the time budget per round can help explore these paths further.
The Bottom Line
On Autoresearch, where the LLM-powered agent is the only stakeholder and the loop is a tight code-edit / measure-result cycle, DeepSeek lands within 0.8 pp of the top model on outcome (6.11% improvement vs 6.83% for GPT 5.5) at a fraction of the cost. On VerifyAX, where the agent has to survive multi-agent simulations that emulate real-world scenarios, it keeps its cost advantage but lands at the bottom of the rankings on pass rate (tied with GPT 5.5 on the harder suite) and at or near the bottom on skill grade. This highlights something we already knew: the key is to pick the right tool (LLM brain) for the job. If you care about cost and expect your agent to work on a problem on its own for many iterations, the evidence suggests that DeepSeek is worth a shot. However, if you expect your agent to operate in dynamic real-world environments with other stakeholders and complex constraints, then there are better options out there.
If you are looking for an LLM brain that can perform in both contexts, Gemini is the most consistent of the four: it is always in the top half, ties Claude on pass rate on the easier VerifyAX suite at a fraction of the cost, and finishes Autoresearch essentially tied with DeepSeek. It is the pragmatic default if you don’t want to commit to either end of the cost/capability spectrum.
Our experiments are just two of many that could (and should) be run to evaluate a model as powerful and multi-faceted as a frontier LLM. They offer some real evidence about where each LLM lands in the contexts we tested, but there’s plenty more work to do to establish how each performs in other settings.
We build synthetic data infrastructure for marketing machine learning. Our work focuses on encoding marketing compatibility knowledge into signed graphs. We use large language models (LLMs) to capture which combinations of brands, audiences, platforms, and geographies are expected to perform well together or conflict. This knowledge is then turned into controlled, labelled campaign performance datasets at scale. The result is a system designed to unblock model training and experimentation before real campaign data is available, giving teams precise control over the data.
For an easier-to-understand overview of this report, please see our non-technical version here.
1. Motivation
Marketing Machine Learning (ML) models require large, labelled datasets that connect campaign attributes — such as brand, audience, platform, and geography — to measurable performance outcomes. Real data, while the ultimate source of ground truth, is often sparse, constrained by the distributions of what has already been observed, and requires significant effort to bring into an AI-ready format.
Synthetic data serves as a powerful accelerator in the meantime, offering precise control over distributions of performance labels, dataset sizes, noise levels, and coverage across multiple scenarios and marketing domains. This flexibility makes it possible to train more robust models and iterate faster, without being dependent on the availability of real campaign data.
2. Problem Definition
Random attribute combinations are not useful unless they reflect the real world, so the challenge is to generate synthetic data that is grounded in marketing knowledge. For example, a dataset that pairs a luxury brand with a budget-sensitive audience and labels it as high-performing will actively mislead any model trained on it. So any dataset generator must be built on a structured and reliable source of marketing knowledge.
Building synthetic data therefore requires two things to exist first:
Marketing knowledge encoding: there is no readily available structured representation of how marketing attributes interact with one another. We need a way to capture which combinations of brands, platforms, audiences, and locations are expected to perform well together and which are likely to conflict across the full combinatorial space a model needs to learn from.
Synthetic dataset generation: given a reliable source of marketing performance knowledge, we need a mechanism to generate diverse, valid campaign configurations, each combining multiple attribute values, and assign them realistic performance labels at scale.
This project addresses both by implementing a service capable of leveraging large language models to encode marketing compatibility knowledge into a structured format, and a second service that consumes that source to generate realistic, labelled campaign performance datasets at scale.
2.1 Proposed Solution
Signed graph as marketing knowledge source
The structured format chosen to represent marketing compatibility knowledge is a signed graph (Figure 1). Each node represents a specific marketing attribute value (a brand, a platform, an audience interest, or a location), and each edge carries a sign encoding the nature of their relationship. A positive edge means the two marketing attributes are expected to work well together in a campaign; a negative edge means they are likely to conflict. The result is a web of marketing compatibility knowledge that grows richer as more attributes are added.
Graph example
This structure has been chosen because it can naturally capture the many-to-many nature of marketing relationships. It is also easy to extend: adding a new brand or platform simply means adding new nodes and connections. Finally, the graph is interpretable, meaning it can be visualised, inspected, and validated by domain experts, making it easier to audit and refine.
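As an illustration of the data structure, a signed graph can be held as a node-to-modality map plus a map from unordered node pairs to a sign. This is a minimal sketch, not the project's implementation; the class, its method names and the example attribute values (including the made-up brand "LuxuryWatchCo") are hypothetical.

```python
from collections import defaultdict


class SignedGraph:
    """Undirected graph whose edges carry a sign: +1 (compatible) or -1 (conflict)."""

    def __init__(self):
        self.modality = {}                  # node name -> modality (brand, audience, ...)
        self.signs = {}                     # frozenset({a, b}) -> +1 or -1
        self.neighbours = defaultdict(set)  # node name -> adjacent node names

    def add_node(self, name, modality):
        self.modality[name] = modality

    def add_edge(self, a, b, sign):
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1")
        if a == b or a not in self.modality or b not in self.modality:
            raise ValueError("both endpoints must be distinct, known nodes")
        self.signs[frozenset((a, b))] = sign
        self.neighbours[a].add(b)
        self.neighbours[b].add(a)

    def sign(self, a, b):
        """Return +1 or -1 for an edge, or 0 when no relationship is recorded."""
        return self.signs.get(frozenset((a, b)), 0)


graph = SignedGraph()
graph.add_node("LuxuryWatchCo", "brand")
graph.add_node("budget shoppers", "audience")
graph.add_node("luxury travellers", "audience")
graph.add_edge("LuxuryWatchCo", "luxury travellers", +1)
graph.add_edge("LuxuryWatchCo", "budget shoppers", -1)

print(graph.sign("budget shoppers", "LuxuryWatchCo"))      # edge lookup is order-independent
print(graph.sign("budget shoppers", "luxury travellers"))  # no edge recorded
```

Extending the graph with a new brand or platform is then just another `add_node` call followed by the edges that mention it.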
Synthetic dataset generation
The second challenge is converting this marketing knowledge graph into synthetic data that a model can actually train on. Generating data from a graph is non-trivial: the combinatorial space of possible attribute combinations is too large to enumerate exhaustively, and random sampling would not yield the controlled, structured scenarios or label distributions required for effective model development.
The proposed dataset generator addresses this challenge by providing a configurable service that, given a graph as input, samples attribute combinations subject to defined structural constraints, assigns performance labels grounded in the graph’s compatibility signals, and produces versioned, labelled datasets at scale for model training.
3. The Computational Bottleneck in Pairwise Graph Generation
The most straightforward approach to building a signed graph is exhaustive pairwise classification: prompting an LLM for every possible pair of attribute values and recording the result as an edge. The challenge lies in how quickly the number of pairs grows.
For a graph covering 50 brands, 100 audience interests, 10 platforms, and 20 geographic locations, exhaustive pairwise classification already requires approximately 14,500 LLM calls. This number grows quadratically with the size of the attribute space: at 200 brands, 200 interests, 20 platforms, and 50 locations, the call count exceeds 100,000.
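The growth is easy to check numerically. The exact figure depends on which pairs count as valid, which the text does not spell out, so the sketch below computes two plausible conventions: every unordered pair of attribute values, and only pairs whose members come from different modalities. Neither convention is confirmed by the text, and batching changes how pairs map onto calls, so treat the outputs as order-of-magnitude checks rather than a reproduction of the quoted figures.

```python
from itertools import combinations
from math import comb


def pair_counts(sizes):
    """Return (all unordered pairs, cross-modality pairs) for a catalogue.

    `sizes` maps each modality to its number of attribute values.
    """
    total_attributes = sum(sizes.values())
    all_pairs = comb(total_attributes, 2)
    cross_modality = sum(a * b for a, b in combinations(sizes.values(), 2))
    return all_pairs, cross_modality


small = {"brand": 50, "interest": 100, "platform": 10, "location": 20}
large = {"brand": 200, "interest": 200, "platform": 20, "location": 50}

print(pair_counts(small))  # (16110, 9700)
print(pair_counts(large))  # (110215, 69000)
```

Under either convention the count grows roughly with the square of the catalogue size, which is the point of this section.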
This makes exhaustive classification too expensive for routine use and too slow to support iterative development. It also raises a central challenge: if exhaustive evaluation is infeasible, how should the pairs to classify be selected? Addressing this limitation became a central Q1 technical goal: to identify a graph generation strategy that preserves the quality of LLM-based edge classification while remaining scalable.
4. Graph Generation Strategies Evaluated
4.1 Implementation A – Batched Brute-force Pair Classification
In the brute-force approach, all possible combinations of attributes are generated, and the LLM is asked to classify the expected performance relationship for each pair. This strategy provides broad coverage because every pair is evaluated. However, it introduces two important issues.
First, the prompt implicitly frames each pair as something worth relating, which can introduce framing bias. As a result, pairs that likely have no meaningful relationship may still receive a weak or speculative classification, making the generated graph overly dense.
Second, research has shown that constrained output formats, such as +, -, or 0 labels for performance, systematically suppress LLMs’ natural reasoning processes and force premature commitment to answers (Tam et al., 2024; Yu et al., 2025).
To reduce the number of LLM calls, batch prompting was introduced so that multiple attribute pairs could be classified in a single request. While this improves efficiency, it may also amplify the second issue.
Potential issues and challenges:
Overly dense graphs with many spurious or low-confidence edges.
Computational cost explosion due to quadratic scaling of LLM calls.
Increased hallucinations from batch classifications and structured output.
4.2 Implementation B – Cluster-Based Pre-filtering
This implementation uses a hierarchical strategy to reduce LLM scoring calls without sacrificing edge quality. Attributes are embedded and clustered per modality using HDBSCAN (with optional UMAP reduction for larger datasets), then an LLM scores only cross-modality cluster pairs rather than all attribute pairs. Cluster pairs falling within a neutral band are discarded, and attribute-level edges are considered only from surviving cluster combinations, reducing the evaluation space from O(n²) attribute pairs to O(k²) cluster pairs, where k is smaller than n.
Potential issues and challenges:
Sensitivity to embedding and clustering quality.
The threshold defining the neutral band is a manual parameter. Set too aggressively, it discards semantically meaningful cluster pairs; set too loosely, it recovers the density problem of Implementation A.
Increased hallucinations from batch classifications and structured output.
5. System Overview
The pipeline consists of two services operating in sequence, which will be integrated into a single service in the future. The Graph Generator Service takes a versioned list of marketing attributes and produces a signed graph encoding compatibility relationships between them. The Synthetic Dataset Generator then consumes that graph to produce labelled campaign performance datasets ready for model training.
5.1 Graph Generator Service
The edges generator receives a versioned list of attributes as input and produces a signed edge file in CSV format, where each row represents a relationship between two attributes and an inferred performance signal.
The output schema includes the following fields:
source: the origin attribute in the relationship
target: the related attribute connected to the source
rel: the relationship type, typically indicating the modalities involved in the connection (for example, audience_to_brand)
sign: the inferred polarity of the relationship, where + indicates a positive association and - indicates a negative association.
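A small round-trip through this schema, using Python's standard `csv` module, might look as follows. The column names come from the schema above; the example rows, the `load_edges` helper and its validation rule are illustrative assumptions rather than the service's actual code.

```python
import csv
import io

FIELDS = ["source", "target", "rel", "sign"]

# Hypothetical example rows; the attribute values are made up.
rows = [
    {"source": "budget shoppers", "target": "LuxuryWatchCo",
     "rel": "audience_to_brand", "sign": "-"},
    {"source": "luxury travellers", "target": "LuxuryWatchCo",
     "rel": "audience_to_brand", "sign": "+"},
]

# Write the edge file (an in-memory buffer stands in for a file on disk).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()


def load_edges(text):
    """Parse a signed edge CSV and reject rows whose sign is not + or -."""
    edges = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["sign"] not in {"+", "-"}:
            raise ValueError(f"unexpected sign {row['sign']!r} for {row['source']!r}")
        edges.append(row)
    return edges


edges = load_edges(csv_text)
print(csv_text)
print(len(edges), "edges loaded")
```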
General Configuration
Both implementations share a fixed set of parameters to ensure that observed differences in graph quality are attributable to the generation strategy rather than configuration.
| Parameter | Value |
| --- | --- |
| LLM model | gemini-3.1-flash-lite-preview |
| Embedding model (when applied) | gemini-embedding-2-preview |
| Temperature | 0.15 |
| Max attributes per batch call | 25 |
| Concurrent LLM workers | 7 |
Execution configuration
Temperature is set to 0.15 (near-deterministic but not fully greedy) to reduce output variance across runs while preserving enough flexibility for the model to handle pairings without defaulting mechanically to the same answer.
The batch size of 25 attributes per call balances context window efficiency against classification degradation: larger batches for quality classification tasks reduce LLM calls but increase the risk of the model losing focus or producing inconsistent signs across a long list of targets (Van Can et al., 2025).
5.1.1 Brute force approach pipeline
The brute-force pipeline enumerated all valid cross-modality attribute pairs from the catalogue and submitted each to gemini-3.1-flash-lite-preview for sign classification. To reduce the total number of API calls, pairs were grouped into batches rather than submitted individually. Each pair was assigned one of three labels: positive (+), negative (-), or neutral (0). Neutral pairs were discarded from the output to simulate real data sparsity.
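The enumerate, batch, classify, and drop-neutral flow can be sketched as below. The LLM call is replaced by a hypothetical lookup (`classify_batch`), the catalogue is a toy, and for simplicity the batch size counts pairs rather than attributes, so this illustrates the control flow only and is not the pipeline's actual code.

```python
from itertools import combinations, product

BATCH_SIZE = 25  # value from the configuration table; here it counts pairs (a simplification)


def cross_modality_pairs(catalogue):
    """Yield (modality_a, attr_a, modality_b, attr_b) for every cross-modality pair."""
    for (mod_a, attrs_a), (mod_b, attrs_b) in combinations(catalogue.items(), 2):
        for attr_a, attr_b in product(attrs_a, attrs_b):
            yield mod_a, attr_a, mod_b, attr_b


def batched(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical stand-in for the LLM: a fixed lookup, neutral ("0") by default.
TOY_LABELS = {
    ("budget shoppers", "LuxuryWatchCo"): "-",
    ("luxury travellers", "LuxuryWatchCo"): "+",
}


def classify_batch(batch):
    return [TOY_LABELS.get((a, b), "0") for _, a, _, b in batch]


def build_edges(catalogue, batch_size=BATCH_SIZE):
    edges = []
    for batch in batched(cross_modality_pairs(catalogue), batch_size):
        for (mod_a, a, mod_b, b), sign in zip(batch, classify_batch(batch)):
            if sign != "0":  # neutral pairs are dropped from the output
                edges.append({"source": a, "target": b,
                              "rel": f"{mod_a}_to_{mod_b}", "sign": sign})
    return edges


catalogue = {
    "audience": ["budget shoppers", "luxury travellers"],
    "brand": ["LuxuryWatchCo"],
    "platform": ["ShortVideoApp"],
}
for edge in build_edges(catalogue):
    print(edge)
```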
Prompting strategy
The attribute-level classification prompt defines labels in terms of real campaign behaviour. A positive label (+) requires an obvious and natural strategic fit, strong audience overlap, clear brand alignment, and a combination that demonstrably drives higher ROI. A negative label (-) applies when the combination creates friction, wastes ad spend, targets conflicting demographics, or causes brand dissonance. Neutral (0) covers pairs with no meaningful performance signal, standard baseline performance, or combinations that are simply unrelated.
To reduce speculative classifications, the prompt includes an explicit evaluation mindset guardrail: the model is instructed not to invent creative edge cases to justify a pairing, to treat any combination requiring unconventional reasoning as a mismatch, and to default to neutral when uncertain. This guardrail is critical in a batch classification setting, where the model might otherwise be biased toward generating signal on every pair it evaluates. Full prompt templates are provided in Appendix A.
Brute-force flow diagram
5.1.2 Cluster-based approach pipeline
The clustering pipeline operated on the same attribute catalogue and followed a four-stage process.
Embedding: attributes were embedded using the latest Gemini embedding model (gemini-embedding-2-preview).
Per-modality clustering: attributes were clustered independently per modality using HDBSCAN, preceded by UMAP dimensionality reduction for modalities with 20 or more attributes, and PCA as a fallback for smaller groups above the dimensionality threshold. Noise points were reassigned to the nearest cluster centroid and flagged to be handled later in the attribute-level expansion.
Cluster pairs batch scoring: each cluster was labelled by the LLM and cross-modality cluster pairs were scored on a 0–10 compatibility scale. Pairs scoring above 7.0 were classified as positive, below 4.0 as negative, and pairs in the neutral band were discarded, reducing the attribute-pair evaluation space from O(n²) to the subset contained within surviving cluster combinations.
Attribute-level edge expansion: for each surviving cluster pair, attribute-level edges were inferred through batched LLM classification. A shortcut was applied to reduce LLM calls: cluster pairs with extreme scores (above 9.0 or below 2.0) had their attribute-level edges auto-assigned directly from the cluster polarity, bypassing LLM classification entirely. However, noise-flagged attributes were excluded from this shortcut and routed back to LLM calls. This avoids propagating unreliable assignments.
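The filtering and routing logic in the last two stages can be sketched as two small functions. The positive/negative cut-offs use the inclusive bounds from the configuration table; the auto-inherit cut-offs are left as parameters because the prose and the configuration table quote different values (9.0/2.0 and 8.0/3.0), and the defaults below follow the table purely as an assumption. Function names and example attributes are hypothetical.

```python
def cluster_polarity(score, positive=7.0, negative=4.0):
    """Map a 0-10 cluster-pair score to '+', '-', or None (neutral band, discarded)."""
    if score >= positive:
        return "+"
    if score <= negative:
        return "-"
    return None


def route_cluster_pair(score, attrs_a, attrs_b, noise, auto_pos=8.0, auto_neg=3.0):
    """Split a cluster pair's attribute pairs into auto-assigned edges and LLM work.

    Returns (auto_edges, llm_pairs). Noise-flagged attributes never take the
    shortcut, even when the cluster score is extreme.
    """
    polarity = cluster_polarity(score)
    auto_edges, llm_pairs = [], []
    if polarity is None:
        return auto_edges, llm_pairs  # neutral cluster pair: nothing survives
    extreme = score >= auto_pos or score <= auto_neg
    for a in attrs_a:
        for b in attrs_b:
            if extreme and a not in noise and b not in noise:
                auto_edges.append((a, b, polarity))  # inherit the cluster polarity
            else:
                llm_pairs.append((a, b))             # classify with the LLM
    return auto_edges, llm_pairs


auto, todo = route_cluster_pair(
    score=8.5,
    attrs_a=["luxury travellers", "yacht owners"],
    attrs_b=["LuxuryWatchCo"],
    noise={"yacht owners"},
)
print(auto)
print(todo)
```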
Prompting strategy
Implementation B uses four distinct prompt types. An enrichment prompt asks the LLM for a differentiating description of each attribute, written to help separate it from its nearest neighbours during clustering. A cluster labelling prompt assigns each cluster a short label and summary, used as input to the next step. A cluster compatibility prompt scores cross-modality cluster pairs on a 0–10 numeric scale, enabling the neutral band filter and extreme score shortcuts that reduce downstream LLM calls.
The attribute-level classification prompt is shared with Implementation A and uses the same guardrails. Sharing prompts across both implementations keeps the final edge classification consistent, making the two approaches as comparable as possible given the non-deterministic nature of LLM outputs. Full prompt templates are provided in Appendix B.
Configuration
| Parameter | Value |
|---|---|
| Clustering algorithm | HDBSCAN (Euclidean) |
| Dimensionality reduction | UMAP (≥ 20 attributes), PCA (< 20) |
| UMAP output dimensions | 15 |
| HDBSCAN min cluster size | 2 |
| UMAP distance metric | Cosine |
| Min attributes to cluster | 15 |
| Embedding normalisation | L2 |
| Positive cluster threshold | ≥ 7.0 |
| Negative cluster threshold | ≤ 4.0 |
| Neutral band (discarded) | 4.0 – 7.0 |
| Auto-inherit positive threshold* | ≥ 8.0 |
| Auto-inherit negative threshold* | ≤ 3.0 |
Parameters
*The auto-inherit threshold must be tuned according to how well the attribute catalogue clusters. When clustering quality is high and cluster boundaries are semantically meaningful, a looser threshold allows more attributes to inherit their cluster-level scores with confidence. When clustering is poor or produces noisy assignments, a stricter threshold is necessary to avoid propagating unreliable scores down to the attribute level.
Cluster-based implementation flow diagram
5.2 Synthetic Dataset Generator
The synthetic dataset generator is an API-driven service that creates synthetic marketing campaign records from a signed graph and a set of configurable sampling constraints. It is designed to support multiple data science teams and use cases by allowing users to generate tailored datasets from a shared graph input while keeping the process reproducible through versioned inputs and outputs.
Operational Architecture
The generator runs as a Cloud Run Job triggered asynchronously by a dedicated API service, also deployed on Cloud Run. The API handles request validation, stores versioned inputs and configurations in Cloud Storage Buckets, and exposes endpoints to monitor job status and retrieve results.
Separating request handling from sampling execution means long-running jobs do not block the API layer, and both services scale independently. All inputs, configurations, and outputs are versioned and persisted in the cloud, making every run fully reproducible.
Input
The service takes as input a versioned edges.csv file and a generation configuration defining attribute-type ranges, sample counts per label, and target positive and negative edge-fraction bounds.
Sampling Algorithm
The core of the generator is a parallelised simulated annealing sampler. Given a signed graph and a set of constraints as input, the sampler starts from a random subgraph and iteratively proposes node swaps, accepting or rejecting each swap based on how well the resulting subgraph satisfies the hard constraints. Worse solutions can be accepted with a probability that decreases over time, controlled by a temperature schedule. This allows the search to escape local optima early in the run and converge toward valid solutions as temperature drops.
Each sample is generated independently with its own random seed, and all samples within a label class are produced in parallel across CPU workers, so generation time stays roughly constant as the number of requested samples grows, as long as enough workers are available.
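The sampling loop described above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the service's code: it uses a single node type, one swap proposal per step, a geometric cooling schedule (the report does not specify the schedule's form), and treats the edge-fraction bounds as the only constraint. The function names are hypothetical; `max_iters`, `start_temp`, `end_temp` and `patience` follow the parameter names used in this report.

```python
import math
import random
from itertools import combinations


def range_gap(value, bounds):
    """Distance from value to the nearest bound of [lo, hi]; 0.0 if inside."""
    lo, hi = bounds
    return max(lo - value, 0.0, value - hi)


def cost(nodes, edges, pos_range, neg_range):
    """Total violation of the edge-fraction bounds for a candidate subgraph."""
    pairs = list(combinations(sorted(nodes), 2))
    pos = sum(edges.get(p) == "+" for p in pairs) / len(pairs)
    neg = sum(edges.get(p) == "-" for p in pairs) / len(pairs)
    return range_gap(pos, pos_range) + range_gap(neg, neg_range)


def anneal(all_nodes, edges, k, pos_range, neg_range, seed,
           max_iters=2000, start_temp=1.0, end_temp=0.01, patience=300):
    """Search for a k-node subgraph whose edge fractions fall in the bounds."""
    rng = random.Random(seed)  # one seed per sample keeps runs reproducible
    current = set(rng.sample(all_nodes, k))
    current_cost = cost(current, edges, pos_range, neg_range)
    best, best_cost, stale = set(current), current_cost, 0
    for i in range(max_iters):
        if best_cost == 0.0 or stale >= patience:
            break  # constraints met, or no improvement for `patience` steps
        # Geometric cooling from start_temp down to end_temp (assumed form).
        temp = start_temp * (end_temp / start_temp) ** (i / max(max_iters - 1, 1))
        # Propose a node swap: one node leaves, one outside node enters.
        out_node = rng.choice(sorted(current))
        in_node = rng.choice([n for n in all_nodes if n not in current])
        candidate = (current - {out_node}) | {in_node}
        cand_cost = cost(candidate, edges, pos_range, neg_range)
        delta = cand_cost - current_cost
        # Always accept improvements; accept worse moves with prob. exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_cost = candidate, cand_cost
        if current_cost < best_cost:
            best, best_cost, stale = set(current), current_cost, 0
        else:
            stale += 1
    return best, best_cost
```

A returned cost of `0.0` means the subgraph satisfies both bounds; a positive cost corresponds to the failure case in which the constraints could not be met within the iteration budget.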
Configuration Parameters
Sampling parameters are divided into hard and soft constraints. Hard constraints are strictly enforced: if they are too strict or mutually incompatible, generation fails. Soft constraints are optimisation objectives: the sampler maximises or minimises them during the search but does not block generation if they are not fully met.
The service supports the following parameters:
1. Type ranges (hard constraint): define how many nodes of each attribute type must appear in each sample.
For example, setting geo location to [1, 2] and audience to [2, 4] means every generated record will contain one or two geos and between two and four audience attributes.
2. Label fraction ranges (hard constraint): define the distribution of the samples across different performance labels.
In a real marketing dataset, campaign performance metrics are classified into discrete labels or classes such as positive, average, and negative. The synthetic generator mirrors this structure: each label is assigned its own target range for positive and negative edge fractions, which the sampler must satisfy when constructing each record.
Two parameters define the edge composition of each class:
pos_frac_range: the acceptable interval for the fraction of positive edges in the sampled subgraph.
neg_frac_range: the acceptable interval for the fraction of negative edges in the sampled subgraph.
A third parameter, num_samples, sets how many records to generate for that class, which directly controls class balance in the final dataset.
For example, a positive class configured with pos_frac_range: [0.2, 1.0] and neg_frac_range: [0.0, 0.3] will only accept subgraphs where at least 20% of edges are positive and no more than 30% are negative. A negative class does the opposite, requiring low positive and high negative fractions to represent structurally conflicting combinations. An average class targets overlapping mid-range fractions for both, capturing ambiguous or mixed-signal configurations.
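The check implied by the positive-class example can be sketched as follows. The helper names are hypothetical, the `num_samples` value is illustrative, and the denominator (all node pairs in the sampled subgraph) is an assumption made for this sketch; the service may count only cross-type pairs.

```python
from itertools import combinations

# Illustrative configuration: only the two ranges come from the example above.
POSITIVE_CLASS = {
    "pos_frac_range": (0.2, 1.0),
    "neg_frac_range": (0.0, 0.3),
    "num_samples": 100,  # hypothetical value
}


def edge_fractions(nodes, edges):
    """Return (pos_frac, neg_frac) over all node pairs of the subgraph."""
    pairs = list(combinations(sorted(nodes), 2))
    pos = sum(edges.get(p) == "+" for p in pairs) / len(pairs)
    neg = sum(edges.get(p) == "-" for p in pairs) / len(pairs)
    return pos, neg


def satisfies_label(nodes, edges, params):
    """True when both edge fractions fall inside the configured ranges."""
    pos, neg = edge_fractions(nodes, edges)
    lo_p, hi_p = params["pos_frac_range"]
    lo_n, hi_n = params["neg_frac_range"]
    return lo_p <= pos <= hi_p and lo_n <= neg <= hi_n


# Four nodes give six pairs; edge keys are alphabetically ordered tuples.
edges = {("a", "b"): "+", ("a", "c"): "+", ("c", "d"): "-"}
accepted = satisfies_label(["a", "b", "c", "d"], edges, POSITIVE_CLASS)
```

Here two of six pairs are positive and one is negative, so the subgraph is accepted for the positive class.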
3. High and low-priority type pairs (soft constraint): rules to favour or discourage specific samples.
High-priority type pairs instruct the sampler to favour subgraphs where specific modality combinations, such as brand-to-audience or platform-to-geo, are well connected. This is useful when a downstream model needs to learn from dense signals between particular attribute types.
Low-priority type pairs do the opposite by discouraging connections between selected modalities, allowing the generator to simulate scenarios where certain attribute combinations are structurally sparse or irrelevant.
4. Simulated annealing sampler parameters: control the behaviour of the sampler algorithm itself.
max_iters sets the total iteration budget per sample, trading off generation speed against solution quality
start_temp and end_temp define the annealing temperature schedule, controlling how aggressively the algorithm explores versus exploits candidate subgraphs early and late in the search
patience sets how many iterations without improvement trigger an early stop, avoiding wasted compute when a good solution has already been found
proposals_per_step controls how many node swap candidates are evaluated at each iteration, allowing the search to cover more of the graph space per step
Taken together, these controls mean the same generator and the same signed graph can produce datasets with very different statistical profiles, varying class balance, edge density, modality coverage, and noise levels. Multiple different datasets can simply be built by changing the configuration, without rebuilding any part of the pipeline.
Output and Evaluation
After sampling, the service evaluates the generated samples against the requested edge-fraction targets. The current evaluation includes:
average positive and negative edge distribution
average deviation from target ranges
in-range rate, i.e. the proportion of generated samples that satisfy the requested bounds
The final output is written as a versioned JSONL file, where each line represents one synthetic row of a marketing campaign performance dataset. Each record contains the sampled attributes grouped by type and a label field indicating the intended performance class.
6. Methodology
6.1 Signed Graph Generation
The evaluation of the graph generation pipeline is organized around four complementary angles: computational efficiency, sign consistency, controlled correctness, and dataset generation feasibility.
Computational efficiency
The primary comparison between Implementation A and Implementation B is conducted on the same versioned attribute catalogue. The following metrics are recorded for each run:
total LLM calls made
total edges generated
LLM calls per final edge
total execution time
sign distribution of the output graph (positive, negative, neutral)
This measures how much of the quadratic cost is recovered by the cluster pre-filter strategy without sacrificing edge coverage.
Sign consistency
For all attribute pairs that appear in both implementations’ outputs, the sign assigned by each approach is compared directly.
The agreement rate across overlapping pairs serves as a proxy for signal reliability: if two independent generation strategies (one exhaustive, one cluster-filtered) converge on the same sign for a pair, that convergence is evidence the edge reflects a real compatibility signal rather than LLM noise. Disagreements are analysed by modality pairs to identify which attribute combinations produce the most inconsistent classifications.
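A minimal sketch of this comparison is shown below. It assumes each graph is held as a dictionary mapping an attribute pair to its sign and that a lookup from attribute to modality is available; the function name and data layout are illustrative, not the pipeline's actual interface.

```python
from collections import Counter


def compare_signs(edges_a, edges_b, modality_of):
    """Compare two signed graphs given as {(attr_1, attr_2): sign} dicts."""
    common = edges_a.keys() & edges_b.keys()
    mismatches = [p for p in common if edges_a[p] != edges_b[p]]
    # Group disagreements by the (unordered) pair of modalities involved.
    by_modality = Counter(
        tuple(sorted((modality_of[p[0]], modality_of[p[1]]))) for p in mismatches
    )
    return {
        "common": len(common),
        "agreement_rate": 1 - len(mismatches) / len(common) if common else None,
        "missing_from_b": len(edges_a) - len(common),
        "unique_to_b": len(edges_b) - len(common),
        "mismatches_by_modality": by_modality,
    }
```

Both dictionaries must use the same key ordering for a pair (for example, alphabetical), otherwise overlapping pairs would not be detected.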
Controlled correctness check
A small auxiliary attribute catalogue is constructed manually, containing three categories of pairs:
Obviously positive: pairings with clear and unambiguous marketing alignment (e.g. Nike + Sports Enthusiasts, Sephora + Beauty & Personal Care)
Obviously negative: pairings with clear brand or audience conflict (e.g. Luxury brand + Budget shoppers)
Ambiguous: pairings with no strong prior, where reasonable disagreement is expected
Graph density and sampling feasibility
A graph that is semantically correct but structurally too sparse cannot support constrained dataset generation. This angle evaluates, by running a diagnostic script, whether the generated graph is dense enough to serve as a valid input to the synthetic dataset generator under realistic label configurations.
6.2 Synthetic Dataset Generation
Unlike the graph generator, the synthetic dataset generator is not itself the subject of a controlled experiment. Its correctness is structural: given a valid signed graph and a feasible set of constraints, it either produces samples that satisfy the requested edge-fraction bounds or it does not. The meaningful question is therefore not whether the generator works, but what it made possible — and whether the datasets it produced were usable enough to drive real modelling work.
Diagnosing the generated synthetic dataset
Per-label sample quality was assessed using three aggregate metrics:
Average edge distribution: the mean positive and negative edge fractions across generated samples;
Average deviation: the mean distance between the observed edge fractions and the nearest boundary of the requested target ranges;
In-range rate: the proportion of generated samples whose positive and negative edge fractions simultaneously fall within the configured bounds.
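The three metrics can be computed as in the sketch below. It assumes each generated sample has already been reduced to a `(pos_frac, neg_frac)` tuple, and that a sample inside its target range contributes a deviation of zero; the function names are hypothetical.

```python
from statistics import mean


def range_gap(value, bounds):
    """Distance from value to the nearest bound of [lo, hi]; 0.0 if inside."""
    lo, hi = bounds
    return max(lo - value, 0.0, value - hi)


def summarise_label(samples, pos_range, neg_range):
    """samples: list of (pos_frac, neg_frac) tuples for one label class."""
    pos_gaps = [range_gap(p, pos_range) for p, _ in samples]
    neg_gaps = [range_gap(n, neg_range) for _, n in samples]
    in_range = sum(pg == 0.0 and ng == 0.0 for pg, ng in zip(pos_gaps, neg_gaps))
    return {
        "avg_pos_frac": mean(p for p, _ in samples),
        "avg_neg_frac": mean(n for _, n in samples),
        "avg_pos_deviation": mean(pos_gaps),
        "avg_neg_deviation": mean(neg_gaps),
        "in_range_rate": in_range / len(samples),
    }
```

Reporting the positive and negative deviations separately avoids committing to a particular way of combining them, which this report does not specify.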
Together, these measurements characterise whether the sampler is producing structurally valid datasets for a given graph and parameterisation, and serve as the primary signal for deciding whether a generation run is fit for downstream use.
7. Results — Graph Generator
7.1 Small Catalogue
Brute force approach
The brute force run against the 27-node small catalogue (10 brands, 10 audiences, 4 countries and 3 platforms) produced 191 valid edges from 252 possible cross-modality pairs in 54.1 seconds across 54 LLM calls.
| LLM signal | Distribution |
|---|---|
| Positive edges | 147 (77% of retained edges) |
| Negative edges | 44 (23% of retained edges) |
| Neutral / filtered | 61 (24% of all 252 pairs) |
Edges distribution
Graph quality
Inspecting the output confirms the graph encodes meaningful marketing compatibility signals. A few representative examples across edge types:
Obvious positives — clear strategic fit correctly assigned +:
Athletes → Nike (audience → brand)
Winter Sports Enthusiasts → The North Face (audience → brand)
Obvious negatives — clear conflict or mismatch correctly assigned -:
Budget Shoppers → Rolex (audience → brand)
K-Pop Fans → Salesforce (audience → brand)
Winter Sports Enthusiasts → Brazil (audience → geo)
Rolex → Wish (brand → platform)
Salesforce → TikTok (brand → platform)
Neutral / filtered — average or no meaningful signal correctly discarded:
Women 18-34 → all geos (audience → geo)
Fitness Enthusiasts → Dior (audience → brand)
Beauty Enthusiasts → Red Bull (audience → brand)
Dior → TikTok (brand → platform)
Cluster-based approach
The cluster-based run completed in under 20 seconds with approximately 40 LLM calls — a 26% reduction versus brute force — while producing a graph of comparable negative-edge quality.
| Metric | Brute Force | Cluster-Based |
|---|---|---|
| LLM calls | 54 | 40 |
| Positive edges | 147 | 125 |
| Negative edges | 44 | 41 |
| Total edges | 191 | 166 |
LLM Calls versus generated edges
Clusters found in this experiment:
Luxury Fashion Houses → Dior, Chanel
Global Lifestyle Brands → The North Face, Havaianas, Vans, Nike, Salesforce (outlier not correctly assigned by HDBSCAN)
Small catalogue constraints
The 26% reduction in LLM calls is modest and reflects a fundamental limitation of applying the cluster-based approach to small catalogues. With few attributes per modality, the embedding space is too sparse for HDBSCAN to form meaningful dense regions, producing singleton or near-singleton clusters that compress the attribute space very little. As a result, most savings come from the neutral band filter rather than cluster compression.
With only 10 attributes per modality, the catalogue sits at the lower boundary of what density-based clustering can reliably resolve. At this scale the geometric structure of the embedding space is fragile, so cluster assignments are more likely to be driven by embedding noise than by genuine semantic similarity, as the placement of Salesforce among the lifestyle brands illustrates.
In the limiting case where all 10 attributes of a modality end up as singleton clusters, the cluster compatibility scoring step evaluates the same number of pairs as brute-force classification would.
7.2 Evaluating LLM-Efficiency for a Larger Catalogue
Experiment setup
The large catalogue experiment ran both implementations against a versioned catalogue of 160 attributes: 60 brands, 60 audiences, 10 platforms, and 30 countries. This scale produces a cross-modality pair space of 8,700 possible pairs, making brute-force classification substantially more expensive than in the small catalogue case and providing a more meaningful surface for evaluating cluster-based compression.
Both runs used identical configuration: gemini-3.1-flash-lite-preview at temperature 0.15, batches of 25 attributes per LLM call, and 7 concurrent workers.
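The size of the pair space, and one counting rule for brute-force calls that is consistent with the totals reported in this report, can be checked with the sketch below. The counting rule is an assumption: it supposes one call per primary attribute per batch of up to 25 attributes from the other modality, with modalities acting as primary in the order brand, audience, geo, platform. The function names are hypothetical.

```python
from itertools import combinations
from math import ceil


def cross_modality_pairs(sizes):
    """Number of attribute pairs whose two members come from different modalities."""
    return sum(a * b for a, b in combinations(sizes.values(), 2))


def brute_force_calls(sizes, primary_order, batch_size=25):
    """Assumed rule: n_primary * ceil(n_other / batch_size) per modality pair."""
    total = 0
    for i, primary in enumerate(primary_order):
        for other in primary_order[i + 1:]:
            total += sizes[primary] * ceil(sizes[other] / batch_size)
    return total


ORDER = ["brand", "audience", "geo", "platform"]  # assumed primary ordering
small = {"brand": 10, "audience": 10, "geo": 4, "platform": 3}
large = {"brand": 60, "audience": 60, "geo": 30, "platform": 10}

print(cross_modality_pairs(small), brute_force_calls(small, ORDER))  # 252 54
print(cross_modality_pairs(large), brute_force_calls(large, ORDER))  # 8700 570
```

Under this rule the call count grows with the product of the two modality sizes divided by the batch size, which is the source of the quadratic cost of the brute-force approach.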
Computational Efficiency
| Metric | Brute-force | Cluster-Based | Δ |
|---|---|---|---|
| Batch size (attributes per LLM call) | 25 | 25 | n/a |
| Total LLM calls | 570 | 265 | −53% |
| Total execution time | 553 s | 274 s | −50.5% |
| Total edges generated | 5,131 | 4,649 | −9.4% |
| Edges per LLM call | ~9.00 | ~17.5 | +95% |
| Positive edges | 4,372 | 4,092 | n/a |
| Negative edges | 759 | 557 | n/a |
LLM Calls versus generated edges
The cluster-based approach cut both LLM call volume and wall-clock time by approximately half while retaining 90.6% of total output edge volume. Edges per LLM call rose from about 9.0 (5,131 / 570) to about 17.5 (4,649 / 265), which directly reflects the mechanism at work: by routing calls only toward attribute pairs that survive the cluster-pair pre-filter, each batch operates in a denser region of the relevance space rather than sweeping uniformly across the full pair surface.
Agreement metrics between the cluster-based vs brute-force graph
| Metric | Value |
|---|---|
| Common edges (present in both outputs) | 3,476 of 5,131 (67.7% of brute-force) |
| Sign agreement on common edges | 3,406 of 3,476 (98.0%) |
| Sign mismatches | 70 |
| Edges missing from cluster-based output | 1,655 |
| Edges unique to cluster-based output | 1,173 |
| Edge recall vs. brute-force | 67.7% |
| Sign precision on recovered edges | 98.0% |
Agreement metrics between the cluster-based vs brute-force graph
Where both approaches generate an edge for the same attribute pair, they agree on its polarity in 98.0% of cases. This is the most important quality signal in the comparison: it confirms that the cluster-based approach does not systematically distort relationship sign. The gap between the two outputs is almost entirely a coverage gap and not a correctness gap. The 32.3% of brute-force edges absent from the cluster-based output (1,655 pairs) represent pairs that the pre-filter chose not to route to the LLM, not pairs that were scored incorrectly.
The 1,173 edges unique to the cluster-based graph are pairs that the cluster approach evaluated and retained, but which brute-force scored as neutral and discarded. This represents a mild over-generation effect: the cluster-pair pre-scoring occasionally surfaces regions of the pair space that brute-force ultimately considers uninformative.
Catalogue scaling projection for even larger catalogues
Brute-force LLM call volume scales as O(N²): for each pair of modalities, the number of calls grows with the product of the two modality sizes.
So doubling the catalogue quadruples the cost. The cluster-based approach breaks this curve: by replacing exhaustive pair evaluation with a cluster-pair pre-scoring step, call growth tracks closer to O(N · √N), since cluster count grows sub-linearly relative to attribute count.
Using these as reference points, the projected call volumes for larger catalogues are:
| Catalogue Size | Est. total pairs | BF est. LLM calls | CB est. LLM calls | Est. savings |
|---|---|---|---|---|
| 160 attributes (current) | 8,700 | 570 | 265 | 53% |
| 320 attributes (2×) | ~34,800 | ~2,280 | ~750 | ~67% |
| 800 attributes (5×) | ~217,500 | ~14,250 | ~2,960 | ~79% |
| 1,600 attributes (10×) | ~870,000 | ~57,000 | ~8,400 | ~85% |
Projected call volumes for larger catalogues
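The projected figures follow from applying the two growth rates to the measured 160-attribute run, as the short sketch below reproduces. The function name is hypothetical, and the exponents 2 and 1.5 are the O(N²) and O(N · √N) assumptions stated above, so the output is an extrapolation rather than a measurement.

```python
def project_calls(scale, bf_ref=570, cb_ref=265):
    """Extrapolate LLM calls when the catalogue grows by `scale` times.

    bf_ref and cb_ref are the measured calls at 160 attributes.
    """
    bf = bf_ref * scale ** 2    # brute force: quadratic growth
    cb = cb_ref * scale ** 1.5  # cluster-based: assumed N * sqrt(N) growth
    return round(bf), round(cb), 1 - cb / bf


for scale in (2, 5, 10):
    bf, cb, savings = project_calls(scale)
    print(f"{160 * scale} attributes: BF {bf}, CB {cb}, savings {savings:.0%}")
```

Because the two curves differ by a factor of √scale, the projected savings keep widening as the catalogue grows, provided cluster count really does grow sub-linearly with attribute count.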
7.3 Graph Density Analysis
Before feeding the attribute graph into the dataset generator, a diagnostic pass measures the empirical distribution of edge types across randomly sampled subgraphs. This step is necessary because the synthetic dataset generator needs density constraints grounded in the actual structure of the graph rather than in assumed values.
The density evaluation randomly samples subgraphs according to the same type quotas expected by the synthetic dataset generator, but with no edge-fraction constraints, and then measures the fractions of positive, negative, and absent edges that arise from the graph’s natural structure.
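A minimal sketch of such a diagnostic is given below. It is illustrative only: the function names are hypothetical, node names are assumed to be unique across types, and fractions are computed over all node pairs of each sampled subgraph, which may differ from the denominator used by the actual script.

```python
import random
from itertools import combinations
from statistics import mean, quantiles


def sample_subgraph(nodes_by_type, quotas, rng):
    """Draw a random node set that respects the [min, max] quota of each type."""
    chosen = []
    for node_type, (lo, hi) in quotas.items():
        chosen += rng.sample(nodes_by_type[node_type], rng.randint(lo, hi))
    return chosen


def density_profile(nodes_by_type, edges, quotas, n_samples=1000, seed=0):
    """Mean and percentiles of pos/neg/absent fractions over random subgraphs."""
    rng = random.Random(seed)
    rows = {"pos_frac": [], "neg_frac": [], "absent_frac": []}
    for _ in range(n_samples):
        nodes = sample_subgraph(nodes_by_type, quotas, rng)
        pairs = list(combinations(sorted(nodes), 2))
        pos = sum(edges.get(p) == "+" for p in pairs) / len(pairs)
        neg = sum(edges.get(p) == "-" for p in pairs) / len(pairs)
        rows["pos_frac"].append(pos)
        rows["neg_frac"].append(neg)
        rows["absent_frac"].append(1.0 - pos - neg)
    summary = {}
    for name, values in rows.items():
        cuts = quantiles(values, n=100)  # cuts[k - 1] is the k-th percentile
        summary[name] = {"mean": mean(values), "p5": cuts[4], "p25": cuts[24],
                         "p50": cuts[49], "p75": cuts[74], "p95": cuts[94]}
    return summary
```

The percentiles returned for `pos_frac` and `neg_frac` are the natural starting point for choosing label bounds that the sampler can actually satisfy.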
Results over the large catalogue graph (160 nodes, 4,092 positive edges, 557 negative edges), sampling 4,000 subgraphs, are:
| Metric | Mean | p5 | p25 | p50 | p75 | p95 |
|---|---|---|---|---|---|---|
| pos_frac | 0.363 | 0.156 | 0.267 | 0.333 | 0.444 | 0.600 |
| neg_frac | 0.036 | 0.000 | 0.000 | 0.022 | 0.056 | 0.133 |
| absent_frac | 0.602 | 0.333 | 0.524 | 0.611 | 0.700 | 0.810 |
Results over the large catalogue
The distribution reflects a structurally asymmetric graph: in a typical subgraph, roughly 36% of node pairs carry a positive edge, 60% carry no edge, and only 3.6% carry a negative edge.
Based on the observed percentiles, the diagnostic suggests the following label_params for the three training label classes:
Building a reliable signed graph is non-trivial and it is precisely the problem the graph generator in this study was designed to solve. Before this service was available, two graphs were assembled: an initial one built with direct LLM assistance, and a later iteration grounded in both LLM outputs and real data. Run against both, the generator produced 49 valid datasets. This section demonstrates one representative example end-to-end.
8.1 Impact and Downstream Use
With AI-ready real campaign data only becoming available late in the quarter, and still subject to significant coverage limitations, synthetic data was the primary available input for model training and architecture evaluation. The multiple datasets generated across both graphs unblocked multiple experiments that would otherwise have had to wait for real data to be ready.
The following teams and experiments consumed datasets produced by the generator, each linking to the corresponding technical report where results are documented in detail:
Dataset: Two versions of the V27 (13k rows and 90k) — schema: brand, interest-and-other-attributes, platform, geo, campaign objective, device, gender-generation, media-buy attributes
Use: Training and benchmarking ML models, spanning both traditional and deep learning architectures, to classify campaign performance as positive, average, or negative before any real budget is committed
Dataset: Multiple versions of the V28 (baseline, inverse pairs, middle pairs and different configurations of noises) — schema: brand, audience, creative, platform, geo
Use: Benchmarking federated learning against centralised training for multimodal campaign outcome classification, across varying numbers of clients, noise levels, and cross-modal relationship complexity; synthetic data was essential here as raw campaign data cannot be shared across organisational boundaries by design
Use: Applying Google AlphaEvolve’s evolutionary search framework to automatically discover superior neural architectures, loss functions, and hyperparameters for the campaign performance prediction model, using the internal base model as seed — across synthetic datasets of varying difficulty levels and real data as final validation.
Model leaderboard
To consolidate results across the multiple teams and experiments consuming synthetic datasets, a shared Model Leaderboard UI was built as a Streamlit application. The app ingests evaluation logs produced by every model execution and surfaces them in a unified, filterable interface, ranked by the score.
The dataset version selector on the left panel allows direct comparison across synthetic dataset generations, making it straightforward to assess how model performance evolved as dataset quality, difficulty, and schema improved over time. This made it possible to track progress across teams working independently, identify which architectures and training configurations performed best on a given dataset version, and establish a common benchmark reference point without requiring any manual coordination between experiments.
Model Leaderboard UI
8.2 Real Use Case Example
Evaluating graph edge density to choose the fraction range constraints
Following the diagnostic procedure described in Section 6.1, a density analysis was conducted on the purpose-built graph comprising 746 nodes, 36,420 positive edges, and 40,629 negative edges. Across 10,000 randomly sampled subgraphs drawn using the prescribed type quotas and no edge-fraction constraints, the following distribution was observed:
| Metric | Mean | p5 | p25 | p50 | p75 | p95 |
|---|---|---|---|---|---|---|
| pos_frac | 0.175 | 0.090 | 0.133 | 0.170 | 0.210 | 0.275 |
| neg_frac | 0.210 | 0.116 | 0.167 | 0.206 | 0.248 | 0.319 |
| absent_frac | 0.615 | 0.513 | 0.582 | 0.625 | 0.654 | 0.690 |
Graph density analysis
The results show a moderately sparse graph (roughly 61% of node pairs carry no edge), where positive and negative edge densities are naturally close to each other. This leaves limited room to position label boundaries without overlap. If the pos and neg label constraints overlap significantly, the generator produces structurally indistinguishable samples, undermining the dataset’s discriminative value.
Based on the observed percentiles, the following label_params were derived:
The label bounds used in this run were set slightly wider than the diagnostic-suggested baseline to provide the sampler with more feasible solution space, particularly for the avg class where the diagnostic revealed a narrow overlap region between positive and negative fractions:
The generation run achieved a 100% in-range rate across all three label classes, with zero average deviation from the requested bounds. Every one of the 13,000 generated samples satisfies its configured edge-fraction constraints, confirming that the calibrated label bounds derived from the density diagnostic were both feasible and well-matched to the graph’s structural properties. The sampler converged cleanly across all label classes, with no samples requiring rejection or resampling.
{
"brand": ["aperol"],
"campaign-objective": ["Video views"],
"campaign-buying-type": ["Auction"],
"platform": ["audience network rewarded video"],
"device": ["mobile web iphone"],
"geo": ["Costa Rica", "Gambia"],
"gender-generation": [
"Baby Boomers",
"Gen X",
"Gen Z",
"Male",
"Millennials",
"Gen Alpha"
],
"interest-and-other-attributes": [
"Entrepreneurship and Sales",
"Lingerie and Intimate Wear",
"Luxury German Automobiles",
"Sci-Fi and Space Exploration",
"Weddings and Marriage"
],
"media-buy-billing-event": ["Thruplay"],
"media-buy-cost-model": ["Automatic objective"],
"label": "pos"
}
9. Limitations and Open Issues
The current system has several limitations that should be addressed in future work:
Lack of empirical validation for generated edges
The graph’s edges are assumed to encode reliable marketing compatibility signals, but this assumption is never empirically tested against real campaign outcomes. The controlled correctness check validates only a small set of manually curated obvious cases, covering clear positives and clear negatives. This confirms the LLM can handle unambiguous pairings, but says nothing about the accuracy of the thousands of borderline edges that make up the bulk of the graph.
Graph over-connectivity in brute-force
The brute-force approach frames every attribute pair as worth relating, introducing framing bias and producing denser graphs than the real compatibility space warrants. Low-confidence edges that a more selective approach would discard pass through classification, potentially propagating spurious marketing relationships into downstream datasets and models.
HDBSCAN sensitivity
The cluster-based approach is sensitive to embedding space density and HDBSCAN parameterisation. At small catalogue sizes, sparse embedding spaces produce fragile cluster geometries that may reflect noise rather than genuine semantic similarity. These parameters are not currently exposed as configurable inputs, but this can be addressed in future implementations.
Synthetic dataset generator requires a dense graph
The sampler depends on the input graph having sufficient positive and negative edges to satisfy configured fraction bounds. Sparse graphs (particularly in negative edges) directly constrain the feasible label configuration space and can cause generation to fail or degrade.
10. Recommendations for Next Quarters
Productionise the cluster-based approach with safe self-service access
The cluster-based implementation has not yet been integrated into the production API. Productionising it should be accompanied by the guardrails necessary for safe self-service use: per-run call budgets, pre-flight cost estimation from catalogue size, input validation, and retry logic. Both steps should be treated as a single delivery, since opening the service to other teams without usage controls risks runaway API consumption regardless of which generation strategy is exposed.
Alongside this, a set of currently hardcoded parameters should be exposed as service configuration. Clustering behaviour, such as HDBSCAN minimum cluster size and UMAP output dimensions, directly affects edge quality and should be tunable for teams with atypical catalogue structures. LLM customisation, such as model selection, temperature, prompts, and batch size, should also be enabled, allowing teams to trade cost against output quality based on their use case.
Run the graph generator against the production catalogue
The 49 datasets documented in this report were generated from manually assembled graphs. The next step is running the graph generator against the production attribute catalogue, which would constitute the first end-to-end test of the full system under real operational conditions.
Grounded graph generation with a Knowledge Augmented Generation framework
Rather than relying solely on the LLM’s pre-trained knowledge to score attribute pairs, a KAG framework would construct a structured knowledge base from real sources. This knowledge base would be exposed to the LLM through a context management layer at classification time, enabling grounded decisions. When scoring a pair like Brand X ↔ Audience Y, the LLM would receive retrieved context from multiple grounded sources: historical campaign performance, expert rules and market considerations, and BAV survey data capturing empirical brand-audience affinities. This directly addresses the most critical pipeline limitation: the lack of empirical validation for generated edges.
Open source the synthetic data pipeline
Open sourcing the pipeline, or at a minimum the graph generation framework, would invite external contributions, surface edge cases from diverse catalogue structures, and position the team as contributors to the broader ML and marketing intelligence community.
11. References
Van Can, A. T., Aydemir, F. B., & Dalpiaz, F. (2025). One size does not fit all: On the role of batch size in classifying requirements with LLMs. In Proceedings of the 2025 IEEE 33rd International Requirements Engineering Conference Workshops (REW 2025) (pp. 30–39). IEEE.
Tam, Z. R., Wu, C.-K., Tsai, Y.-L., Lin, C.-Y., Lee, H.-Y., & Chen, Y.-N. (2024). Let me speak freely? A study on the impact of format restrictions on performance of large language models. arXiv:2408.02442. https://doi.org/10.48550/arXiv.2408.02442
Delahaye, D., Chaimatanan, S., & Mongeau, M. (2019). Simulated annealing: From basics to applications. In M. Gendreau & J.-Y. Potvin (Eds.), Handbook of Metaheuristics (Vol. 272, pp. 1–35). Springer. https://doi.org/10.1007/978-3-319-91086-4_1
Appendix A – Prompts for brute-force approach
_CLASSIFICATION_SCALE = """CLASSIFICATION SCALE:
- NEGATIVE SIGNAL (mismatch): The combination creates friction, wastes ad spend, targets conflicting demographics, or causes brand dissonance.
0 NEUTRAL or NO SIGNAL: No meaningful impact, standard baseline performance, or completely unrelated.
+ POSITIVE SIGNAL (synergy): The combination clearly enhances the campaign, has strong audience overlap, and drives higher ROI.
EVALUATION MINDSET:
Do not invent creative edge cases to make a pairing work.
If a pairing requires mental gymnastics or an unconventional strategy to succeed, it is a mismatch (-).
Assign (+) only when the strategic fit is obvious and natural.
Default to (0) if unsure.
"""
BATCH_CLASSIFICATION_PROMPT = (
"You are a social media campaign expert.\n\n"
"TASK INTRODUCTION\n"
"You will evaluate how a primary marketing campaign attribute relates to "
"a set of other attributes from different categories.\n\n"
"PRIMARY ATTRIBUTE: \"{primary_attr}\"\n"
"CATEGORY: {primary_mod}\n\n"
"For each numbered attribute below, decide whether pairing it with the primary attribute "
"in a marketing campaign creates synergy (+), a mismatch (-), or neither (0).\n\n"
"ATTRIBUTES TO EVALUATE:\n"
"{attr_list}\n\n"
+ _CLASSIFICATION_SCALE
+ "\n\n"
"RESPONSE FORMAT:\n"
"Return ONLY a JSON object mapping each attribute's BRACKET NUMBER to its sign.\n"
"Use the numbers shown in [brackets] before each attribute. Do NOT use attribute names as keys.\n\n"
"Example — if attributes are \"[1] Attribute-B\", \"[2] Attribute-C\", \"[3] Attribute-D\":\n"
"{{\"1\": \"+\", \"2\": \"-\", \"3\": \"0\"}}\n\n"
"JSON OUTPUT:"
)
Appendix B – Prompts for cluster-based approach
ENRICHMENT_PROMPTS = {
"Brand": (
"Describe {value} across these dimensions, being specific about "
"what makes it DIFFERENT from other brands. Be concrete and avoid "
"generic marketing language.\n\n"
"1. Core identity: what it fundamentally stands for (not a tagline)\n"
"2. Who would NEVER be its customer (anti-audience)\n"
"3. Price tier vs direct competitors "
" (budget / mid / premium / luxury)\n"
"4. Cultural associations, lifestyle, or subculture it belongs to\n"
"5. Any restrictions, controversies, or advertiser safety concerns.\n\n"
"Answer in 6-7 sentences total."
),
"Audience": (
"Describe this audience segment for marketing targeting purposes.\n"
"Be specific and differentiating — assume this description will be used\n"
"to separate this segment from all other segments in a clustering algorithm.\n\n"
"Cover these dimensions:\n"
"1. Demographics & income: age range, income tier, life stage\n"
"2. Purchase behavior: price sensitivity, brand loyalty, research depth\n"
"3. Core motivation: what fundamentally drives their purchases?\n"
"4. Digital behavior: which platforms they actively use vs avoid\n"
"5. What they are NOT: name 2 audience types they are commonly confused\n"
" with, and explain precisely what distinguishes them\n\n"
"The last point is critical — explicitly contrast this segment\n"
"against its nearest neighbours.\n\n"
"Audience segment: {value}"
),
"Geo": (
"Describe {value} for digital advertising targeting purposes.\n"
"Be balanced and specific — describe the FULL population, not just "
"the most attractive or dominant segment.\n\n"
"Cover these dimensions:\n"
"1. Income distribution: what proportion of the online population is "
" budget/price-sensitive, mid-income, premium, and high-income? "
" Name specific segments that exist (e.g. deal-hunters, middle class, "
" luxury consumers) — do not flatten this into a single tier.\n"
"2. Platform landscape: which platforms dominate, which are absent or "
" blocked, and which are growing.\n"
"3. Consumer attitudes: cultural attitudes toward advertising, brand "
" trust, and price sensitivity across different segments.\n"
"4. Language and regional fragmentation: languages spoken, regional "
" differences in purchasing behaviour.\n"
"5. Regulatory or content restrictions advertisers face here.\n\n"
"Answer in 5-6 sentences total. Do not focus only on high-value "
"consumers — the description must reflect the full spectrum."
),
"Platform": (
"Describe {value} across these dimensions, focusing on what makes it "
"DISTINCT from other digital platforms for advertisers.\n"
"- Core user demographic (age, income, intent)\n"
"- Content format and consumption behaviour (passive scroll vs active search)\n"
"- Price tier of content that performs well: explicitly mention whether "
" budget/dupe/deal content thrives alongside premium content, or if the "
" platform skews exclusively toward one end.\n"
"- Ad formats available and their typical performance characteristics\n"
"- Categories of brands that perform well vs categories that underperform\n"
"- Brand safety profile and content moderation reputation\n"
"Answer in 4-5 sentences total."
),
}
ENRICHMENT_FALLBACK_PROMPT = (
"Describe '{value}' (type: {attribute_type}) focusing on what makes it DISTINCT "
"from similar entities: who it targets, who it excludes, price positioning, "
"cultural associations, and any restrictions or controversies. "
"Be concrete and avoid generic marketing language. Answer in 4-5 sentences."
)
CLUSTER_LABEL_PROMPT = """
You are a marketing strategy expert.
The following {modality} attributes were grouped together based on semantic similarity:
{members}
Provide in JSON with keys "label" (2-4 words), "summary" (one sentence),
"characteristics" (array of exactly 3 strings).
"""
CLUSTER_COMPAT_BATCH_PROMPT = """You are a media planning expert. Score how well a source cluster performs with each target cluster in an ad campaign.
SOURCE — {modality_a}: {label_a}
{desc_a}
Members: {members_a}
TARGET CLUSTERS:
{targets}
For each cluster target, return a compatibility score (0–10) or null if the combination is implausible, prohibited, or has no shared business context.
- 0–3 : Negative — friction, conflicting demographics, brand dissonance
- 4–6 : Neutral — baseline / no meaningful impact
- 7–10: Positive — clear synergy, strong audience overlap, higher ROI (only assign 10 if you believe all attributes in the target cluster would perform well with the source cluster)
- null : Not a viable campaign combination (not the same as a low score)
RESPONSE FORMAT:
Return ONLY a JSON object mapping each cluster's BRACKET NUMBER to its compatibility score (integer 0–10) or null.
Use the numbers shown in [brackets] before each cluster. Do NOT use cluster names as keys.
Example — if clusters are "[1] Luxury Fashion Houses", "[2] Mass Market Brands", "[3] Sports Apparel":
{{"1": 8, "2": 3, "3": null}}
JSON OUTPUT:
"""
Unlocking machine learning experiments across multiple teams with a synthetic data pipeline grounded in marketing knowledge
Training Machine Learning (ML) models for marketing usually starts with a hard requirement: labelled data that links campaign settings and attributes to actual performance outcomes. You collect campaigns, look at what combinations of brand, audience, platform, and geography performed well, and train a model to learn from those patterns.
In theory, that sounds straightforward. In practice, real data is hard to clean and structure, accumulates slowly, and only reflects combinations you’ve already run. If you’ve never targeted a certain audience on a certain platform for a certain brand, that example simply doesn’t exist in the dataset. And if multiple teams are waiting on that data before they can even begin experimenting, progress stalls fast.
We ran into exactly that problem.
We needed a way to start training and benchmarking marketing ML systems before AI-ready real campaign data was available at a useful scale. So instead of waiting for the data, we built a synthetic data pipeline that could generate realistic, labelled training data grounded in how marketing actually works.
That pipeline ended up unblocking model experiments across multiple teams.
The Problem With Random Synthetic Data
Real campaign data is essentially rows of campaign attributes (brand, audience, location, platform, placement, creative, and many more) each labelled with how that combination performed. That’s what a model learns from.
This kind of data is easy to fake badly. You can always create random combinations of attributes and assign them labels. But for marketing, random is worse than useless if it ignores real-world compatibility. A luxury brand paired with bargain-hunting audiences, or a B2B enterprise software brand matched with a fashion lifestyle platform, doesn’t help an ML model learn. It teaches the wrong lessons.
So the challenge wasn’t just “generate fake data”. It was:
Capture that marketing knowledge in a structured, machine-readable form
Use that structure to generate realistic campaign configurations at scale
What we needed was a structured way to encode that compatibility: given any combination of campaign settings, does it make sense or not?
Encoding Marketing Knowledge as a Graph
We chose a versatile structure: a graph.
In a marketing knowledge graph:
Nodes represent attribute values for different modalities, such as brand, audience, platform, country, and any other factor that can influence the outcome of a campaign.
Edges represent compatibility between two attributes:
A positive edge (+) means the pair is expected to work well together within the same campaign.
A negative edge (-) means the pair is a bad fit, likely to damage the cohesiveness and the performance of the campaign.
No edge means there’s no meaningful signal, i.e. a neutral fit.
That gives us a machine-readable map of marketing relationships.
Some simple examples:
LinkedIn ↔ C-Suite Executives → positive
Luxury brand ↔ Budget shoppers → negative
Salesforce ↔ TikTok → negative
Adidas ↔ K-pop fans → positive
This structure worked well for three reasons:
It naturally captures many-to-many relationships
It’s easy to extend with new brands, audiences, and platforms
It’s interpretable enough for humans to inspect and validate
Once you have that graph, you can start generating synthetic campaign examples that are constrained by actual compatibility signals instead of randomness.
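A minimal sketch of how such a signed graph could be held in memory, assuming a plain dictionary keyed by unordered attribute pairs. The names `SignedGraph`, `add_edge`, and `sign` are illustrative and not the pipeline's actual API; the example edges are the ones listed above.

```python
from typing import Dict, FrozenSet


class SignedGraph:
    """Hypothetical container: nodes are attribute values, edges carry +1 or -1."""

    def __init__(self) -> None:
        self._edges: Dict[FrozenSet[str], int] = {}

    def add_edge(self, a: str, b: str, sign: int) -> None:
        if sign not in (1, -1):
            raise ValueError("sign must be +1 or -1; neutral pairs get no edge")
        # frozenset makes the key unordered, so (a, b) and (b, a) are the same edge.
        self._edges[frozenset((a, b))] = sign

    def sign(self, a: str, b: str) -> int:
        # A missing edge means "no meaningful signal", reported here as 0.
        return self._edges.get(frozenset((a, b)), 0)


graph = SignedGraph()
graph.add_edge("LinkedIn", "C-Suite Executives", +1)
graph.add_edge("Luxury brand", "Budget shoppers", -1)
graph.add_edge("Salesforce", "TikTok", -1)
graph.add_edge("Adidas", "K-pop fans", +1)

print(graph.sign("TikTok", "Salesforce"))  # lookup order does not matter
print(graph.sign("Adidas", "LinkedIn"))    # no edge stored for this pair
```

Storing only the non-neutral pairs keeps the structure sparse, which matches the "no edge means neutral" convention.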
The Bottleneck: Building the Graph Was Expensive
The obvious way to build this graph was to leverage the capabilities of Large Language Models to classify every possible pair of attributes from a catalogue of brands, audiences, geographies, and other marketing settings of interest.
That approach can work for small catalogues, such as 20 brands, 50 audiences, 10 countries, and 5 platforms. But those are not especially useful in practice, since ML models need data that is both diverse and high-volume.
As the catalogue grows, pairwise combinations quickly become a bottleneck: the number of possible pairs grows quadratically with the number of attributes, so even a moderately sized catalogue creates thousands of cross-modality pairs. That made a brute-force approach too slow and too expensive for routine iteration. Even with batched calls, where one primary attribute is compared against a list of target attributes, the volume would still be too high.
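To make the quadratic growth concrete, the sketch below counts cross-modality pairs for the small catalogue mentioned above (20 brands, 50 audiences, 10 countries, 5 platforms) and for the same catalogue doubled. It assumes only pairs from different modalities are scored; the function name `cross_modality_pairs` is illustrative.

```python
from itertools import combinations
from typing import Dict


def cross_modality_pairs(catalogue: Dict[str, int]) -> int:
    """Count pairs whose two attributes come from different modalities."""
    return sum(n_a * n_b for n_a, n_b in combinations(catalogue.values(), 2))


small = {"brand": 20, "audience": 50, "country": 10, "platform": 5}
doubled = {name: 2 * size for name, size in small.items()}

print(cross_modality_pairs(small))    # 2100
print(cross_modality_pairs(doubled))  # 8400, four times as many
```

Doubling every modality quadruples the number of pairs, which is why per-pair LLM classification stops being practical as the catalogue grows.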
So we needed a way to build the graph without evaluating the entire space of possible combinations.
But that creates an obvious dilemma: how do you find the important pairs without first checking them all?
Two Ways We Approached Graph Generation
To answer that question, we implemented and compared two graph generation strategies.
1. Batched brute-force pair classification
A truly naive strategy would have been to ask the LLM about every single attribute pair one by one, but we did not test that because it is clearly too inefficient to be practical.
Instead, for each valid cross-modality combination, we selected one primary attribute and asked the LLM to classify its relationship to a batch of up to 25 target attributes as positive, negative, or neutral.
This gave us a strong reference point: broad coverage with a simple implementation, useful for evaluating whether a more efficient method could preserve similar graph quality without the same cost.
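The batching step can be sketched as follows, assuming targets are simply chunked into groups of at most 25 and presented as a bracket-numbered list, as the prompt in the appendix expects. The helper names and the rule of treating missing or malformed answers as neutral are assumptions made for illustration; no real LLM call is made here.

```python
import json
from typing import Dict, List

BATCH_SIZE = 25  # upper bound on targets per call, as described above


def make_batches(targets: List[str], size: int = BATCH_SIZE) -> List[List[str]]:
    return [targets[i:i + size] for i in range(0, len(targets), size)]


def numbered_list(batch: List[str]) -> str:
    """Render the batch as '[1] name' lines for the prompt's attribute list."""
    return "\n".join(f"[{i}] {name}" for i, name in enumerate(batch, start=1))


def parse_signs(raw: str, batch: List[str]) -> Dict[str, str]:
    """Map bracket numbers in the model's JSON reply back to attribute names."""
    reply = json.loads(raw)
    signs = {}
    for i, name in enumerate(batch, start=1):
        sign = reply.get(str(i), "0")
        # Assumption: anything other than "+", "-", "0" is treated as neutral.
        signs[name] = sign if sign in {"+", "-", "0"} else "0"
    return signs


targets = [f"Audience-{n}" for n in range(1, 61)]
batches = make_batches(targets)
print(len(batches), [len(b) for b in batches])
print(numbered_list(batches[0][:3]))

fake_reply = '{"1": "+", "2": "-", "3": "0"}'  # stand-in for an LLM response
print(parse_signs(fake_reply, batches[0][:3]))
```

Keying the response by bracket number rather than attribute name keeps the parsing robust when the model paraphrases or truncates long attribute names.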
2. Cluster-first graph generation
The second approach was designed to reduce the search space before asking the LLM to score anything.
Instead of classifying every attribute pair directly, we first:
embedded the attributes, applied UMAP for dimensionality reduction, and grouped the results into clusters,
asked the LLM to batch score compatibility between clusters,
discarded neutral cluster pairs and their attribute combinations,
automatically assigned scores to attribute combinations derived from high-confidence cluster pairs,
and asked the LLM to batch classify only the remaining attribute pairs.
This turned a very large search space into a much smaller one, so the LLM spent time only where useful signals were more likely to exist.
For small catalogues, the efficiency gains are smaller because many attributes end up as singleton clusters, but the same architecture still applies.
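The triage between cluster scoring and attribute-level classification could look roughly like this. The neutral band (4 to 6) and the `null` case follow the cluster compatibility prompt in the appendix; the cut-offs for "high-confidence" pairs (`auto_low`, `auto_high`) are assumptions, since the article does not state them, and the cluster names are placeholders.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]


def triage(
    clusters_a: Dict[str, List[str]],
    clusters_b: Dict[str, List[str]],
    scores: Dict[Pair, Optional[int]],
    auto_low: int = 1,    # assumed cut-off: at or below -> automatic "-"
    auto_high: int = 9,   # assumed cut-off: at or above -> automatic "+"
) -> Tuple[Dict[Pair, str], List[Pair]]:
    """Split attribute pairs into auto-labelled edges and pairs left for the LLM."""
    auto: Dict[Pair, str] = {}
    pending: List[Pair] = []
    for (ca, cb), score in scores.items():
        if score is None or 4 <= score <= 6:
            continue  # implausible or neutral cluster pair: drop its attribute pairs
        members = list(product(clusters_a[ca], clusters_b[cb]))
        if score <= auto_low:
            auto.update({pair: "-" for pair in members})
        elif score >= auto_high:
            auto.update({pair: "+" for pair in members})
        else:
            pending.extend(members)  # borderline: classify attribute by attribute
    return auto, pending


brands = {"Luxury Houses": ["Brand-A", "Brand-B"], "Mass Market": ["Brand-C"]}
audiences = {"Deal Hunters": ["Aud-1", "Aud-2"], "Affluent": ["Aud-3"]}
scores = {
    ("Luxury Houses", "Deal Hunters"): 1,
    ("Luxury Houses", "Affluent"): 10,
    ("Mass Market", "Deal Hunters"): 7,
    ("Mass Market", "Affluent"): 5,
}
auto, pending = triage(brands, audiences, scores)
print(len(auto), "edges assigned automatically;", len(pending), "pairs left for the LLM")
```

In this toy example only two of the nine attribute pairs still need an attribute-level LLM call, which is where the savings come from.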
What Happened When We Compared Them
On a larger catalogue of 160 attributes — 60 brands, 60 audiences, 10 platforms, and 30 countries — the cluster-based approach performed much better operationally.
Compared with brute force, it delivered:
53% fewer LLM calls
50.5% less execution time
90.6% of the total edge volume retained
More importantly, where both methods produced an edge for the same pair, they agreed on the sign 98% of the time. This suggests that the cluster-based approach does not systematically change the meaning of the relationships it recovers.
The main trade-off was coverage: some pairs found by brute force were filtered out before attribute-level scoring, most likely lower-signal or borderline cases.
In practice, this gave us a much cheaper way to generate the graph while preserving the compatibility signal that mattered most.
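The two quality metrics above can be computed along these lines. This is a sketch under the assumption that "edge volume retained" means the ratio of edge counts and that "sign agreement" is measured only on pairs present in both graphs; the toy edge sets are invented for illustration.

```python
from typing import Dict, Tuple

Edges = Dict[Tuple[str, str], str]  # pair keys are assumed to be ordered consistently


def compare(reference: Edges, candidate: Edges) -> Tuple[float, float]:
    """Return (edge volume ratio, sign agreement on shared pairs)."""
    volume_ratio = len(candidate) / len(reference)
    shared = reference.keys() & candidate.keys()
    agreeing = sum(reference[p] == candidate[p] for p in shared)
    agreement = agreeing / len(shared) if shared else float("nan")
    return volume_ratio, agreement


brute_force = {("A", "x"): "+", ("A", "y"): "-", ("B", "x"): "+", ("B", "y"): "-"}
cluster_based = {("A", "x"): "+", ("A", "y"): "-", ("B", "x"): "-"}

volume, agreement = compare(brute_force, cluster_based)
print(f"edge volume retained: {volume:.1%}, sign agreement: {agreement:.1%}")
```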
The scaling advantage becomes even clearer when projected to larger catalogues:
| Catalog Size | Total pairs * | LLM calls * (batched brute-force) | LLM calls * (batched cluster-based) |
| --- | --- | --- | --- |
| 160 attributes | 8,700 | 570 | 265 |
| 320 attributes (2×) | ~34,800 | ~2,280 | ~750 |
| 800 attributes (5×) | ~217,500 | ~14,250 | ~2,960 |
| 1,600 attributes (10×) | ~870,000 | ~57,000 | ~8,400 |
Catalogue specifications
These are directional estimates extrapolated from the 160-attribute experiment. Actual call volumes will vary with catalogue structure, clustering behavior, and graph densities.
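The pair and brute-force call projections are consistent with simple quadratic scaling from the 160-attribute baseline, assuming every modality grows by the same factor. The sketch below reproduces those two columns and derives the relative saving in calls; the cluster-based call counts are taken from the projections as given, not derived.

```python
BASE_ATTRIBUTES = 160
BASE_PAIRS = 8_700        # cross-modality pairs at 160 attributes
BASE_BRUTE_CALLS = 570    # batched brute-force calls at 160 attributes

# Cluster-based call counts as listed in the projections (not derived here).
CLUSTER_CALLS = {1: 265, 2: 750, 5: 2_960, 10: 8_400}

rows = []
for factor, cluster_calls in CLUSTER_CALLS.items():
    pairs = BASE_PAIRS * factor ** 2              # quadratic growth in pairs
    brute_calls = BASE_BRUTE_CALLS * factor ** 2  # calls scale with the pair count
    saved = 1 - cluster_calls / brute_calls
    rows.append((BASE_ATTRIBUTES * factor, pairs, brute_calls, saved))

for attributes, pairs, brute_calls, saved in rows:
    print(f"{attributes:>5} attributes: {pairs:>7,} pairs, "
          f"{brute_calls:>6,} brute-force calls, {saved:.1%} fewer with clustering")
```

Under these figures the relative saving widens as the catalogue grows, because cluster-level scoring grows more slowly than the number of attribute pairs.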
From Graph to Actual Training Data
Once we had a signed graph, the next problem was turning it into an actual labelled campaign performance dataset.
Each row in this dataset represents one synthetic campaign configuration (a combination of attributes drawn from the graph) along with a performance label: pos, neg, or avg. That label is the training target. It describes whether the overall campaign combination is expected to perform well, underperform, or land somewhere in between.
Important note: The label is not the same as a graph edge. Edges score pairs of attributes; the label scores the whole configuration, aggregated across all its edge signs.
Figure 1 – Example of a row from the campaign performance dataset
This dataset is the output of the second service in the pipeline: the Synthetic Dataset Generator. Its job is to create synthetic campaign records from the graph while respecting configurable constraints such as:
how many attributes of each type should appear in each sample,
how many positive, negative, and average examples to produce,
and what proportion of positive vs. negative edges each label class should contain.
For example, a positive sample might require a relatively high fraction of positive edges and a low fraction of negative ones. A negative sample would do the opposite, while an average sample would contain more balanced fractions of both.
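One way to express that labelling rule is sketched below. The thresholds `hi` and `lo` are placeholders, since the actual fractions are configurable and not given in the article, and the attribute names and edges are invented.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

EdgeMap = Dict[FrozenSet[str], int]  # +1 or -1; an absent pair is neutral


def edge_fractions(config: List[str], edges: EdgeMap) -> Tuple[float, float]:
    """Fractions of positive and negative edges among all pairs in the configuration."""
    pairs = [frozenset(p) for p in combinations(config, 2)]
    if not pairs:
        raise ValueError("a configuration needs at least two attributes")
    pos = sum(edges.get(p, 0) > 0 for p in pairs)
    neg = sum(edges.get(p, 0) < 0 for p in pairs)
    return pos / len(pairs), neg / len(pairs)


def label(config: List[str], edges: EdgeMap, hi: float = 0.5, lo: float = 0.2) -> str:
    """Assumed rule: mostly positive -> pos, mostly negative -> neg, otherwise avg."""
    pos, neg = edge_fractions(config, edges)
    if pos >= hi and neg <= lo:
        return "pos"
    if neg >= hi and pos <= lo:
        return "neg"
    return "avg"


edges: EdgeMap = {
    frozenset(("Brand-A", "Aud-1")): 1,
    frozenset(("Brand-A", "Platform-1")): 1,
    frozenset(("Aud-1", "Platform-1")): 1,
    frozenset(("Brand-B", "Aud-1")): -1,
    frozenset(("Brand-B", "Aud-2")): -1,
    frozenset(("Brand-B", "Platform-1")): -1,
}
print(label(["Brand-A", "Aud-1", "Platform-1"], edges))
print(label(["Brand-B", "Aud-1", "Platform-1"], edges))
print(label(["Brand-B", "Aud-2", "Platform-1"], edges))
```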
That gave us complete control over the dataset. The same graph could generate multiple datasets (with different class balances, difficulty levels, noise profiles, and schemas), just by changing configuration, not rebuilding the pipeline.
Simulated annealing: searching the graph efficiently
To efficiently find a valid combination for each dataset row, we used a parallelized simulated annealing sampler. The name comes from annealing in metallurgy, where a material is heated and then cooled in a controlled way to reduce defects and settle into a more stable structure.
Our algorithm follows the same idea. It starts in a “hot” state, exploring many possible campaign configurations and even accepting imperfect ones early on. As it cools, it becomes more selective, swapping attributes in and out until each sample settles into a configuration that satisfies the requested constraints.
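A single-threaded sketch of the idea, not the parallelized sampler itself. It assumes one attribute per modality, an energy defined as the distance between the configuration's edge fractions and the requested ones, a Metropolis acceptance rule, and geometric cooling. All parameter values and names are illustrative.

```python
import math
import random
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

EdgeMap = Dict[FrozenSet[str], int]


def energy(config: List[str], edges: EdgeMap, target_pos: float, target_neg: float) -> float:
    """Distance between the configuration's edge fractions and the requested ones."""
    pairs = [frozenset(p) for p in combinations(config, 2)]
    pos = sum(edges.get(p, 0) > 0 for p in pairs) / len(pairs)
    neg = sum(edges.get(p, 0) < 0 for p in pairs) / len(pairs)
    return abs(pos - target_pos) + abs(neg - target_neg)


def anneal(catalogue: Dict[str, List[str]], edges: EdgeMap, target_pos: float,
           target_neg: float, steps: int = 500, t_start: float = 1.0,
           cooling: float = 0.99, seed: int = 0) -> Tuple[List[str], float]:
    rng = random.Random(seed)
    modalities = list(catalogue)
    current = [rng.choice(catalogue[m]) for m in modalities]
    current_e = energy(current, edges, target_pos, target_neg)
    best, best_e, temp = list(current), current_e, t_start
    for _ in range(steps):
        i = rng.randrange(len(modalities))            # pick one modality to change
        candidate = list(current)
        candidate[i] = rng.choice(catalogue[modalities[i]])
        cand_e = energy(candidate, edges, target_pos, target_neg)
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if cand_e <= current_e or rng.random() < math.exp((current_e - cand_e) / temp):
            current, current_e = candidate, cand_e
            if current_e < best_e:
                best, best_e = list(current), current_e
        temp *= cooling                               # cooling makes the search more selective
    return best, best_e


catalogue = {"brand": ["Brand-A", "Brand-B"], "audience": ["Aud-1", "Aud-2"],
             "platform": ["Platform-1", "Platform-2"]}
edges: EdgeMap = {
    frozenset(("Brand-A", "Aud-1")): 1, frozenset(("Brand-A", "Platform-1")): 1,
    frozenset(("Aud-1", "Platform-1")): 1, frozenset(("Brand-B", "Aud-2")): -1,
    frozenset(("Brand-B", "Platform-2")): -1,
}
config, final_energy = anneal(catalogue, edges, target_pos=1.0, target_neg=0.0)
print(config, final_energy)
```

Accepting some worse moves while the temperature is high lets the sampler escape configurations that look good locally but cannot reach the requested edge fractions by single swaps.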
Downstream Impact and ML Experiments Unlocked
This service was not just a technical exercise. Its purpose was to unblock machine learning workstreams while real campaign data was still limited, not ready, or missing key combinations. Without it, multiple experiments would have been blocked.
The Synthetic Dataset Generator produced 49 synthetic datasets, built from multiple graph versions and configurations. Those datasets were used to both train and stress-test models across different teams and modelling approaches. Each dataset varied in class balance, difficulty, and noise to probe how models behaved under pressure. Experiments included:
Campaign performance prediction
Federated learning experiments
Architecture search and model benchmarking
Comparisons between fine-tuned LLMs and custom classifiers
We also built a shared model leaderboard so teams could compare results across dataset versions and training approaches without manual coordination.
That created a common experimental foundation before real data was fully ready.
What Synthetic Data Did (and Didn’t) Solve
Synthetic data was an accelerator, not a replacement for real data.
It let us:
start ML experiments earlier,
benchmark model architectures,
explore dataset schemas,
test class balance and difficulty settings,
and support teams that otherwise would have had to wait.
But it also has several limitations:
The biggest one is that graph edges are still inferred, not directly validated against large-scale real campaign outcomes. We verified obvious cases, but many of the more ambiguous relationships remain assumptions generated from LLM reasoning rather than empirical evidence.
References
Van Can, A. T., Aydemir, F. B., & Dalpiaz, F. (2025). One size does not fit all: On the role of batch size in classifying requirements with LLMs. In Proceedings of the 2025 IEEE 33rd International Requirements Engineering Conference Workshops (REW 2025) (pp. 30–39). IEEE.
Tam, Z. R., Wu, C.-K., Tsai, Y.-L., Lin, C.-Y., Lee, H.-Y., & Chen, Y.-N. (2024). Let me speak freely? A study on the impact of format restrictions on performance of large language models. arXiv:2408.02442. https://doi.org/10.48550/arXiv.2408.02442
Delahaye, D., Chaimatanan, S., & Mongeau, M. (2019). Simulated annealing: From basics to applications. In M. Gendreau & J.-Y. Potvin (Eds.), Handbook of Metaheuristics (Vol. 272, pp. 1–35). Springer. https://doi.org/10.1007/978-3-319-91086-4_1
Every ad campaign generates data. Many practitioners use this data to answer “how” questions: How did performance trend this week? How did this campaign compare with last quarter? But this data contains something deeper. It captures what makes some campaigns succeed and others fail through the specific combination of elements that define them: the audience, geography, platform, creative, and device. The Campaign Intelligence Dataset is designed to uncover that insight at scale. It collects, enriches, and harmonises campaign data from multiple marketing platforms into a unified, AI-ready schema, while addressing technical challenges and respecting hard client and legal constraints related to data access, isolation, and privacy. The result is a continuously growing asset with millions of data points across all major markets, used to power multiple pods across WPP Research.
Turning Campaign Data into a Strategic AI Advantage
1. Motivation: The opportunity hidden in campaign data
Every ad campaign generates data that, beyond performance metrics, encodes something valuable that most organisations overlook: what worked and what didn’t, across audiences, geographies, creatives, and platforms.
What creative approach resonates best with this audience?
Which zip codes should a luxury watch brand target?
How should spend shift when a platform’s algorithm changes?
These aren’t hypothetical questions. They are answerable, but only if the underlying data is structured to answer them. This is exactly the motivation behind the Campaign Intelligence Dataset that we have built in WPP Research.
The dataset achieves three things:
Collects campaign marketing data across platforms
Enriches it with derived fields useful for AI training
Harmonises it into a unified, AI-ready schema, while respecting all applicable legal and client constraints related to data access, isolation (as certain datasets can never be mixed with others), and privacy.
We refer to every input that shapes a campaign setup as a modality: a single dimension that influences whether an ad performs well or poorly. Examples of campaign modalities tracked today include geography, audience, platform, placement, creative, device, and spend.
A modality combination is one specific configuration drawn from each dimension. A simple example:
Geo: UK
Audience: Women 25–34
Platform & Placement: Instagram feed
Creative: 15-second sun-lit outdoor lifestyle video
Device: iPhone
In practice, the values of each modality can vary significantly across campaigns. Geo can be encoded as a country or a granular set of zip codes. Audience can be expressed in terms of simple demographics or be a free-text, highly descriptive paragraph. Creatives can range from simple images and banners to short cinematic-style videos.
Each combination is also attached to outcome metrics such as impressions, clicks, conversions, and revenue, which measure how that combination performed.
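As an illustration, a single modality combination with its outcome metrics could be represented as a record like the one below. The class and field names are hypothetical and do not reflect the actual schema; the metric values are made up.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CampaignRecord:
    """Hypothetical row: one modality combination plus its outcome metrics."""
    geo: str
    audience: str
    platform_placement: str
    creative: str
    device: str
    impressions: int
    clicks: int
    conversions: int
    revenue: float

    @property
    def ctr(self) -> float:
        # Click-through rate, a derived metric; 0.0 when there were no impressions.
        return self.clicks / self.impressions if self.impressions else 0.0


row = CampaignRecord(
    geo="UK", audience="Women 25–34", platform_placement="Instagram feed",
    creative="15-second sun-lit outdoor lifestyle video", device="iPhone",
    impressions=10_000, clicks=250, conversions=12, revenue=480.0,  # made-up numbers
)
print(asdict(row))
print(f"CTR: {row.ctr:.2%}")
```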
2. Challenges: From Access Hurdles to Privacy at Scale
Enterprise campaign data doesn’t arrive clean, unified, or ready for training. Between raw platform tables and an AI-ready dataset sits a gauntlet of practical problems.
Getting to the data is the first obstacle. Source data lives behind governance gates. Access requires building trust with stakeholders, making a clear case for the dataset’s value, and operating entirely within approved cloud environments and client security policies. There are no shortcuts here; this is relationship work as much as engineering work.
Once you have access, finding the signal is its own challenge. Marketing platforms expose dozens of tables with hundreds of fields. Which ones are useful for model training? Which require feature engineering, such as extracting targeting attributes from raw specs, mapping age ranges to generational cohorts, and normalising geography hierarchies? Which joins introduce duplicates? Which campaigns should be excluded entirely? Thankfully, AI has enabled tools that can help us tackle such challenges in a much more efficient way. WPP Research has multiple agentic initiatives to develop and leverage such tools. Examples include:
The Data Discovery Agent initiative, focused on autonomously exploring the vast and diverse space of data sources to identify the most relevant data for each task.
The Data Quality Assurance Agent project, focused on continuously monitoring the data we compile for both structural and logical quality issues.
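To give a flavour of the feature engineering mentioned above, here is a sketch of mapping an age range to a generational cohort. The birth-year boundaries are a commonly used convention and an assumption on our part, as is the midpoint rule; the dataset's actual cohort definitions are not stated here.

```python
from typing import Optional

# Assumed birth-year boundaries (a commonly used convention, not the dataset's own).
COHORTS = [
    ("Gen Z", 1997, 2012),
    ("Millennials", 1981, 1996),
    ("Gen X", 1965, 1980),
    ("Baby Boomers", 1946, 1964),
]


def cohort_for_age_range(age_min: int, age_max: int, reference_year: int) -> Optional[str]:
    """Assign the cohort containing the midpoint birth year of the age range."""
    midpoint_birth_year = reference_year - (age_min + age_max) // 2
    for name, start, end in COHORTS:
        if start <= midpoint_birth_year <= end:
            return name
    return None  # outside the cohorts listed above


print(cohort_for_age_range(25, 34, reference_year=2024))
print(cohort_for_age_range(45, 54, reference_year=2024))
```

Age ranges that straddle two cohorts are the awkward case; a midpoint rule is the simplest choice, but an overlap-weighted assignment would be a reasonable alternative.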
Finally, there is the tension between scale and data separation. As the dataset matures, the opportunity grows: more platforms, deeper histories, and greater granularity. Yet not all data can be, or is allowed to be, mixed. Our architecture is designed to respect those boundaries fully, applying common standards, schemas, and methodologies across assets while keeping the underlying data appropriately ring-fenced.
3. How the data flows
The diagram below illustrates the conceptual stages that a single client’s data passes through, from source platforms to AI-ready output. Each client’s data follows this path independently, within its own isolated environment.
The pipeline is designed so that each client’s data is processed end-to-end within its own boundary. The shared element is the methodology and engineering, not the data itself. The key stages are:
Data ingestion. Campaign data is pulled from marketing platform APIs (Meta, TikTok, etc.) and routed into a client-specific storage environment.
Isolated client data store. Each client’s data resides in its own segregated environment, ensuring strict separation at every layer. No data is co-mingled or shared across clients.
Transformation and enrichment. A standardised pipeline reads from the client’s own data store, enriches fields, applies cleaning rules, and engineers features. The pipeline runs on a regular cadence.
Campaign Intelligence Dataset. The output: a compliant, harmonised, AI-ready dataset. Clean, structured, and securely ring-fenced.
AI model training. WPP Research pods use the data to train and validate models for various research projects focused on performance forecasting, creative analysis, and more.
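The "shared methodology, isolated data" principle behind these stages can be sketched as one pipeline function applied to each client's store separately. Everything here is hypothetical: the field names in `FIELD_MAP` are invented for illustration and are not real platform API fields, and the in-memory dictionaries stand in for segregated storage environments.

```python
from typing import Dict, List

# Hypothetical per-platform field names mapped onto one shared schema.
FIELD_MAP = {
    "meta": {"country": "geo", "impr": "impressions", "link_clicks": "clicks"},
    "tiktok": {"region": "geo", "show_cnt": "impressions", "click_cnt": "clicks"},
}


def harmonise(platform: str, raw: Dict[str, object]) -> Dict[str, object]:
    """Rename platform-specific fields to the shared schema and tag the source."""
    mapping = FIELD_MAP[platform]
    row = {target: raw[source] for source, target in mapping.items()}
    row["platform"] = platform
    return row


def run_pipeline(client_store: List[dict]) -> List[Dict[str, object]]:
    """Process one client's store; nothing outside that store is read or written."""
    return [harmonise(rec["platform"], rec["payload"]) for rec in client_store]


# Each client has its own store and gets its own output; stores are never merged.
stores = {
    "client_a": [{"platform": "meta",
                  "payload": {"country": "UK", "impr": 1000, "link_clicks": 30}}],
    "client_b": [{"platform": "tiktok",
                  "payload": {"region": "DE", "show_cnt": 500, "click_cnt": 5}}],
}
outputs = {client: run_pipeline(store) for client, store in stores.items()}
print(outputs["client_a"])
print(outputs["client_b"])
```

Both outputs share the same schema, so downstream models can be built with the same code, while the records themselves never leave their client boundary.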
4. Impact
A large and diverse dataset that compounds over time. The data covers an ever-increasing population of platforms, audiences, geographies, creatives, and other modalities, all attached to rich performance metrics.
AI-ready from day one. A common harmonised schema design captures key modalities alongside outcome metrics, purpose-built for model training. Multiple WPP Research pods already consume these datasets for forecasting, performance prediction, and creative analysis.
Privacy by design, enforced by architecture. Data isolation is not a policy overlay; it is embedded in the pipeline’s design. Access controls, segregation, and governance rules ensure that each client’s data is governed, processed, and analysed independently, while respecting all applicable constraints.
5. Lessons learned and what’s next
Building the Campaign Intelligence Dataset surfaced a few principles worth naming:
Centralise the methodology, not the data. The pipeline design, schema definitions, and engineering tooling are reused across engagements, but each data asset is processed and stored in complete isolation.
Domain knowledge isn’t optional. Understanding the business cases behind the data is what catches the edge cases and quality issues that pure engineering misses.
Treat data policy as architecture, not overhead. Vigilance about who can access the dataset, why, and under what conditions is a design constraint baked into every layer of the system.
With a strong foundation set, the next priorities are:
Field enrichment. Continue exploring platform documentation for additional attributes that help understand what drives performance.
New platforms. Onboard additional social media and DSP platforms to uncover more nuanced patterns across a broader set of channels.