Blog

  • Cracking the Code of Campaign Success with Google’s AlphaEvolve Agent

    In the fast-paced world of digital marketing, one deceptively simple question keeps resurfacing: “What knowledge can we extract from successful past campaigns to make better future marketing decisions?”

    Every brand sits on a goldmine of historical campaign data: thousands of images, videos, and overall campaign configurations that either soared or sank. The challenge isn’t a lack of information; it’s injecting that knowledge at the precise moment the next decision is being made. How do we operationalise lessons learned to answer questions like:

    • Prediction
      “Given Brand A, a target region of São Paulo, a set of creatives featuring outdoor sports imagery, and an audience group of millennials aged 25–34, how well is the campaign expected to perform?”
    • Recommendation
      “Given Brand B and a target region of Milan, what should the creatives (videos/images) look like to maximise engagement among environmentally conscious consumers aged 15–18?”

    A common suggestion is to simply “ask an AI.” While modern Large Language Models (LLMs) are remarkably capable and encode broad real-world knowledge, they lack the tribal knowledge embedded in your proprietary data. They don’t know your specific brand voice, your audience’s unique quirks, or the subtle patterns behind your past failures. To truly win, you need a system that learns from your history—the hits, the misses, and everything in between.

    To address this, the WPP Research team invests significant effort in developing prediction and recommendation models trained on large and diverse volumes of historical campaign data. These models are highly competitive and continuously improving. However, at some point during development, progress inevitably hits a plateau: even incremental gains—rarely exceeding 1%—demand extensive bibliographic research, days or even weeks of trial-and-error experimentation, and painstaking fine-tuning.

    With time at a premium, a vast space of possible improvements to explore (architectural changes, hyperparameter tuning), and experiments that are inherently slow to run, we turned to Google’s AlphaEvolve (AE) [1]: a Gemini-powered agentic framework that reframes model development as an evolutionary search problem. Rather than relying on manual experimentation, AlphaEvolve autonomously proposes, evaluates, and refines candidate model architectures in an iterative loop, guided by the expertise of our Data Science team and grounded in objective performance metrics.

    The results are striking: what weeks of manual experimentation struggled to improve by a single percentage point, AlphaEvolve achieved in a fraction of the time, delivering prediction accuracy gains of up to 10% on both synthetic and real datasets, while simultaneously lifting downstream recommendation scores up to 7%.

    Our access to AlphaEvolve came through Google’s Early Access Program (EAP), within the context of the ongoing partnership between Google and the WPP Research team. Throughout our adventures with AlphaEvolve, we have been collaborating closely with the Google Research team, providing and receiving feedback. This collaboration has been invaluable to the project’s success.

    The AlphaEvolve Advantage

    Building a good AI model is painfully slow. A team of experts reads through mountains of research papers, rewrites code by hand, and runs experiments that can take days, only to find the improvement is tiny, or worse, a dead end. This research → code → test → repeat cycle creates a huge gap between having data and actually getting value from it.

    And even after you pick a model architecture, you still have to tune it. Think of it like adjusting the equalizer on a stereo: dozens of sliders, each affecting the sound, and you’re trying to find the perfect combination by ear. Techniques like grid search and Bayesian optimization help, but they’re still limited by what the human designer guesses might work. Not what the data actually needs. Trying every possible combination? Far too expensive and slow.

    The honest truth is that the search space is simply too vast for human intuition and trial-and-error to navigate. This is exactly where AlphaEvolve (AE) changes the game.

    Instead of a person manually tweaking one model at a time, AE treats the entire development process as an evolutionary search. Much like natural selection, but for code. It generates candidate models as functional programs, runs them, and scores each one against a target metric. It doesn’t just tune models. It designs them from scratch.

    Under the hood, AE is powered by Google’s state-of-the-art Gemini model, working hand-in-hand with a curated program database from Google DeepMind. Together, they explore millions of possible code configurations, zeroing in on the most accurate solution that meets our constraints. A search of this breadth would take a human team months. AlphaEvolve does it in a fraction of the time.

    By shifting from manual experimentation to this autonomous framework, we don’t just speed things up. We uncover strategies and architectures that human intuition alone would never find. Figure 3 illustrates this iterative loop in action.

    Figure 3 \ AlphaEvolve is a Gemini-powered coding agent from Google that automatically improves algorithms through a “generate, test, and refine” loop. The user provides three inputs: a description of the problem, a way to score candidate solutions, and a starting program to build from. AlphaEvolve then proposes many code variations using Gemini, scores each one automatically, and keeps the best-performing ideas—recombining and evolving them over multiple rounds, much like natural selection. With each cycle, the solutions get sharper, often surpassing what the original starting point could achieve.

    Guided Evolution: The Human in the Loop

    AlphaEvolve is autonomous, but it is not unsupervised. Think of it as digital evolution: the AI proposes ideas and keeps only the winners to build upon in the next generation. This process still requires careful navigation by our Data Scientists, who provide clear system instructions and constraints to guide the search through an infinite landscape of potential improvements, while inspecting for deviations introduced by the stochastic nature of LLMs. The result is a search that stays focused on logical, high-quality architectures and respects the real-world boundaries of the problem we are addressing.

    In the example below, we illustrate the inputs that AE expects from the human in the loop, as well as the output that it produces.

    Input 1: A System Prompt describing the problem and steering evolution toward promising search directions.

    An example system prompt is: “Evolve a training model for a neural network 3-class classifier that achieves high accuracy on a provided dataset. The model must consist of a loss function that… Focus on the multi-objective optimization of the following scores… Consider changing the model architecture to include…”

    Think of the System Prompt as the instruction manual you hand to AlphaEvolve before it starts work. Imagine hiring a highly skilled but very literal engineer. They’re brilliant, but they need a clear, written brief to work from — they won’t assume anything. The System Prompt is that brief. It channels AlphaEvolve’s enormous computational power toward the right problem, in the right direction. It covers:

    • What the job is — e.g., “Build a model that can classify campaign outcomes into three categories.”
    • What the rules are — constraints it must respect, such as how the input data is structured or what the model architecture must look like.
    • Where to focus — specific areas to explore and improve, for example: “Try changing the loss function”.
    • What success looks like — the specific performance goals it should be optimising for (e.g., accuracy scores). This is also why the human expertise of the Data Science team remains critical.

    Input 2: A Seed Program with an initial solution that you hand to AlphaEvolve to improve.

    Rather than asking AlphaEvolve to build something from scratch, you give it a model that already works — and ask it to make it better. The team deliberately marks which parts AlphaEvolve is permitted to experiment with (using special labels in the code), and which parts must remain untouched. The Seed Program represents the accumulated expertise and investment already put into your AI models. AlphaEvolve doesn’t throw that away — it builds on top of it. It’s the difference between renovating a solid building versus demolishing it and starting over.
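    To make this concrete, here is a toy sketch of what a seed program can look like. The model, data, and numbers are illustrative stand-ins, not our production code; the idea is simply that special comment labels fence off the region AlphaEvolve may rewrite, while the scaffolding and scoring harness stay fixed.

```python
# Toy seed program (illustrative; marker labels and model are stand-ins).

# Fixed scaffolding: AlphaEvolve must leave this part untouched.
def load_dataset():
    # Stand-in for the real campaign dataset loader: (features, class) pairs.
    return [([0.1, 0.9], 2), ([0.8, 0.2], 0), ([0.5, 0.5], 1)]

# EVOLVE-BLOCK-START
# AlphaEvolve is permitted to rewrite everything inside this block:
# hyperparameters, loss function, and model architecture.
EPOCHS = 10
PROJ_DIM = 64
WEIGHT_DECAY = 1e-4

def predict(features):
    # Trivial placeholder 3-class classifier.
    if features[0] > 0.6:
        return 0
    if features[1] > 0.6:
        return 2
    return 1
# EVOLVE-BLOCK-END

# Fixed evaluation harness: defines the score AlphaEvolve optimises.
def evaluate():
    data = load_dataset()
    correct = sum(1 for x, y in data if predict(x) == y)
    return correct / len(data)
```

    Everything outside the markers encodes the accumulated expertise that must be preserved; everything inside is fair game for evolution.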

    Input 3: The Target metric that AE will attempt to maximise in order to achieve our objective.

    The Target Metric is essentially how the business defines “better.” This is a critical decision made by the Data Science team — not the AI. If the metric is well-chosen, AlphaEvolve will find solutions that genuinely deliver business value. If it’s poorly defined, the AI could optimize for the wrong thing entirely. Imagine you’re running a sales team and you’ve set a clear goal: maximise the conversion rate. Every change your team tries — new pitch, new pricing, new outreach method — gets evaluated against that one number. If a change improves the conversion rate, you keep it. The Target Metric works exactly the same way for AlphaEvolve. It might be something like “predict campaign performance as accurately as possible” — expressed as a single numerical score. AlphaEvolve runs each candidate model, checks the score, and keeps only the ones that do better. So the Target Metric is the objective, measurable definition of what winning looks like.

    Input 4: The Stopping criteria.

    The Stopping Criteria is simply the pre-agreed rule for when to call it done. Since AlphaEvolve could theoretically keep running and experimenting forever, the team sets clear boundaries upfront for when the experiment should end. A maximum number of rounds — e.g., “Run up to 500 iterations, then stop.” A performance threshold — e.g., “Stop as soon as the model reaches 90% accuracy.” This is like saying: “Once we’ve hit our goal, there’s no need to keep going.”
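    The loop and its stopping rules can be sketched in a few lines. This is a deliberately simplified stand-in (random perturbations instead of Gemini-proposed code changes), just to show how the iteration budget and the performance threshold bound the search:

```python
import random

def evolve(seed_score, max_iters=500, target=0.90):
    """Toy generate-test-refine loop with the two stopping criteria."""
    best = seed_score
    for i in range(max_iters):                 # boundary 1: iteration budget
        # Stand-in for "Gemini proposes a code variant, we score it":
        candidate = min(best + random.uniform(-0.01, 0.03), 1.0)
        if candidate > best:                   # keep only the winners
            best = candidate
        if best >= target:                     # boundary 2: performance threshold
            return best, i + 1
    return best, max_iters
```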

    Output: a ranked list of improved AI models.

    Figure 4 shows a ‘before’ (left) and ‘after’ (right) comparison of a section of a seed program that AlphaEvolve was asked to improve. Changes are highlighted in green. We observe several changes:

    • Training parameters were upgraded. For example, the number of training cycles (EPOCHS) was increased, the model’s internal size (PROJ_DIM) grew, and a regularisation setting (WEIGHT_DECAY) was adjusted. These are the kinds of fine-tuning decisions that would normally take a data scientist considerable time and experimentation to arrive at.
    • The model’s internal logic was redesigned. The component responsible for processing data (the “encoder”) was restructured and even renamed to better reflect its purpose. AlphaEvolve didn’t just tweak numbers. It proposed a more sophisticated architecture. New techniques were introduced.
    Figure 4 \ Example of an evolved code block within a segment that AE is permitted to modify. The function contents are modified, and name changes are reflected appropriately in other code blocks. Note that new training parameter values are also suggested, indicating that hyperparameter tuning happens alongside compatible architectural changes.

    Results: does it actually work?

    AlphaEvolve was applied to two core problems:

    • Performance Prediction, which estimates a campaign’s performance based on its configuration.
    • Performance-aware recommendation, which suggests the optimal way to complete/update a campaign’s configuration, in order to maximise its performance.

    Both models had already reached a highly competitive baseline, with further manual improvements stalling below 1%.

    Datasets

    We evaluated all models on a suite of six datasets: five synthetic (details on the internally developed generation pipeline can be found here) and one real-world. The synthetic datasets span a range of regimes (easy, medium, or hard, depending on the noise profile and class balance; classes with fewer samples are characterized as minority classes): easy and imbalanced (V15), medium and imbalanced (V16), hard and imbalanced (V17), and medium and balanced (V25, V26). The real-world dataset consists of actual historical campaign records and serves as the ultimate validation of whether gains observed on synthetic data transfer to production conditions.

    Prediction

    Three top-performing AE-evolved variants (Centroid_Loss, Cross_Modal_Attn, Focal_Loss) were identified across multiple experiments. All three consistently outperformed the base model across synthetic and real-world datasets.

    To assess model performance we use the industry-standard F1 score, a measure of how good a model is at classification (class ‘POS’ is high-performing, class ‘NEG’ is low-performing, class ‘AVG’ is average-performing). It balances two things: Precision (“When the model says something is positive, how often is it right?”) and Recall (“Out of all the actual positives, how many did the model catch?”). If the model is good at one but terrible at the other, the F1 score will be low. We calculate the F1 score separately for each class (NEG, AVG, POS), then take the plain average to obtain the avg F1-score.
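    For readers who prefer code to prose, the computation is a few lines of plain Python (a sketch equivalent to the standard macro-averaged F1; the labels below are illustrative):

```python
def f1_per_class(y_true, y_pred, cls):
    # Count true positives, false positives, and false negatives for one class.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def avg_f1(y_true, y_pred, classes=("NEG", "AVG", "POS")):
    # Plain (macro) average of the per-class F1 scores.
    return sum(f1_per_class(y_true, y_pred, c) for c in classes) / len(classes)
```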

    • On easy/medium synthetic data (V15, V16): Cross_Modal_Attn achieved the strongest overall performance, reaching 93.09% avg F1-score on V15 (vs. 90.22% baseline) and a striking +11.6 percentage point improvement on the hardest-to-classify minority class POS on V16 (POS F1: 80.20% vs. 68.61%).
    • On the hardest synthetic dataset (V17): Focal_Loss broke through a performance floor that other variants could not — the base model scored 0% on both minority classes (NEG and POS), while Focal_Loss achieved 15.83% and 25.39% respectively.
    • On real-world data: Centroid_Loss delivered the most practically significant gains — +8pp avg F1 (71% vs. 63%), +11.74pp NEG F1, +8.33pp POS F1, and +5.11pp accuracy — validating that AE’s improvements hold on actual production data.

    Across all datasets and variants, gains on minority classes (correctly identifying high-performing and low-performing campaigns) were consistently larger than gains on the majority class — a particularly valuable outcome given that minority-class accuracy is the critical input for the recommendation model.

    Recommendation

    The recommendation model, which relies on the prediction model’s outputs, was evaluated both in isolation and in a fully evolved end-to-end pipeline. The recommendation score (higher is better) measures how good the recommendations are by comparing them against a known “ground truth” (applicable to synthetic datasets). It rewards recommendations that correctly identify high-performing campaign configurations, and penalizes two kinds of failures: i) empty recommendations (the model couldn’t suggest anything) and ii) low-quality recommendations (the model suggested something, but it performs poorly).

    • Swapping in the AE-evolved predictor alone improved recommendation scores meaningfully: +6.5% on easy data (V15), +9.8% on medium data (V16), and lifted the hard dataset (V17) from a score of 0.0 (which essentially means that all recommendations were wrong) to 0.29.
    • Combining the AE-evolved predictor with an AE-evolved recommender produced the strongest results across all datasets, with the fully evolved pipeline achieving scores of 0.5 (V15), 0.4 (V16), and 0.36 (V17) — confirming that the gains from prediction and recommendation evolution are additive.
    • Recommendation improvements of up to 7% were observed when both components were evolved together.

    Conclusion

    AlphaEvolve works — and it works exceptionally well. It represents a meaningful and measurable step forward in model development. Applied to WPP AI Lab’s campaign prediction and recommendation models, which had already reached a performance plateau through conventional means, AlphaEvolve delivered prediction accuracy gains of up to 10% on both synthetic and real datasets, while simultaneously lifting downstream recommendation scores by up to 7%. It surfaces architectural strategies and configurations that lie beyond the reach of human intuition alone, not by replacing the expertise of our Data Science team, but by amplifying it. The human-in-the-loop dynamic remains essential: our scientists shape the search space, define meaningful constraints, and validate the outputs.

    AlphaEvolve does the heavy lifting of exploration. As prediction and recommendation models continue to grow in complexity, AlphaEvolve offers a glimpse of a future where the gap between data collection and model improvement is measured in hours rather than weeks, and where the best-performing systems are not just built by experts, but co-designed with AI.

    This project was a collaboration between the WPP Research team, including Anastasios Tsourtis and Theodoros Lappas, and the AI for Science team at Google Cloud, including (but not limited to) Kartik Sanu, Laurynas Tamulevičius, Nicolas Stroppa, Chris Page, Gary Ng, John Semerdjian, Skandar Hannachi, Vishal Agarwal, Anant Nawalgaria, and Gabriela Hernandez Larios, as well as partners at Google DeepMind.

    References

    1. Novikov, A., Vũ, N., Eisenberger, M., Dupont, E., Huang, P.-S., Wagner, A. Z., Shirobokov, S., Kozlovskii, B., Ruiz, F. J. R., Mehrabian, A., Kumar, M. P., See, A., Chaudhuri, S., Holland, G., Davies, A., Nowozin, S., Kohli, P., & Balog, M. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131 [cs.AI]. https://arxiv.org/abs/2506.13131

    Ready to explore the specifics? Read the full technical report for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.

  • Using Synthetic Data to Train and Stress-Test Marketing Machine Learning Models

    Unlocking machine learning experiments across multiple teams with a synthetic data pipeline grounded in marketing knowledge

    Training Machine Learning (ML) models for marketing usually starts with a hard requirement: labelled data that links campaign settings and attributes to actual performance outcomes. You collect campaigns, look at what combinations of brand, audience, platform, and geography performed well, and train a model to learn from those patterns.

    In theory, that sounds straightforward, but in practice, real data is hard to clean and structure, arrives slowly, takes time to accumulate and only reflects combinations you’ve already run. If you’ve never targeted a certain audience on a certain platform for a certain brand, that example simply doesn’t exist in the dataset. And if multiple teams are waiting on that data before they can even begin experimenting, progress stalls fast.

    We ran into exactly that problem.

    We needed a way to start training and benchmarking marketing ML systems before AI-ready real campaign data was available at a useful scale. So instead of waiting for the data, we built a synthetic data pipeline that could generate realistic, labelled training data grounded in how marketing actually works.

    That pipeline ended up unblocking model experiments across multiple teams.

    The Problem With Random Synthetic Data

    Real campaign data is essentially rows of campaign attributes (brand, audience, location, platform, placement, creative, and many more) each labelled with how that combination performed. That’s what a model learns from.

    This kind of data is easy to fake badly. You can always create random combinations of attributes and assign them labels. But for marketing, random is worse than useless if it ignores real-world compatibility. A luxury brand paired with bargain-hunting audiences, or a B2B enterprise software brand matched with a fashion lifestyle platform, doesn’t help an ML model learn. It teaches the wrong lessons.

    So the challenge wasn’t just “generate fake data”. It was:

    1. Capture that marketing knowledge in a structured, machine-readable form
    2. Use that structure to generate realistic campaign configurations at scale

    What we needed was a structured way to encode that compatibility: given any combination of campaign settings, does it make sense or not?

    Encoding Marketing Knowledge as a Graph

    We chose a versatile structure: a graph.

    In a marketing knowledge graph:

    • Nodes represent attribute values for different modalities, such as brand, audience, platform, country, and any other factor that can influence the outcome of a campaign.
    • Edges represent compatibility between two attributes:
      • A positive edge (+) means the pair is expected to work well together within the same campaign.
      • A negative edge (-) means the pair is a bad fit, likely to damage the cohesiveness and the performance of the campaign.
      • No edge means there’s no meaningful signal. A neutral fit.

    That gives us a machine-readable map of marketing relationships.

    Some simple examples:

    • LinkedIn ↔ C-Suite Executives → positive
    • Luxury brand ↔ Budget shoppers → negative
    • Salesforce ↔ TikTok → negative
    • Adidas ↔ K-pop fans → positive
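    In code, the smallest possible version of this graph is just a signed adjacency map. The sketch below mirrors the examples above (attribute names are illustrative; the real graph holds far more nodes and metadata):

```python
# Signed compatibility graph as an adjacency map: +1 / -1 encode edge signs,
# and a missing entry means "no edge" (neutral fit).
edges = {
    ("platform:LinkedIn", "audience:C-Suite Executives"): +1,
    ("brand:Luxury brand", "audience:Budget shoppers"): -1,
    ("brand:Salesforce", "platform:TikTok"): -1,
    ("brand:Adidas", "audience:K-pop fans"): +1,
}

def compatibility(a, b):
    # Edges are undirected, so look up both orientations; default to neutral.
    return edges.get((a, b), edges.get((b, a), 0))
```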

    This structure worked well for three reasons:

    • It naturally captures many-to-many relationships
    • It’s easy to extend with new brands, audiences, and platforms
    • It’s interpretable enough for humans to inspect and validate

    Once you have that graph, you can start generating synthetic campaign examples that are constrained by actual compatibility signals instead of randomness.

    The Bottleneck: Building the Graph was Expensive

    The obvious way to build this graph was to leverage the capabilities of Large Language Models to classify every possible pair of attributes from a catalogue of brands, audiences, geographies, and other marketing settings of interest.

    That approach can work for small catalogues, such as 20 brands, 50 audiences, 10 countries, and 5 platforms. But those are not especially useful in practice, since ML models need data that is both diverse and high-volume.

    As the catalogue grows, pairwise combinations quickly become a bottleneck. Even a moderately sized catalogue creates thousands of cross-modality pairs, and as the number of attributes increases, the number of possible pairs grows quadratically. That made a brute-force approach too slow and too expensive for routine iteration. Even with batch calls (comparing one primary attribute against a list of target attributes), the cost remained prohibitive.

    So we needed a way to build the graph without evaluating the entire space of possible combinations.

    But that creates an obvious dilemma: how do you find the important pairs without first checking them all?

    Two Ways We Approached Graph Generation

    To answer that question, we implemented and compared two graph generation strategies.

    1. Batched brute-force pair classification

    A truly naive strategy would have been to ask the LLM about every single attribute pair one by one, but we did not test that because it is clearly too inefficient to be practical.

    Instead, for each valid cross-modality combination, we selected one primary attribute and asked the LLM to classify its relationship to a batch of up to 25 target attributes as positive, negative, or neutral.

    The batch size of 25 was chosen deliberately:

    Prior work shows that batch size affects LLM classification quality: larger batches are more efficient, but can reduce consistency across judgments. We therefore set the batch size as a practical trade-off between efficiency and quality.

    This gave us a strong reference point: broad coverage with a simple implementation, useful for evaluating whether a more efficient method could preserve similar graph quality without the same cost.
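    The batching logic itself is simple. The sketch below shows its shape, with a hypothetical llm_classify callable standing in for the actual LLM call:

```python
def batches(items, size):
    # Split a list into consecutive chunks of at most `size` items.
    for i in range(0, len(items), size):
        yield items[i:i + size]

def classify_pairs(primary, targets, llm_classify, batch_size=25):
    """Classify one primary attribute against all targets in batches.
    `llm_classify(primary, batch)` is a hypothetical stand-in that returns
    one label per target: 'positive', 'negative', or 'neutral'."""
    labels = {}
    for batch in batches(targets, batch_size):
        for target, label in zip(batch, llm_classify(primary, batch)):
            labels[(primary, target)] = label
    return labels
```

    With 60 target attributes and a batch size of 25, one primary attribute costs three LLM calls instead of sixty.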

    2. Cluster-first graph generation

    The second approach was designed to reduce the search space before asking the LLM to score anything.

    Instead of classifying every attribute pair directly, we first:

    • embedded the attributes and applied UMAP for dimensionality reduction,
    • clustered them by modality using HDBSCAN,
    • asked the LLM to batch score compatibility between clusters,
    • discarded neutral cluster pairs and their attribute combinations,
    • automatically assigned scores to attribute combinations derived from high-confidence cluster pairs,
    • and asked the LLM to batch classify only the remaining attribute pairs.

    This turned a very large search space into a much smaller one, so the LLM spent time only where useful signals were more likely to exist.

    For small catalogues, the efficiency gains are smaller because many attributes end up as singleton clusters, but the same architecture still applies.
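    A simplified sketch of the cluster-level filtering step, for two modalities, might look like the following. The clustering itself (UMAP + HDBSCAN in our pipeline) and the LLM calls are abstracted behind stand-in callables, so every name here is illustrative:

```python
def cluster_first_edges(clusters_a, clusters_b, score_cluster_pair, score_attr_pairs):
    """Sketch of cluster-first edge generation between two modalities.
    clusters_*: {cluster_id: [attributes]}.
    score_cluster_pair(ca, cb) -> (sign, confident) is a stand-in for an
    LLM batch call at the cluster level; score_attr_pairs(pairs) -> {pair: sign}
    is a stand-in for attribute-level LLM classification."""
    edges = {}
    for ca, attrs_a in clusters_a.items():
        for cb, attrs_b in clusters_b.items():
            sign, confident = score_cluster_pair(ca, cb)
            if sign == 0:
                continue  # neutral cluster pair: discard all its attribute pairs
            pairs = [(a, b) for a in attrs_a for b in attrs_b]
            if confident:
                # High-confidence cluster verdict propagates to every pair.
                edges.update({p: sign for p in pairs})
            else:
                # Otherwise fall back to attribute-level classification.
                edges.update(score_attr_pairs(pairs))
    return edges
```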

    What Happened When We Compared Them

    On a larger catalogue of 160 attributes — 60 brands, 60 audiences, 10 platforms, and 30 countries — the cluster-based approach performed much better operationally.

    Compared with brute force, it delivered:

    • 53% fewer LLM calls
    • 50.5% less execution time
    • 90.6% of the total edge volume retained

    More importantly, where both methods produced an edge for the same pair, they agreed on the sign 98% of the time. This shows that the cluster-based approach is not systematically changing the meaning of the relationships it recovered.

    The main trade-off was coverage: some pairs found by brute force were filtered out before attribute-level scoring, likely around lower-signal or more borderline cases.

    In practice, this gave us a much cheaper way to generate the graph while preserving the compatibility signal that mattered most.

    The scaling advantage becomes even clearer when projected to larger catalogues:

    Catalog size              Total pairs    Brute-force LLM calls    Cluster-based LLM calls
    160 attributes            8,700          570                      265
    320 attributes (2×)       ~34,800        ~2,280                   ~750
    800 attributes (5×)       ~217,500       ~14,250                  ~2,960
    1,600 attributes (10×)    ~870,000       ~57,000                  ~8,400

    * These are directional estimates extrapolated from the 160-attribute experiment. Actual call volumes will vary with catalogue structure, clustering behavior, and graph densities.

    From Graph to Actual Training Data

    Once we had a signed graph, the next problem was turning it into an actual labelled campaign performance dataset.

    Each row in this dataset represents one synthetic campaign configuration (a combination of attributes drawn from the graph) along with a performance label: pos, neg, or avg. That label is the training target. It describes whether the overall campaign combination is expected to perform well, underperform, or land somewhere in between.

    Important note: The label is not the same as a graph edge. Edges score pairs of attributes; the label scores the whole configuration, aggregated across the signs of all its edges.

    Figure 1 – Example of a row from the campaign performance dataset

    This dataset is the output of the second service in the pipeline: the Synthetic Dataset Generator. Its job is to create synthetic campaign records from the graph while respecting configurable constraints such as:

    • how many attributes of each type should appear in each sample,
    • how many positive, negative, and average examples to produce,
    • and what proportion of positive vs. negative edges each label class should contain.

    For example, a positive sample might require a relatively high fraction of positive edges and a low fraction of negative ones. A negative sample would do the opposite, while an average sample would contain more balanced fractions of both.
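    As a sketch, the aggregation from edge signs to a campaign-level label can be as simple as thresholding the positive and negative fractions. The threshold values here are illustrative; the real generator reads its fraction constraints from configuration:

```python
def label_configuration(edge_signs, pos_min=0.6, neg_min=0.6):
    """Aggregate pairwise edge signs (+1 / -1 / 0) into a campaign label.
    Hypothetical thresholds: a 'pos' sample needs a high fraction of
    positive edges, a 'neg' sample the opposite, 'avg' sits in between."""
    if not edge_signs:
        return "avg"
    pos_frac = edge_signs.count(+1) / len(edge_signs)
    neg_frac = edge_signs.count(-1) / len(edge_signs)
    if pos_frac >= pos_min:
        return "pos"
    if neg_frac >= neg_min:
        return "neg"
    return "avg"
```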

    That gave us complete control of the dataset. The same graph could generate multiple datasets (with different class balances, difficulty levels, noise profiles, and schemas), just by changing configuration, not rebuilding the pipeline.

    Simulated annealing: searching the graph efficiently

    To find valid combinations for each dataset row efficiently, we used a parallelized simulated annealing sampler. The name comes from metallurgy, where a material is heated and then cooled in a controlled way to reduce defects and settle into a more stable structure.

    Our algorithm follows the same idea. It starts in a “hot” state, exploring many possible campaign configurations and even accepting imperfect ones early on. As it cools, it becomes more selective, swapping attributes in and out until each sample settles into a configuration that satisfies the requested constraints.
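    A minimal version of that idea, with an energy function counting constraint violations, looks like this (all names and parameters are illustrative, not the production sampler’s API):

```python
import math
import random

def anneal_sample(attrs_by_type, energy, steps=2000, t0=1.0, cooling=0.995):
    """Toy simulated-annealing sampler: start 'hot', accepting even worse
    swaps, then cool and grow more selective. `energy` counts constraint
    violations, so lower is better."""
    config = {t: random.choice(vals) for t, vals in attrs_by_type.items()}
    e = energy(config)
    best, best_e = dict(config), e
    temp = t0
    for _ in range(steps):
        t = random.choice(list(attrs_by_type))
        candidate = dict(config)
        candidate[t] = random.choice(attrs_by_type[t])   # propose a swap
        delta = energy(candidate) - e
        # Accept improvements always; accept worse moves with a probability
        # that shrinks as the temperature cools.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            config, e = candidate, e + delta
            if e < best_e:
                best, best_e = dict(config), e
        temp *= cooling
    return best, best_e
```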

    Downstream Impact and ML Experiments Unlocked

    This service was not just a technical exercise. Its purpose was to unblock machine learning workstreams while real campaign data was still limited, not ready, or missing key combinations. Without it, multiple experiments would have been blocked.

    The Synthetic Dataset Generator produced 49 synthetic datasets, built from multiple graph versions and configurations. Those datasets were used to both train and stress-test models across different teams and modelling approaches. Each dataset varied in class balance, difficulty, and noise to probe how models behaved under pressure. Experiments included:

    • Campaign performance prediction
    • Federated learning experiments
    • Architecture search and model benchmarking
    • Comparisons between fine-tuned LLMs and custom classifiers

    We also built a shared model leaderboard so teams could compare results across dataset versions and training approaches without manual coordination.

    That created a common experimental foundation before real data was fully ready.

    What Synthetic Data Did (and Didn’t) Solve

    Synthetic data was an accelerator, not a replacement for real data.

    It let us:

    • start ML experiments earlier,
    • benchmark model architectures,
    • explore dataset schemas,
    • test class balance and difficulty settings,
    • and support teams that otherwise would have had to wait

    But it also has several limitations:

    The biggest one is that graph edges are still inferred, not directly validated against large-scale real campaign outcomes. We verified obvious cases, but many of the more ambiguous relationships remain assumptions generated from LLM reasoning rather than empirical evidence.

    References

    Van Can, A. T., Aydemir, F. B., & Dalpiaz, F. (2025). One size does not fit all: On the role of batch size in classifying requirements with LLMs. In Proceedings of the 2025 IEEE 33rd International Requirements Engineering Conference Workshops (REW 2025) (pp. 30–39). IEEE.

    Tam, Z. R., Wu, C.-K., Tsai, Y.-L., Lin, C.-Y., Lee, H.-Y., & Chen, Y.-N. (2024). Let me speak freely? A study on the impact of format restrictions on performance of large language models. arXiv:2408.02442. https://doi.org/10.48550/arXiv.2408.02442

    Delahaye, D., Chaimatanan, S., & Mongeau, M. (2019). Simulated annealing: From basics to applications. In M. Gendreau & J.-Y. Potvin (Eds.), Handbook of Metaheuristics (Vol. 272, pp. 1–35). Springer. https://doi.org/10.1007/978-3-319-91086-4_1

  • Why your data genome may need a check-up – and how a Data Discovery Agent can help


    We collected every sample. We sequenced nothing.

    The story of big data is, at its core, a twenty-year experiment in accumulation. Google’s landmark papers on the Google File System (2003) and MapReduce (2004) showed the world how to store and process data at unprecedented scale [1] [2]. The Hadoop ecosystem followed. Then the cloud era – BigQuery, Redshift, Snowflake – made collection cheaper, faster, and infinitely more elastic. The message was unambiguous: store everything, figure it out later.

    And store they did. IDC projects that global data creation will grow at roughly 25% compound annual rate through 2028 [3]. Companies rushed to stake claims on every click, every impression, every conversion, convinced the data itself was the gold. But gold in the ground is worthless without extraction, refining, and assay. Most enterprises skipped those steps. Industry research finds that up to 90% of generated data remains unused, a phenomenon analysts have coined “dark data” [4]. The storage invoice arrives monthly. The insight dividend has not been declared yet [5] [6].

    Enter the paradigm shift. Large Language Models (LLMs) – GPT-3 (2020), ChatGPT (2022), Google’s Gemini (2023-2024) – represent the first technology that can read data at the semantic level [7], understanding meaning rather than merely executing queries. A data warehouse full of unanalysed tables is, in many respects, like an unsequenced genome. The information is all there – every column a base pair, every table a chromosome – but without annotation, it’s just a very expensive string of letters.

    Enterprise data is at that same inflection point. The sequencer has arrived. The genome is finally being read.


    2,709 base pairs. No annotation. Good Luck, Have Fun, Don’t Die.

    Every ad-tech data team carries an invisible tax: the hours, the errors, and the opportunity cost of manually reconciling platform schemas that were never designed to talk to each other. This is not a one-time project either but a recurring levy. Every time a platform updates its API, every time a new data source is onboarded, every time someone asks “do we even have geo data from Pinterest?” – the tax collector comes knocking.

    At the heart of our work sits a centralised advertising data warehouse that aggregates campaign performance, creative assets, audience signals, geographic breakdowns, and brand metadata from every major digital advertising platform our organisation operates on. It is, in effect, the single source of truth for understanding how creative content performs across the entire digital media landscape. It is also, as we were about to discover, a genome that had never been sequenced.

    We operate across fourteen advertising platforms – spanning major social, search, programmatic, and measurement partners. Each has its own schema. Facebook calls advertising expenditure amount_spent. Google calls it cost_micros (and means it, in millionths of a currency unit). TikTok simply says spend. They all mean roughly the same thing, but to a database and to a human analyst trying to build a cross-platform report, they might as well be different species encoding the same protein with entirely different codons (/ˈkəʊdɒn/ – three-letter DNA sequences that each specify the same amino acid, just spelled differently by each organism).

    Data integration platforms and transformation layers help – they abstract some of the raw API complexity and reshape the data before it lands in the warehouse. In theory, these layers converge towards clarity. In practice, they shift the problem rather than solve it. By the time data reaches the staging tables analysts actually query, the original platform semantics have been filtered through multiple translation layers – each managed by a different team, each with its own conventions. The result is not less complexity; it is distributed complexity.

    Now scale the problem. Our Google BigQuery data warehouse spans 179 tables and 2,709 columns across fourteen platforms, holding over 15.8 billion records. And those columns need to be understood by meaning, not merely by name.

    One downstream use case requires them to be mapped to a canonical schema organised around five modalities – high-level categories that describe what a piece of advertising data is about:

    • Performance – the numbers that tell you how a campaign did: spend, impressions, clicks, video views, conversions, leads.
    • Creative – what the ad actually looked like: creative IDs, asset paths, ad names, format types.
    • Audience – who saw it: gender, age group, interests, custom audiences, behavioural segments.
    • Geo – where they saw it: countries, regions, cities, postal codes, designated market areas.
    • Brand – who paid for it: brand name, advertiser identity.

    Each modality breaks down further into 24 sub-modalities – the individual genes in our metaphor. That’s 2,709 potential annotations to make, verify, and maintain.
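    For the programmatically inclined, the canonical schema can be pictured as a nested structure – modalities mapping to their sub-modality “genes”. The sub-modality names below are an illustrative sketch, not the full 24-entry taxonomy:

```python
# Illustrative sketch of the canonical schema: five modalities, each broken
# into sub-modalities. Names here are examples; the real taxonomy has 24.
CANONICAL_SCHEMA = {
    "performance": ["spend_usd", "impressions", "clicks", "video_views", "conversions", "leads"],
    "creative": ["creative_id", "asset_path", "ad_name", "format_type"],
    "audience": ["gender", "age_group", "interests", "custom_audience", "behavioural_segment"],
    "geo": ["country", "region", "city", "postal_code", "dma"],
    "brand": ["brand_name", "advertiser_id"],
}

# Flatten into fully qualified sub-modality keys, e.g. "performance__spend_usd".
SUB_MODALITIES = [
    f"{modality}__{sub}"
    for modality, subs in CANONICAL_SCHEMA.items()
    for sub in subs
]
```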

    A skilled data analyst doing this manually needs to: open each table schema, read every column name and type, pull sample data to disambiguate, decide which canonical sub-modality it maps to, document the mapping, and then repeat thousands more times – praying that nothing changed since they started.

    To make that concrete: imagine staring at a table with columns ad_id, date_start, spend, inline_link_clicks, cpc, frequency, reach, account_currency. Is spend total or daily? Is cpc cost-per-click or cost-per-conversion? Does reach mean unique users or total impressions? You run a SELECT * LIMIT 10, squint at the numbers, cross-reference the API docs, and after twenty minutes you’ve mapped eight columns from one table. Only 2,701 to go – across thirteen more platforms.

    Conservatively, that’s two to four weeks of focused work for a single pass. And the result starts decaying immediately. Platforms update schemas quarterly, sometimes monthly. By the time the spreadsheet is “done,” it’s already wrong.

    If you’ve ever opened a file called FINAL_mapping_v3_actually_final.xlsx, this section is for you.

    In genomic terms: it’s annotating a genome by hand, one base pair at a time, with no automated sequencer and no reference genome. The Human Genome Project proved you can do it that way. It just took thirteen years and $2.7 billion. We wanted something faster.


    The sequencer has entered the lab

    What used to take a team of data analysts two to four weeks of grinding manual work (opening schemas, pulling samples, cross-referencing documentation, writing mappings into a spreadsheet) now runs in hours. The Data Discovery Agent processes all fourteen platforms, all 179 tables, all 2,709 columns, and delivers a fully annotated mapping with confidence scores.

    Figure 1 – Manual mapping workflow vs. Agent pipeline: weeks of spreadsheet wrangling compressed into a single automated run.

    The Data Discovery Agent is not a chatbot, nor a single prompt you paste into ChatGPT and hope for the best. It is a multi-stage autonomous platform that orchestrates a complete end-to-end pipeline of discovery, sampling, reasoning, and reporting with minimal human intervention. You point it at a data warehouse, define your canonical schema, and it does the rest: connecting to BigQuery, fetching every table and column, extracting sample rows for grounding, crawling external documentation for cross-reference, building structured prompts, invoking LLM inference, parsing the results, scoring confidence, calculating completeness, and rendering the whole thing in an interactive dashboard. The human’s job shifts from doing the mapping to reviewing the mapping: approving high-confidence matches, investigating ambiguous ones, and making strategic decisions about data gaps.

    So, what is an agent?

    In the AI and machine learning community, an agent is formally defined as a system that:

    (1) perceives its environment,

    (2) reasons about what actions to take,

    (3) executes those actions autonomously, and

    (4) iterates towards a goal – often using tools and external data sources along the way.

    Figure 2 – A simple visual of how an AI agent operates: Perceive → Reason → Act → Iterate.

    By that definition, our Data Discovery Agent is genuinely an agent: it perceives the warehouse (schema fetching), reasons about semantics (LLM inference with structured prompts), takes action (mapping columns, scoring confidence, crawling documentation), and iterates (batch processing across platforms and modalities, second-pass enrichment recommendations). This is far more than a single LLM call wrapped in a REST endpoint. It’s an orchestrated pipeline of perception-reasoning-action loops.

    Yes, the word “agent” has been stretched to meaninglessness in 2025-2026 – every API wrapper calls itself one. We use it deliberately, with the formal definition above, because our system genuinely exhibits autonomous multi-step goal-directed behaviour. If your “agent” is a prompt template with a for-loop, “script” might be a more honest label.


    Inside the sequencer: how we taught an LLM to read a data warehouse

    Teaching an LLM to read a data warehouse is not as simple as dumping a schema into a prompt. Column names are ambiguous. Data types are insufficient. Context is everything. Our pipeline gives the model maximum context at every step – schema structure, real sample values, external documentation, and a precisely defined target taxonomy – so that its annotations are grounded, not hallucinated. In genomic terms, this is the equivalent of handing a bioinformatician the raw sequence alongside the reference genome, the medical textbook, and a clear checklist of which genes to look for.

    The following five stages take us from raw warehouse to annotated genome.

    Stage 1 – Discovery

    The agent connects to Google BigQuery and discovers all staging datasets – in our case, fourteen platforms matching a configurable naming pattern. For each dataset, it fetches the full table inventory: column names, data types, row counts. Intelligent filters exclude temporary artefacts (tables prefixed with stg_, suffixed with _tmp or _dbt_tmp) – the genomic equivalent of filtering out sequencing adapters before analysis.
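    In code, the artefact filter is little more than a prefix/suffix check. A simplified sketch (the real agent applies it while walking the BigQuery table inventory; the table names are hypothetical):

```python
def is_artefact(table_name: str) -> bool:
    """True for temporary artefacts excluded from the inventory: tables
    prefixed with stg_ or suffixed with _tmp / _dbt_tmp."""
    return (
        table_name.startswith("stg_")
        or table_name.endswith("_tmp")  # also catches the _dbt_tmp suffix
    )

# Hypothetical table names for illustration.
tables = ["facebook_ads_insights", "stg_facebook_ads", "tiktok_spend_tmp", "google_ads_campaign"]
kept = [t for t in tables if not is_artefact(t)]
# kept == ["facebook_ads_insights", "google_ads_campaign"]
```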

    Stage 2 – Sampling

    Column names alone are often ambiguous. The agent resolves this by extracting actual sample data rows from each table, using background-threaded parallel queries. These real values become critical evidence for the LLM: seeing values like 14.50, 0.83, and 127.99 in a cost column strongly suggests monetary spend, not an ID field. Column names are the labels on the test tubes; sample data is what’s actually inside them.

    Stage 3 – Documentation crawling

    The agent crawls the official connector documentation for each platform – the authoritative “Most Used Fields” pages published by our data integration provider – and extracts a structured field list: name, description, dimension or metric. This gives the agent a reference genome to cross-reference against what actually exists in the warehouse: the platform’s own user manuals turned into a checklist of expected signals.

    Stage 4 – LLM inference

    This is the core intellectual step. The agent builds versioned prompts containing the table schemas, sample data, and the target modality definition. These are sent to Google’s batch inference service (Vertex AI). The model returns structured JSON: for each canonical sub-modality, it identifies matching columns, assigns a confidence score (0.0–1.0), and provides natural-language reasoning for each match. This is the sequencing run itself – the machine reading every base pair and producing an annotated readout with quality scores.
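    To make that concrete, here is what such a structured response might look like, along with the basic validation the parser applies. The field names are an illustrative sketch, not the exact production contract:

```python
import json

# Hypothetical model output for one sub-modality (illustrative field names).
raw = """
{
  "sub_modality": "performance__spend_usd",
  "matches": [
    {"table": "facebook_ads_insights", "column": "amount_spent",
     "confidence": 0.94, "reasoning": "Monetary sample values; name matches platform docs."},
    {"table": "google_ads_campaign", "column": "cost_micros",
     "confidence": 0.88, "reasoning": "Millionths of a currency unit, per Google Ads docs."}
  ]
}
"""

result = json.loads(raw)
# Minimal validation: every match needs a confidence score in [0.0, 1.0].
assert all(0.0 <= m["confidence"] <= 1.0 for m in result["matches"])
```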

    Stage 5 – Enrichment and dashboard (the genome report)

    A second-pass “Enrichment Recommender” inference cross-references the warehouse mappings from Stage 4 against the connector field lists from Stage 3. For every sub-modality where warehouse data is incomplete, it recommends which connector fields to enable – closing the loop from “what do we have?” to “what should we turn on?”

    All results render in a FastAPI + Jinja2/HTMX dashboard: the karyotype (/ˈkariə(ʊ)tʌɪp/ – the number and visual appearance of the chromosomes in the cell nuclei of an organism or species), the gene annotation browser, and the clinical recommendations in one interface.


    The genome report: what 2,709 annotations revealed

    When the sequencing run completes, you don’t get a wall of text. You get a genome report in the shape of a visual, interactive, filterable dashboard that tells you, at a glance, which parts of your data genome are fully annotated and which have gaps.

    Figure 3 – Platform completeness scorecards: the karyotype of our data warehouse at a glance.

    Each platform receives a “completeness” score: how much of the canonical schema is actually present. The approach is deliberately simple – a presence/absence assay. For each platform, the denominator is 24 (total sub-modalities). The numerator is how many have at least one column mapped above a configurable confidence threshold. A sub-modality is either present or it isn’t, like a gene that’s either expressed or silent. This also handles deduplication: if three tables each have a spend column mapped to performance__spend_usd, the sub-modality counts once, not three times. The formula cares about breadth of coverage, not depth of redundancy.
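    The assay translates into a few lines of code. A sketch with illustrative mappings and a made-up threshold:

```python
def completeness(mappings, threshold=0.7, total_sub_modalities=24):
    """Presence/absence assay: a sub-modality counts once if at least one
    column maps to it above the confidence threshold, however many tables
    carry it."""
    covered = {m["sub_modality"] for m in mappings if m["confidence"] >= threshold}
    return len(covered) / total_sub_modalities

# Three tables mapping spend to the same sub-modality still count once.
mappings = [
    {"sub_modality": "performance__spend_usd", "confidence": 0.94},
    {"sub_modality": "performance__spend_usd", "confidence": 0.91},
    {"sub_modality": "performance__spend_usd", "confidence": 0.88},
    {"sub_modality": "geo__country", "confidence": 0.40},  # below threshold
]
score = completeness(mappings)  # 1/24 ≈ 0.042 – breadth, not redundancy
```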

    Figure 4 – The interactive dashboard: Submodality matching recommendation cards

    Across all fourteen platforms, average mapping completeness stands at 58%. The top performer reached 75% (18/24 sub-modalities covered), while the platform with the greatest room for improvement sat at 25% (6/24). Four platforms cluster at 70%, several sit in the 50–66% range, and a handful fall below 50%.

    Some findings were pleasant surprises – platforms assumed to lack audience data turned out to carry age range and gender columns that mapped cleanly. The data was there all along; it just hadn’t been annotated. Other findings were equally valuable: Performance metrics (spend, impressions, clicks) are well-represented across the board, but Audience, Geo, and Brand modalities show significant gaps, particularly in measurement and programmatic categories.

    At 58%, the genome is more than half-sequenced – but meaningful blind spots remain. That’s a useful headline. It is unequivocally better to know your genome’s current state than to assume it’s healthy without running the test. The scorecard turns vague unease into actionable intelligence (“We’re missing geo data on most platforms, and here’s exactly which fields to enable”). It also delivers a reality check: the foundation is stronger than expected in some areas and weaker than assumed in others. But now we know exactly where to invest.

    The Enrichment Recommender closes the diagnostic loop. Rather than simply flagging missing geo data, it prescribes a fix: “The connector for this platform offers a field called country_code (dimension); enable it, and your geo coverage improves.” Diagnosis and prescription in one step, and a natural bridge to the next phase of our work.


    What we’re building now

    The genome report is valuable on its own. But a sequencer that only reads and never acts is leaving half the value on the table. We are currently working on three initiatives that extend the agent from discovery into action; from reading the genome to editing it.

    1. Productionalisation – from lab bench to clinical practice

    Today the agent runs as an internal tool, triggered manually. We are moving it to production-grade: CI/CD pipelines, scheduled re-scans (daily, weekly, on-schema-change), and automated alerting when a platform’s schema mutates. The genomic equivalent of detecting a mutation before it causes problems downstream. A sequencer that runs itself.

    2. Automatic warehouse verification – is the gene actually expressed?

    The Enrichment Recommender tells you which fields exist in the documentation. But existence does not mean expression. A field can be listed, enabled in the connector, and still arrive empty because the advertiser never populated it.

    We are building a verification layer that connects to the data integration platform’s API, pulls each connector’s field list, and checks the warehouse: Is this column actually populated? What percentage of rows have non-null values?

    In DNA terms: it’s one thing to know a gene exists in the reference genome; it’s another to confirm it’s expressed in your organism. The practical outcome: instead of blind recommendations, the agent will say “Enable country_code – 87% of advertisers populate it” versus “dma_region has only 3% non-null rows; probably not worth the effort.”
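    As a sketch, the expression check reduces to a fill-rate calculation – shown here over in-memory rows for illustration; the production version runs as a COUNT query against BigQuery, and the column names are hypothetical:

```python
def fill_rate(rows, column):
    """Share of rows where `column` is actually populated (non-null, non-empty)."""
    values = [row.get(column) for row in rows]
    populated = [v for v in values if v not in (None, "")]
    return len(populated) / len(values) if values else 0.0

# Hypothetical sampled rows for one platform.
rows = [
    {"country_code": "US", "dma_region": None},
    {"country_code": "BR", "dma_region": None},
    {"country_code": "IT", "dma_region": "501"},
    {"country_code": None, "dma_region": None},
]
fill_rate(rows, "country_code")  # 0.75 – worth enabling
fill_rate(rows, "dma_region")    # 0.25 – probably not worth the effort
```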

    3. Text-to-SQL – chatting with the genome

    Discovery answers “what data do we have?” The next question is always “what does the data say?” We are building a natural-language-to-SQL layer: ask “What was our total spend across social platforms last quarter?” and the agent translates it to a BigQuery query – using the column mappings it already knows – executes it, and returns a human-readable answer with the underlying SQL for transparency.

    This positions the platform as a full conversational data layer: discover the genome, annotate it, score completeness, recommend enrichment, and then interrogate it in natural language. From specimen collection to clinical consultation, without leaving the lab.


    The genome is sequenced. The real work begins.

    For two decades, the advertising industry collected data the way early geneticists collected specimens: enthusiastically, comprehensively, and with very little annotation. We built warehouses the size of genomes and staffed them with analysts armed with spreadsheets – the biological equivalent of a magnifying glass and a very long afternoon.

    The Data Discovery Agent is what happens when you finally bring a sequencer into the lab. It reads every base pair, annotates every gene, scores every alignment, flags every gap, and delivers a genome report that turns “we have data” into “we understand our data.” And unlike the Human Genome Project, it does not take thirteen years. It takes hours.

    The genome is sequenced. The annotations are in. The gaps are mapped. Now the real work begins: not the tedium of cataloguing, but the science of insight. And if the sequencer keeps getting smarter (spoiler: it will), the next chapter will be about more than reading the data. It will be about having a conversation with it.


    Ready to look under the hood? See Data Discovery Agent – Technical Walkthrough for the full engineering deep-dive – architecture, five-stage pipeline, Vertex AI inference, FastAPI deployment on Cloud Run, and code-level detail.


    1. Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google File System. Proceedings of the 19th ACM SOSP.
    2. Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Proceedings of OSDI ’04.
    3. IDC, Worldwide Global DataSphere Forecast 2023-2028.
    4. Splunk dark data survey, summarised in “Dark Data: Unlocking the ~90% You Don’t Use” (2025).
    5. McKinsey & Company, “In search of cloud value: Can generative AI transform cloud ROI?” (2023).
    6. Forbes, “From Data Hoarding To Data Strategy, Building AI That Actually Works” (2025).
    7. Forrester, AI, data, and analytics predictions (2024).

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.

  • From guesswork to foresight: How AI is predicting the future of marketing campaigns

    Ever wonder why some advertisements seem to pop up exactly when you’re thinking about buying something, while others feel completely irrelevant? Or how a brand knows just the right message to share to get your attention? The answer lies in the evolving world of marketing campaigns, and increasingly, in the powerful capabilities of Artificial Intelligence (AI).

    But what is a marketing campaign?

    At its core, a marketing campaign is a carefully planned series of activities designed to achieve a specific goal for a business – whether that’s selling more products, building brand awareness, or encouraging people to sign up for a service. Think of it like launching a rocket: you need to choose the right destination (your objective), design a powerful engine (your creative message), select the perfect crew (your audience), and pick the best launchpad (your platform).

    The process of creating and running these campaigns involves countless decisions, such as:

    • Audience: Who are we trying to reach? What are their interests, demographics, and behaviours?
    • Brand: What message do we want to convey about our brand? How does our brand resonate with the audience?
    • Creative: What kind of ads should we run? (text, images, videos, headlines, calls to action).
    • Objective: What’s the main goal? (e.g., getting clicks, making sales, increasing brand recognition).
    • Platform: Where should we run these ads? (e.g., Facebook, Instagram, Google Search, TV, billboards).

    Campaign design is a complex process shaped by multiple factors, such as creative genius, market insights, and domain expertise. Marketers launch campaigns, closely track their impact, and then adjust their approach in real time, refining messages or recalibrating target audiences. However, given the diverse and dynamic nature of consumer behavior, this iterative adaptation process can be taxing in terms of both budget and time. It’s like setting a rocket’s course: unforeseen atmospheric shifts can require significant mid-flight corrections, each consuming valuable resources.

    This is where the big challenge lies: how do we predict if a campaign will be successful before we invest significant time and money into it?

    Machine Learning: Your marketing crystal ball

    This challenge is precisely where Machine Learning (ML) steps in. Simply put, Machine Learning is a branch of AI that allows computers to “learn” from data without being explicitly programmed. Instead of following a strict set of rules, ML algorithms analyze vast amounts of past information, identify hidden patterns and relationships, and then use those learnings to make predictions or decisions on new, unseen data.

    In the context of marketing campaigns, ML becomes an incredibly powerful tool:

    • Data powerhouse: Imagine collecting every detail from thousands of past marketing campaigns: who saw them, what the ads looked like, where they were shown, how much they cost, and crucially, what the final outcome was (e.g., how many clicks, sales, or sign-ups they generated). ML algorithms can digest this colossal amount of data in seconds.
    • Pattern recognition: These algorithms don’t just store data; they look for correlations. Did campaigns with a specific type of image perform better with a certain age group? Does a particular headline style lead to more conversions on one platform versus another? ML can uncover these subtle yet powerful insights that human analysts might miss.
    • Predictive power: Once trained, an ML model can take the proposed details of a new campaign (e.g., its target audience, creative idea, intended platform) and predict its likely outcome. It can estimate click-through rates, conversion probabilities, or even the potential return on investment (ROI) before a single dollar is spent.

    The benefits are transformative: marketers can make data-driven decisions, allocate budgets more efficiently, target the most receptive audiences with precision, and ultimately, launch campaigns with a much higher probability of success. It’s like having a detailed weather forecast for your rocket launch, helping you choose the perfect day and trajectory.

    The multimodal challenge: Mixing apples, oranges, and billboards

    In reality, and contrary to what many might assume, a campaign isn’t just a neat row of numbers on a spreadsheet; it’s a vibrant, messy mix of text, images, locations, and abstract concepts like brand identity. This presents a fundamental challenge: how do we empower AI to not just process, but truly understand and effectively connect these inherently different types of information to form a holistic view? For instance, how can an AI understand the interplay between the nuanced visual cues of a video ad and the detailed socio-economic data of a specific target audience in a specific location?

    The “secret sauce” is a technology called embeddings. Think of an embedding as a universal translator. It takes complex information, like the “feeling” of a brand or the intent of a sentence, and turns it into a list of numbers that an algorithm can easily digest. However, every piece of the campaign puzzle requires a different translation strategy.

    Translating the campaign puzzle

    To build a complete picture, we process each element through a specialized lens:

    • Audience, Platform, and Objective: We convert these categories into numerical “flavours.” This allows the AI to recognise the distinct profile of, for example, an Instagram awareness campaign versus a search engine lead-generation tactic.
    • Brand identity: We leverage the fact that Large Language Models (LLMs) already possess a wealth of knowledge about established brands. By feeding the AI a rich, descriptive profile of a brand, we create a deep numerical representation of its identity. This task is so nuanced that it led to the birth of our Brand Perception Atlas Pod.
    • Creative (Images): A picture may be worth a thousand words, but our models currently prefer numbers. To bridge this gap, we use AI to extract a highly detailed description of each image, which is then translated into data. We quickly discovered that the quality of these descriptions depends entirely on the instructions given to the AI. This led us to develop the Self-improving AI Agent.
    • Geography: Location is more than just a pin on a map. To capture the true essence of a region, we use advanced models that go beyond coordinates. Specifically, Google’s PDFM (Population Dynamics Foundation Model) embeddings capture the social, economic, and demographic fabric of an area, providing the AI with the “soul” of a location rather than just its name.
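    To give a flavour of how these translations come together, here is a toy sketch: each element is turned into a vector by its own encoder, and the vectors are concatenated into one campaign representation. The encoder below is a deterministic placeholder, and every name and dimension is illustrative – this is not our production pipeline:

```python
import numpy as np

def toy_encode(text: str, dim: int) -> np.ndarray:
    """Placeholder embedding keyed on the input text (a stand-in for the real
    categorical, LLM, vision, and PDFM encoders)."""
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).standard_normal(dim)

campaign_vector = np.concatenate([
    toy_encode("audience: millennials 25-34", 8),      # categorical "flavour"
    toy_encode("brand: Brand B profile", 16),          # brand identity
    toy_encode("creative: outdoor sports image", 16),  # image description
    toy_encode("geo: São Paulo", 8),                   # location embedding
])
campaign_vector.shape  # (48,) – one numeric row the model can digest
```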

    Where does the data come from?

    Real-world marketing data is essential, but on its own it is not enough for AI research. At WPP, we combine rich, real-world data with carefully engineered synthetic data to build and evaluate models more effectively. Real data grounds our work in genuine market behaviour, complexity, and business context. Synthetic data adds something equally important: control. It allows us to create the specific conditions we need to properly challenge, probe, and improve our models.

    This matters because many of the scenarios that determine whether a model is truly robust are rare, emerging, or simply absent from historical records until the moment they become a real problem. To prepare for that, we deliberately generate datasets that introduce edge cases, shifting patterns, variable data volumes, heterogeneity, sparsity, and data drift. In other words, we use synthetic data to stress-test models in ways that real data alone cannot support, so they are more resilient, reliable, and ready for the real world.

    To address this, we built a Synthetic Data Generator. Think of this as a high-fidelity flight simulator for marketing. Instead of testing our models only on the limited “flights” we’ve taken in the past, this tool creates realistic, artificial campaign data. This allows us to:

    • Train with precision: We can create scenarios that haven’t happened yet to see how the AI reacts.
    • Test the limits: We can stress-test our models against extreme market conditions without any real-world risk.
    • Ensure safety: We can evaluate performance using high-quality data that carries none of the privacy concerns of personal information.
    • Hold the answer key: Because we generate this artificial data from scratch, we already know the exact outcome (the “ground truth”) of every scenario. It’s like giving our AI a test where we already hold the perfect answer key, allowing us to verify its predictions and recommendations.

    By “conjuring” this artificial data, we ensure our models are battle-tested and ready for the complexities of the live market.
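    A toy version of the idea: because we plant the winning pattern ourselves, every generated campaign comes with its ground-truth label attached. The field names and the success rule below are illustrative, not our actual generator:

```python
import random

PLATFORMS = ["instagram", "search", "tiktok"]
AUDIENCES = ["gen-z", "millennials", "gen-x"]

def generate_campaign(rng: random.Random) -> dict:
    campaign = {
        "platform": rng.choice(PLATFORMS),
        "audience": rng.choice(AUDIENCES),
        "budget": rng.randint(1_000, 100_000),
    }
    # The planted pattern IS the answer key: we know the ground truth
    # because we wrote the rule that produces it.
    campaign["label"] = (
        "good"
        if campaign["platform"] == "tiktok"
        and campaign["audience"] == "gen-z"
        and campaign["budget"] > 20_000
        else "average"
    )
    return campaign

rng = random.Random(42)  # seeded: the same "simulator flight" is reproducible
dataset = [generate_campaign(rng) for _ in range(1_000)]
```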

    From data to decisions: Empowering the expert

    We’ve explored the “ingredients” and the “recipe,” but what does this actually look like in the hands of a marketing expert? Our goal isn’t just to crunch numbers; it’s to provide actionable recommendations that make experts more efficient and their campaigns more successful.

    Imagine a strategist coming to the platform with a specific mission:

    I’m launching a campaign for Brand B, targeting Audience A in Location X, with the objective of Increasing Awareness. What is the best platform and creative style to use?

    To answer a question like this, we need more than a search engine; we need a Predictive Engine – our “crystal ball”.

    Before we can offer a recommendation, we must train a Machine Learning (ML) model to understand performance. We teach it to look at millions of historical and synthetic data points to predict an outcome: Is this specific combination of elements likely to be Good, Average, or Bad?

    There isn’t just one way to build this crystal ball. In our research, we explore a spectrum of algorithms, including both traditional models and modern techniques. Each approach offers its own set of advantages: some prioritise speed, while others prioritise pinpoint accuracy. By testing across this variety, we ensure that when an expert asks for a recommendation, the answer is backed by the most robust mathematical thinking available today.

    1. The reliable workhorse: LightGBM

    We started with a classic, high-speed approach called LightGBM. Think of this as a highly efficient logic tree. It’s fast, dependable, and excellent at spotting clear patterns in structured data. It serves as our “baseline”, the standard we aim to beat.

    2. The specialist team: Neural Networks

    Next, we built a more sophisticated system, based on Neural Network architectures, that works like a well-organized corporation. We divided the AI into two stages:

    • Specialized departments: Each type of data (like your brand identity or your creative images) is handled by its own “mini-expert” that decides which details are actually important.
    • The executive board: Once the experts have done their work, a central “manager” – a Multi-Layer Perceptron (MLP) – looks at all the reports together to make the final call: Will this campaign succeed?

    In this category, we have experimented with multiple architectures and techniques. For example, one of our best models, before making a prediction, mathematically groups elements that “belong” together. If a specific high-energy image consistently drives high success when paired with a young, active audience, the model learns to pull those winning pieces closer. This not only makes the model smarter but also helps us give you much better recommendations for future pairings.
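    The two-stage design can be sketched in PyTorch – per-modality “mini-expert” encoders feeding an MLP “executive board”. All names and dimensions below are illustrative, not our production architecture:

```python
import torch
import torch.nn as nn

class ModalityExpert(nn.Module):
    """A 'specialised department': compresses one modality's embedding into
    a short summary of what actually matters."""
    def __init__(self, in_dim, out_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(), nn.Linear(32, out_dim))

    def forward(self, x):
        return self.net(x)

class CampaignModel(nn.Module):
    """The 'executive board': an MLP over the concatenated expert summaries."""
    def __init__(self, modality_dims, n_classes=3):  # Good / Average / Bad
        super().__init__()
        self.experts = nn.ModuleDict(
            {name: ModalityExpert(dim) for name, dim in modality_dims.items()}
        )
        self.head = nn.Sequential(
            nn.Linear(16 * len(modality_dims), 64), nn.ReLU(), nn.Linear(64, n_classes)
        )

    def forward(self, inputs):
        summaries = [self.experts[name](x) for name, x in inputs.items()]
        return self.head(torch.cat(summaries, dim=-1))

dims = {"brand": 32, "creative": 64, "audience": 8, "geo": 24}
model = CampaignModel(dims)
batch = {name: torch.randn(4, dim) for name, dim in dims.items()}
logits = model(batch)  # shape (4, 3): per-campaign scores for Good/Average/Bad
```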

    3. The language experts: LLMs

    Finally, we tested whether a standard Large Language Model (like the ones used for chatbots) could do the job on its own. Interestingly, we found that “out-of-the-box” AI isn’t naturally great at these specific marketing predictions. However, with specialised training (a process called “fine-tuning”), performance skyrockets, as evidenced by our research: From hype to impact: Predicting campaign performance with fine-tuned LLMs.

    The verdict: Measuring impact

    To evaluate our models and determine how accurately they predict campaign performance, we must first establish a rigorous testing ground. This involves two key components: the diversity of our data and the precision of our metrics.

    Datasets

    To ensure our findings aren’t just a “lucky” outlier, we don’t rely on a single source of information. Instead, we test every model against three different versions of our synthetic datasets. By proving that our models can perform consistently across various simulated environments, we can be confident that their predictive power is both reliable and adaptable to real-world shifts.

    While each of our datasets shares a consistent structure, we have intentionally varied their internal characteristics to put our models through a rigorous stress test. By using our Synthetic Data Generator, we can precisely control three key variables to create progressively more challenging environments:

    • Volume: Testing how the models perform with both limited information and vast amounts of data.
    • Balance: Adjusting the “label distribution”. For example, creating datasets where “average” results are far more common than clear successes or failures, to reflect the reality of a crowded market.
    • Signal strength: Tuning how obvious or subtle the patterns are, which forces the models to work harder to find the winning combinations.

    This approach ensures that our models aren’t just memorizing easy patterns, but are truly learning to find value in complex, “noisy” environments where the right answer isn’t always obvious.

    Model performance

    When it comes to measuring performance, we use a standard industry benchmark known as the F1 score, because simple accuracy can be a liar. Imagine you have a box of 100 fruits: 10 are apples and 90 are oranges. You build a robot to grab only the apples. If the robot sits still and does nothing, it is technically “90% accurate” because it correctly ignored the 90 oranges, but it’s a total failure at its job. The F1 score exposes this by balancing two hidden grades:

    • Precision (the “quality” grade): When the robot grabs a fruit and says “Apple,” is it right? High precision means it never accidentally grabs an orange.
    • Recall (the “completeness” grade): Did the robot find all 10 apples, or did it leave some behind? High recall means the robot is thorough and doesn’t miss any.

    The F1 score combines these two into a single number (their harmonic mean). Unlike a normal average, it “punishes” extreme failure. If your robot is perfectly accurate but misses every single apple, its F1 score will be 0. This gives us a much more honest picture of how well a model actually works in the real world. To circle back to our case, we use the F1 scores in two ways:

    • The big picture: We report the Average F1 score across the entire dataset to show overall model health.
    • Performance by category: We break down results into three specific classes: Negative, Average, and Positive.

    This granular view is where the true business value lies. It allows us to ensure the model excels at the extremes, identifying the “Negative” combinations a marketer should avoid at all costs, and the “Positive” combinations that will truly drive results beyond the status quo.
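    The fruit-robot example above can be verified in a few lines of code, using the toy counts from the text:

```python
def precision_recall_f1(y_true, y_pred, positive="apple"):
    """Per-class precision, recall, and their harmonic mean (F1)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# The "lazy robot": 10 apples, 90 oranges, and it never grabs anything.
y_true = ["apple"] * 10 + ["orange"] * 90
y_pred = ["orange"] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
# accuracy is 0.9, yet precision, recall and F1 for "apple" are all 0.0
```

    The per-class version of this calculation is exactly what feeds the Negative, Average, and Positive columns in our results tables.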

    To bring structure to our innovation, we developed a centralized model Leaderboard. This platform serves as the definitive “source of truth” for our research team, ensuring that every breakthrough is measured against the same rigorous standards. The Leaderboard allows team members to download standardized training and testing splits for any dataset (whether real or synthetic) and submit their results for comparison. By centralizing our findings in one place, we achieve several key advantages:

    • True comparability: We can be certain that we are comparing equals across different algorithms and techniques.
    • Accelerated testing: It allows us to quickly and safely iterate on new ideas without reinventing the wheel.
    • Institutional knowledge: It creates a permanent record of our progress, ensuring that the best-performing models are always visible and ready to be deployed.

    This structured environment is what allows us to move from individual experiments to a scalable, high-efficiency engine for marketing AI.

    With our datasets defined and our Leaderboard in place, we put our models to the ultimate test. By measuring how each approach handled “Negative,” “Average,” and “Positive” campaign outcomes, we can clearly see which strategies offer the most reliable path to success.

    Here is a glimpse of how our top models performed across the board:

    | Dataset | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
    | --- | --- | --- | --- | --- | --- |
    | Small and easy | Tree-based | 85.11% | 86.4% | 80.21% | 88.73% |
    | Small and easy | Deep Learning (v1) | 86.26% | 89.81% | 82.51% | 86.44% |
    | Small and easy | Deep Learning (v2) | 89.47% | 91.25% | 85.95% | 91.22% |
    | Small and slightly noisy | Tree-based | 85.54% | 87.48% | 80.92% | 88.21% |
    | Small and slightly noisy | Deep Learning (v1) | 84.80% | 89.41% | 80.50% | 84.50% |
    | Small and slightly noisy | Deep Learning (v2) | 88.84% | 91.13% | 85.26% | 90.14% |
    | Big and slightly noisy | Tree-based | 77.40% | 76.40% | 81.47% | 74.34% |
    | Big and slightly noisy | Deep Learning (v1) | 80.74% | 80.88% | 86.10% | 75.25% |
    | Big and slightly noisy | Deep Learning (v2) | 81.29% | 82.25% | 84.58% | 77.03% |

    F1 score performance of top models (Tree-based, Deep Learning v1, and Deep Learning v2) across varied datasets

    Analysing the Leaderboard: Reliability at scale

    The results from our testing provide a clear picture of how these models handle real-world complexity.

    Our tree-based model remains a formidable workhorse, maintaining an overall F1 score above 85% on the two smaller datasets. Most importantly, it demonstrates high accuracy in identifying “Positive” and “Negative” outcomes. This means the model is exceptionally reliable at flagging the two things marketers care about most: which campaigns are likely to be massive successes and which ones are headed for failure. While performance naturally dips as we introduce more noise and scale into the datasets, its baseline remains impressively high.

    While the classic models are strong, both of our Deep Learning approaches consistently take the lead. These models perform better because of their inherent capacity for “relational intelligence”: they can spot the subtle, complex connections that simpler, logic-based systems often miss.

    As the datasets grow larger and the patterns become more “noisy,” this deep understanding becomes a critical advantage. Seeing these models maintain performance above the 80% mark, even in the most challenging scenarios, gives us confidence that our AI can handle high-complexity scenarios.

    The privacy puzzle: Learning without sharing

    While our research shows how powerful these models can be, a significant question remains: How do we build an elite AI that learns from everyone, without exposing anyone’s private data?

    In the traditional world of marketing, building a “super-brain” meant pooling all client data into one giant, central database. In today’s world, that is a massive privacy red flag. We believe you shouldn’t have to choose between competitive intelligence and data security. To solve this, we utilize a cutting-edge approach called Federated Learning (FL).

    Think of Federated Learning like a team of specialized doctors working in different hospitals. To find a cure for a new disease, they don’t send their private patient files to a central office; that would be a breach of trust. Instead, each doctor studies their own patients locally, discovers what works and what doesn’t, and then shares only the “recipe for the cure” with their colleagues, never the patients’ identities.

    In our ecosystem, each client trains the AI model locally on their own private data. The “lessons learned” are sent back to our main server, where they are combined to create a smarter, globally-informed model for everyone.
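    The server-side combination step can be sketched with the well-known Federated Averaging (FedAvg) scheme. The source does not specify WPP’s actual aggregation method, and the weight vectors below are toy numbers:

```python
def fedavg(client_weights, client_sizes):
    """Average client model weights, weighted by each client's data volume.
    Only weights travel to the server; raw campaign data never does."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * size for w, size in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Three clients train locally and share only their (toy) weight vectors.
global_model = fedavg(
    client_weights=[[0.2, 1.0], [0.4, 0.8], [0.6, 0.6]],
    client_sizes=[100, 100, 200],
)
# global_model is roughly [0.45, 0.75]: the client with more data pulls harder
```

    Weighting by data volume means clients with richer histories influence the global model more, while every client still benefits from the combined result.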

    The result? You benefit from a model that has “seen” millions of scenarios, yet your private data never leaves your hands.

    Discover more about Federated Learning by reading our post: Training together, sharing nothing: The promise of Federated Learning.

    Looking to the Future: The Next Frontier

    Our journey doesn’t end with a successful prediction. We are already exploring the next horizon of marketing intelligence, moving from understanding the past to actively designing the future. Here is what we are building next:

    1. Infusing data with “common sense”

    What if our models understood human psychology as well as they understand spreadsheets? We are exploring ways to inject the broad, contextual knowledge of LLMs directly into our training data. This gives our models a “common sense” layer, allowing them to understand the subtle cultural and social nuances that drive human behavior. Learn more about it in our post: The uncharted territory: Beyond the known data.

    2. AI building AI

    We believe the best architect for a complex model might be the AI itself. Instead of manually designing every layer of a neural network, we are using advanced systems to automatically discover the ultimate model structure for marketing predictions. This process of “digital evolution”, which we delve into further in our AlphaEvolve article, ensures our tech is always one step ahead.

    3. From predictions to proactive recommendations

    We are currently building a tool that doesn’t just predict success: it suggests it. Imagine entering your brand and target audience and having the AI instantly recommend the perfect visual or message. We are perfecting this using two unique methods:

    • The Matchmaker space: Utilizing the “relational map” we built in our earlier phases to instantly pair your audience with the creative assets they are most likely to love.
    • The “hot or cold” optimisation: We treat our AI like a high-precision compass. If you have a brand and an audience but are missing the right “creative,” the system rapidly tests thousands of variations. It plays a high-speed game of “hot or cold” until it locks onto the highest possible performance score.
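    The “hot or cold” search can be sketched as simple hill climbing over a creative’s feature vector. Everything here is an assumption for illustration: in practice the scoring function would be the trained prediction model, not this toy objective.

```python
import random

def score(creative):
    """Toy stand-in for the prediction model: closer to an (assumed)
    ideal feature vector means a higher score."""
    target = [0.7, 0.2, 0.9]
    return -sum((c - t) ** 2 for c, t in zip(creative, target))

def hot_or_cold(n_steps=5000, step=0.05, seed=0):
    """Randomly nudge the candidate; keep the nudge only when it gets 'warmer'."""
    rng = random.Random(seed)
    best = [rng.random() for _ in range(3)]
    for _ in range(n_steps):
        candidate = [min(1.0, max(0.0, b + rng.uniform(-step, step))) for b in best]
        if score(candidate) > score(best):  # warmer: keep it
            best = candidate
    return best

best_creative = hot_or_cold()
# best_creative ends up close to the (toy) ideal feature vector
```

    Thousands of such quick evaluations against the prediction model are what let the system lock onto a high-performing creative without exhaustively testing every option.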

    By moving from educated guesswork to advanced, multimodal AI, we are finally bridging the gap between creative intuition and measurable results. The rocket is fueled, the coordinates are set, and the launch sequence has officially begun.

    Ready to explore the specifics? Read our full technical deep dive into Multimodal Fusion Models for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.

  • The uncharted territory: Beyond the known data

    Our prior research has confirmed a fundamental truth: Machine Learning is exceptionally good at finding patterns in existing data. By analysing thousands of past campaigns, these models identify the threads of success and can predict outcomes for similar strategies with high reliability. This is an incredibly powerful tool for optimising what we already know.

    However, the real-world often presents us with a different challenge: The Unknown.

    • The data gap: While digital marketing generates vast amounts of information, it is rarely “clean” or perfectly integrated. Furthermore, when launching a new product or entering an entirely new market, historical data is often scarce or non-existent.
    • The novelty gap: Traditional Machine Learning excels at spotting correlations, but it can struggle with the unprecedented. What happens when a novel creative concept emerges, or a sudden social trend shifts audience behaviour overnight? Because the model hasn’t “seen” these shifts in the past, it may lack the context to predict the future.

    Bridging the gap: Can AI enhance our data?

    At the heart of our latest research is a fundamental question:

    What happens when we merge our proprietary data with the vast, world-level knowledge of a Large Language Model (LLM)?

    While “adding AI” is a popular trend, real business value isn’t a given. We set out to discover if an LLM acts as a true force multiplier that fills in missing pieces, or if it simply repeats what we already know, or worse, introduces “noise” that clouds our judgment. To find the answer, we tested two distinct strategies to enhance our historical campaign data:

    • Hybrid Graph creation: We build a digital “web” that connects our internal campaign facts with the LLM’s external context. This allows us to map out relationships between brands and audiences that our internal data alone might have missed.
    • Active Learning: Think of this as a focused “tutoring” session. We use AI to identify the most confusing parts of our data. By putting an LLM in the loop to address these specific gaps, the model learns exactly where it can provide the most clarity.

    By testing these methodologies, we are aiming to answer a critical, industry-defining question: Is the secret to superior performance simply more LLM integration, or does the true value still reside in the expert knowledge of marketing professionals that WPP has established over the years?

    The path forward: Testing the synergy

    To determine if AI-driven insights translate into real-world business value, we put two distinct methodologies to the test: Hybrid Graph creation and Active Learning. Each approach ensures that the LLM isn’t just a passive observer, but an active contributor to our strategy.

    To test this synergy, we utilised a specialised export from WPP’s proprietary dataset, ensuring anonymity and privacy. This data captures the full lifecycle of a campaign, including information on audience characteristics, geographical location and platform-specific delivery settings. Crucially, each entry includes a definitive label indicating the campaign’s objective and its final outcome, allowing us to measure success with high precision. For a deeper dive into the architecture and specific variables of this dataset, please refer to the Campaign Intelligence Dataset Pod.

    Strategy 1: Hybrid Graph creation – beyond the spreadsheet

    Instead of looking at data as a simple list, we treat it as a dynamic relationship map. Imagine this network as a constellation where every individual data point, whether it’s a demographic like ‘woman,’ a platform like ‘Facebook,’ or a region like ‘Spain’, becomes a node. These nodes are interconnected by lines that represent the strength and nature of their relationships.

    To visualise this concept, here is an example of how this relationship network is mapped in our Hybrid Graph:

    Hybrid Graph example: each node is an attribute from the real dataset. Nodes are positively connected (green) if they perform well together, negatively connected (red) otherwise. Unconnected nodes indicate an average or non-existent relation.

    By combining our internal campaign history with the LLM’s broader world knowledge, we create a “Hybrid Graph”. With this multi-layered map, we expect to be able to see connections that traditional spreadsheets ignore and the process of building it unfolds in three strategic phases:

    Phase 1: Establishing the ground truth

    Before we ask an AI for help, we perform a “Deep Dive” into our historical data to separate coincidence from repeatable success through two rigorous tests:

    • The consistency test (purity): We look for a clear “verdict”. For example, if a specific audience and brand pairing resulted in a positive outcome 85% of the time, we have a reliable pattern. The data is giving us a clear “Yes”.
    • The volume test (cardinality): Consistency only matters if it happens often. An 85% success rate observed across hundreds of campaigns is a statistically significant trend; the same rate across a handful of campaigns could simply be a stroke of luck.

    By filtering through these lenses, we identify the bedrock of our dataset.
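    The two Phase 1 filters can be sketched as follows. The 85% purity and 20-campaign volume thresholds are illustrative assumptions; the text does not state the actual cut-offs.

```python
from collections import defaultdict

def ground_truth_edges(campaigns, purity=0.85, min_count=20):
    """Keep attribute pairs that pass both the purity and cardinality tests.

    `campaigns` is a list of (attr_a, attr_b, outcome) tuples, with outcome
    either "positive" or "negative".
    """
    stats = defaultdict(lambda: [0, 0])  # pair -> [positive count, total count]
    for a, b, outcome in campaigns:
        stats[(a, b)][1] += 1
        if outcome == "positive":
            stats[(a, b)][0] += 1
    edges = {}
    for pair, (pos, total) in stats.items():
        if total < min_count:        # volume test (cardinality)
            continue
        rate = pos / total
        if rate >= purity:           # consistency test: a clear "Yes" (green edge)
            edges[pair] = "positive"
        elif rate <= 1 - purity:     # consistently bad: a clear "No" (red edge)
            edges[pair] = "negative"
    return edges

edges = ground_truth_edges(
    [("gen-z", "tiktok", "positive")] * 18
    + [("gen-z", "tiktok", "negative")] * 2
    + [("millennials", "linkedin", "negative")] * 19
    + [("millennials", "linkedin", "positive")]
    + [("boomers", "tiktok", "positive")] * 5   # too few samples to trust
)
```

    Only the two well-supported pairs survive; the low-volume pair is held back for the later phases.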

    Phase 2: The expert second opinion

    Next, we turn to the “Emerging Patterns”: combinations where the data shows a clear leaning (like a 60% success rate), but the evidence isn’t yet overwhelming. Historically, these “maybe” scenarios might have been ignored. Now, we invite the LLM to act as a Strategic Consultant.

    • The power of the upvote: When the LLM’s intuition aligns with our data’s hints, we gain a new level of confidence. For example, if a location and brand pairing resulted in a positive outcome 60% of the time, we suspect there is a pattern, but we need the LLM to confirm.
    • Validation through synergy: By getting a “Yes” from the AI to back up our data, we move these patterns from the “maybe” pile into our active knowledge base.

    Phase 3: Illuminating the “dark spots”

    Finally, we shift our focus to the areas our data couldn’t reach, the “dark spots”. These are combinations that were excluded because our data was too noisy or the scenarios were entirely new.

    We identify every combination where we currently lack confidence. However, this is a vast number of combinations, making it computationally infeasible to check all of them. That’s why we sample these gaps and ask the LLM for original insights based on its understanding of global markets. To give an example, the cases we’re targeting here look like: a specific audience and location pairing that is non-existent in the dataset, or a combination that results in a positive outcome half the time and a negative outcome the other half. Because of the lack of a clear pattern in the real data, we directly ask the LLM.

    In short, this allows us, through the LLM, to clarify noisy data and predict outcomes for entirely new scenarios.
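    The sampling step for the “dark spots” can be sketched like this; the attribute values and sample size are made up for illustration, and the actual LLM prompting is left out:

```python
import random

def sample_dark_spots(all_pairs, known_pairs, k=3, seed=0):
    """Draw a manageable sample of combinations we have no confidence about."""
    dark = [p for p in all_pairs if p not in known_pairs]
    rng = random.Random(seed)
    return rng.sample(dark, min(k, len(dark)))

audiences = ["gen-z", "millennials", "boomers"]
locations = ["milan", "sao-paulo", "austin"]
all_pairs = [(a, loc) for a in audiences for loc in locations]
known = {("gen-z", "milan"), ("millennials", "austin")}  # pairs with a clear signal

to_query = sample_dark_spots(all_pairs, known)
# three unseen combinations, ready to be sent to the LLM for a verdict
```

    Sampling keeps the LLM bill bounded: we query a representative slice of the dark spots rather than the full combinatorial space.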

    The challenge of illuminating our data “dark spots” led to a major breakthrough. Our first approach, the Hybrid Graph, essentially looks at low-signal campaign data and uses LLMs to make an educated guess to fill in the gaps. But this sparked a bigger, more strategic question: Instead of just guessing what’s in the dark, how can we actively hunt for the exact pieces of missing information that will make our predictions smarter? Out of millions of possible campaign combinations, how do we pinpoint the specific scenarios that will teach our model the most? This strategy of “smart hunting” forms the foundation of our second approach: Active Learning.

    Phase 4: Combining everything together

    The culmination of this research is a unified Hybrid Graph. By merging our proven history, our validated suspicions and our newly discovered insights, we create a living map of intelligence.

    The result is a specialised dataset that is expected to offer the best of both worlds:

    • The grounding of reality: Rooted in the hard facts of our actual campaign history.
    • The foresight of AI: Enhanced by the vast, contextual knowledge of the LLM.

    Strategy 2: Active Learning – solving the puzzle of uncertainty

    Where the Hybrid Graph fills gaps with new insights, Active Learning focuses on a different truth: data isn’t always helpful if it’s redundant. To truly advance our models, we don’t need more of what we already know; we need clarity in the “grey areas” of our knowledge.

    For example, imagine our data clearly shows that “TikTok campaigns” aimed at “Gen Z” are consistently successful, while “LinkedIn campaigns” aimed at “Millennials” usually underperform. But what happens if we want to run a “TikTok campaign” for “Millennials”? The model might be completely unsure if there is no clear pattern for that specific combination. Instead of analysing thousands more Gen Z campaigns we already understand, Active Learning specifically targets this exact missing combination. By resolving this one grey area, the model learns whether the platform or the audience age is the true driver of performance.

    In the world of data, this uncertainty occurs when there isn’t a strong, consistent signal: when parts of a dataset tell conflicting stories, making it difficult to separate real patterns from mere noise.

    In modern marketing, the number of possible combinations between audiences, brands and locations is astronomical; blindly analysing every single one would be incredibly slow, if not impossible. Instead, we use Active Learning as a strategic guide to identify the specific “pockets” of a dataset where our current models are struggling the most. It sifts through the records and picks only the most confusing, yet valuable, points for evaluation.
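    A common way to implement this selection step is uncertainty sampling. The sketch below assumes a model that returns probabilities over the three outcome classes, with toy numbers echoing the TikTok/Millennials example above:

```python
import math

def entropy(probs):
    """Higher entropy means the model is more unsure."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(candidates, predict_proba, k=1):
    """Pick the k scenarios whose predictions look most like a coin flip."""
    return sorted(candidates, key=lambda c: entropy(predict_proba(c)), reverse=True)[:k]

# Toy model: confident about known pairings, unsure about the novel one.
def toy_proba(scenario):
    table = {
        ("tiktok", "gen-z"): (0.05, 0.10, 0.85),          # clearly positive
        ("linkedin", "millennials"): (0.80, 0.15, 0.05),  # clearly negative
        ("tiktok", "millennials"): (0.33, 0.34, 0.33),    # genuinely unsure
    }
    return table[scenario]

candidates = [("tiktok", "gen-z"), ("linkedin", "millennials"), ("tiktok", "millennials")]
queries = most_uncertain(candidates, toy_proba)
# the grey area worth escalating: [("tiktok", "millennials")]
```

    Ranking by prediction entropy is one standard choice; margin- or confidence-based scores would serve the same purpose.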

    By focusing our efforts strictly on the areas where the model is most uncertain, we achieve two major goals:

    • Maximised intelligence: We gain the most knowledge from the fewest possible data points.
    • Operational speed: We bypass the “noise” of what we already know, allowing us to build high-performing models in a fraction of the time.

    Ultimately, this approach turns a daunting, “infinite” dataset into a manageable, high-impact asset.

    The LLM as our “oracle”

    Identifying the most uncertain points in our data is only half the battle; the real value lies in what we do with them. Once we have selected these high-priority “grey areas,” we bring in the LLM to act as our oracle.

    Using sophisticated prompting techniques, we present these uncertain points to the AI for a professional verdict. Our goal is to transform these pockets of doubt into certainty, backed by high-quality expert information.

    By doing this, we effectively bridge the “information gap”. We aren’t just adding more data for the sake of volume; we are harvesting targeted knowledge. This process turns a previously unknown variable into a strategic asset, ensuring that our final model isn’t just a reflection of what we’ve seen before, but a fusion of our experience and the AI’s broader market expertise.

    Two paths to higher intelligence

    To find the most efficient way to “teach” our models, we experimented with multiple different strategies for choosing which questions to ask our LLM oracle. Below, we outline our core foundational technique and the more advanced method that has proven to be our most effective to date.

    Approach 1: The broad search

    This is a high-level “scouting” mission. We create a large pool of random potential campaign scenarios and ask our current model to predict how they would perform. We then identify the scenarios where the model is the most confused, the “shaky” predictions, and send those directly to the LLM oracle for a definitive answer. It’s a fast, effective way to shore up general weaknesses in our knowledge.

    Approach 2: The targeted stress test (our top performer)

    Our most successful approach is much more surgical. Instead of looking at random scenarios, we actively look for the “tipping points”, the exact moment a campaign shifts from being a success to a failure, or vice versa.

    • Finding the edge: We take a known successful campaign and a known failure, then subtly blend their features to create a new, “borderline” scenario.
    • Measuring confusion: We keep adjusting the features until a pre-trained auxiliary model (in this case, a tree-based one) flips its prediction. We then rank and select the scenarios where the outcome is most uncertain, ensuring we capture the most informative data points for our oracle to review.
    • The expert verdict: We present these precise “tipping points” to the LLM oracle. By giving the AI specific examples of similar successes and failures as context, we get an incredibly high-quality label.
    • Iterative learning: Once the LLM provides the answers for these “grey areas,” we integrate them into our official records. We then retrain our auxiliary model on this newly enriched dataset, making it instantly more precise. From there, the process begins again, creating a continuous loop that proactively hunts for and eliminates our model’s remaining blind spots.

    The Active Learning loop with the four main phases: training on existing data, finding the tipping points, labelling them using an LLM, and finally adding them to the existing data to start the loop over.

    By repeating this process, we don’t just add data; we specifically “fix” the model’s most significant blind spots. This iterative loop ensures that our final engine isn’t just bigger but it’s also significantly smarter.
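    The “finding the edge” step above can be sketched as a binary search between a known success and a known failure. The auxiliary model here is a one-line stand-in, not the actual tree-based model:

```python
def blend(success, failure, alpha):
    """Mix two feature vectors: alpha=0 gives the success, alpha=1 the failure."""
    return [s * (1 - alpha) + f * alpha for s, f in zip(success, failure)]

def find_tipping_point(success, failure, aux_predict, steps=50):
    """Binary-search the blend factor at which the predicted label flips."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if aux_predict(blend(success, failure, mid)) == "positive":
            lo = mid  # still on the success side: move toward the failure
        else:
            hi = mid
    return blend(success, failure, (lo + hi) / 2)

# Toy auxiliary model: "positive" whenever the first feature exceeds 0.5.
aux = lambda x: "positive" if x[0] > 0.5 else "negative"
borderline = find_tipping_point([0.9, 0.1], [0.2, 0.8], aux)
# borderline sits right on the model's decision boundary (first feature near 0.5)
```

    The scenario returned is, by construction, the one the auxiliary model is least sure about, which is exactly what gets sent to the LLM oracle for labelling.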

    Results

    The balancing act: Extracting the final datasets

    Building a Hybrid Graph is a delicate exercise in calibration. Our challenge was to find the perfect equilibrium: How much should we trust our internal data and how much “weight” should we give to the LLM’s external knowledge?

    To test this, we generated several different graph versions, eventually selecting the largest and most robust one. This ensured our Synthetic Data Generator had a dense enough “knowledge web” to create high-quality, non-random datasets. To keep our findings clear, we kept environmental “noise” to a minimum, ensuring we were testing the core intelligence of the graph itself.

    Similarly, when building the datasets to test our Active Learning strategies, we had to find the right blend of human experience and AI insight. After testing multiple configurations, we discovered our “Golden Ratio” was in the region of 80% Real-World Data and 20% LLM Knowledge. This 80/20 balance proved to be our most effective setting. It ensures the model remains firmly grounded in the proven reality of WPP’s historical success, while still allowing enough “AI intuition” to fill in the gaps and explore new strategic frontiers.
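    The 80/20 blend itself is simple to reproduce. This sketch assumes records are plain rows and that the LLM-labelled pool is subsampled to hit the target share; both are illustrative choices:

```python
import random

def golden_ratio_mix(real, llm, real_share=0.8, seed=0):
    """Blend real records with LLM-labelled ones at the chosen ratio."""
    n_llm = round(len(real) * (1 - real_share) / real_share)
    rng = random.Random(seed)
    return real + rng.sample(llm, min(n_llm, len(llm)))

real_rows = [f"real-{i}" for i in range(80)]
llm_rows = [f"llm-{i}" for i in range(100)]
mixed = golden_ratio_mix(real_rows, llm_rows)
# 80 real rows plus 20 sampled LLM rows: the 80/20 "Golden Ratio"
```

    Keeping the real data whole and rationing the synthetic share is what anchors the blended dataset in proven history.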

    The reality check: Lessons from the data

    To evaluate the results, we ran a “head-to-head” test. We trained one model using only real-world data and another one using our LLM-enhanced hybrid dataset. We then tested both against a “holdout” set of real campaign results.

    Here are the results of our models, trained on the real dataset and tested against the holdout:

    | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
    | --- | --- | --- | --- | --- |
    | Tree-based | 60% | 54% | 72% | 55% |
    | Deep Learning | 67% | 60% | 79% | 63% |

    Models’ Baseline Performance (trained and tested on real-world data)

    Building on the foundations of our previous research (From guesswork to foresight: How AI is predicting the future of marketing campaigns), we transitioned our models from a controlled synthetic environment to the complexities of 100% real-world campaign data.

    Our standard models, which previously proved their strength in synthetic testing, delivered a highly competitive baseline. This “Reality Benchmark” set a high bar, while simultaneously identifying clear opportunities for our LLM-based techniques to add value.

    The results revealed a clear trend: while the models excelled at identifying “Average” campaigns, they struggled to pinpoint the extreme “Positive” or “Negative” outliers. This is a common phenomenon in real-world marketing. Unlike our controlled synthetic environments, where we can perfectly balance the ratios, real-world data is heavily weighted toward “average” outcomes. Exceptional successes and disasters are rare, making them significantly harder for a model to learn and predict.

    Within this context, Deep Learning (v2) emerged as our strongest baseline, achieving a solid 67% overall F1 score. The Tree-Based Approach performed slightly below the Deep Learning architecture, reinforcing our decision to move toward more “relational” neural networks to navigate the noise and imbalance of complex marketing datasets.

    By establishing this 67% mark as our “Line in the Sand,” we can clearly measure the true impact of our Hybrid Graph and Active Learning interventions. Here is how our LLM-enhanced methodologies performed:

    | Method | Variant | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
    | --- | --- | --- | --- | --- | --- | --- |
    | Hybrid Graph | 13k rows, 60% density | Tree-based | 37% | 51% | 10% | 50% |
    | Hybrid Graph | 13k rows, 60% density | Deep Learning | 17% | 11% | 2% | 37% |
    | Hybrid Graph | 90k rows, 60% density | Tree-based | 42% | 57% | 10% | 60% |
    | Hybrid Graph | 90k rows, 60% density | Deep Learning | 24% | 14% | 8% | 35% |
    | Active Learning | Broad Point Search (Real 80%, LLM 20%) | Deep Learning | 66% | 59% | 77% | 62% |
    | Active Learning | Targeted Point Search (Real 78%, LLM 21%) | Deep Learning | 68% | 60% | 80% | 63% |

    Hybrid Graph & Active Learning Performance: comparison of the different models and experiments by metric

    The findings in the table above were unexpected, but deeply insightful: We noticed a significant drop in performance when the LLM was added to the loop for hybrid graph and almost no increase with active learning.

    The hybrid graph challenge: A significant divergence

    The most striking finding was the performance of the Hybrid Graph. Despite increasing the data volume to 90k rows, the scores dropped significantly, with overall F1 ranging from just 17% to 42%.

    This drop reveals a fundamental truth: generic LLMs are trained on public domain knowledge. They lack the specialised, proprietary marketing intelligence that WPP possesses. By weaving general AI “intuition” into a specialised graph, we introduced noise that actively diluted the high-quality signals of our real-world data.

    Even with a denser graph, our models struggled to maintain a consistent F1 score. This proves that marketing success hinges on the niche, proprietary data unique to our field, information that simply isn’t available in the public sphere, rather than just a larger volume of generic data.

    Active Learning: Reaching the efficiency frontier

    In contrast, our Active Learning strategies, specifically the Targeted Point Search, successfully met the benchmark. Using our “Golden Ratio” (78% Real / 21% LLM), the Targeted Point Search achieved a 68% F1 score, slightly outperforming our best real-world baseline.

    While our Targeted Point Search allowed us to maintain performance levels comparable to our 100% real-world baseline, we have to be honest: we expected a more significant leap. To justify a process of this complexity, the “performance lift” needs to be undeniable. This brings us to two critical, strategic questions:

    • The quality risk: For such a marginal improvement in accuracy, is it worth introducing external AI “intuition” into our proprietary ecosystem when we cannot be 100% certain of its quality?
    • The computational cost: Does the slight increase in predictive power justify the high computational expense and the mathematical difficulty of hunting for these “tipping points”?

    In its current state, the answer is a cautious “No”. While the technology is fascinating, the results prove that our internal, high-fidelity data is already doing the heavy lifting. Introducing expensive, public-model “noise” for a 1% gain doesn’t just challenge our efficiency, it risks diluting the “Gold Standard” intelligence that WPP already possesses.

    The strategic conclusion: Expert-led AI

    Our research serves as a powerful reminder that AI is a force multiplier, not a replacement. The performance drop we saw with the Hybrid Graph Dataset underlines the immense competitive advantage of WPP’s proprietary data; generic models simply cannot replicate the “niche” intelligence we already possess.

    While Active Learning was able to slightly exceed our 67% baseline, a one-point lift over the status quo is not enough to justify the hype or the computational cost.

    The core insight: Data quality is the ultimate moat

    This research proves a fundamental truth: Data quality is everything. A generic AI cannot replace the deep, specialised expertise of a marketing professional. The failure of the “public” LLM to improve our results demonstrates that the real path to success lies in keeping our experts in the loop. By using high-fidelity, professional strategy rather than general internet trends, we ensure our models are learning from the best in the business.

    Moving forward: From baseline to breakthrough

    To bridge the gap between “adequate” and “exceptional,” we have identified two clear technical paths to evolve this research:

    1. The fine-tuned oracle: Our current experiments used “off-the-shelf” LLMs. To truly elevate the results, the next logical step is to use fine-tuned models: AIs that have been specifically trained on WPP’s historical successes and internal playbooks. This transforms the oracle from a generalist into a marketing specialist.
    2. Real-world Active Learning: The ultimate validation of Active Learning isn’t a digital oracle; it’s the market itself. A real-time loop can be implemented using Active Learning to identify high-potential “blind spots,” launch those as live test campaigns, and then feed that real-world performance back into our models. This moves us from theoretical testing to real-world evolution.

    Ready to explore the specifics? Read our full technical deep dive into Data Enrichment Pod for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.

  • Brand Perception Atlas: Mapping the modern brand, from social signal to core equity

    Introduction

    A brand is what people repeatedly, collectively, and emotionally decide it is, not simply what the company declares.

    In 2026, those decisions are happening everywhere: on Instagram, TikTok, in reviews, in news cycles, in comment sections, and in the accumulated memory of long-term brand equity.

    For organisations attempting to understand brand perception, this environment presents a fundamental challenge. The volume of available data is unprecedented, yet the signals it produces are often inconsistent and contradictory. Traditional research tools such as surveys and focus groups remain essential for measuring brand equity, but they capture perception only at specific points in time and cannot fully reflect the fast-moving nature of digital conversation.

    At the same time, social media offers a continuous stream of public commentary, revealing how brands are discussed, interpreted, and compared in everyday discourse. However, these signals are noisy and difficult to interpret in isolation.

    The Brand Perception Atlas was built to turn that noise into a map, designed to integrate diverse perception signals into a unified analytical framework. By combining social media data, public knowledge sources, Large Language Model (LLM) summaries, and established brand equity research, the Atlas aims to provide a more comprehensive understanding of how brands are perceived across the digital ecosystem.

    The first iteration of this project analysed perception signals for more than 200 brands and over 4,000 source items across the included sensors, covering the US market in 2025–26. This enabled the construction of a visual representation of brand perception that reveals relationships, consistencies, and divergences across multiple sources.

    A short video walkthrough of an anonymised version of the Atlas can be viewed below ⬇️

    The Brand Perception Atlas

    The Brand Perception Atlas functions as a navigational system for brand strategy. Individual data points reveal little on their own, but when thousands of signals are mapped together, larger patterns come into view.

    The Brand Perception Atlas turns those scattered signals into a shared perceptual map, showing how brands cluster, where they compete, and which meanings they occupy in the public imagination.

    To achieve this, the Atlas synthesises perception signals from several sources:

    • Content from official brand accounts on platforms such as TikTok and Instagram
    • Public narratives reflected in sources like Wikipedia
    • AI-generated summaries, using Gemini, describing how brands are perceived in LLM-based discourse
    • Survey-based brand equity data from the WPP Brand Asset Valuator® (BAV), which anchors the analysis in long-term brand perception

    BAV holds a special place among brand-perception sensors. Developed by WPP, BAV is one of the world’s largest and longest-running brand equity studies, spanning more than three decades, thousands of brands, and multiple markets. Unlike social and digital signals, which infer perception from public behaviour, BAV measures it directly by asking consumers what they believe. That makes it the Atlas’s anchor: not a snapshot of what people are saying today, but a benchmark for what they have come to believe over time.

    BAV captures this through 48 standardised imagery attributes, from functional traits such as “Reliable” and “High Quality” to more emotional cues such as “Charming”, “Daring”, and “Friendly”. In the Atlas, those attributes provide a high-resolution view of brand meaning, making it possible to see exactly which dimensions of perception define a brand’s underlying equity.

    Table 1 presents samples from all Atlas sensors for two well-known brands: Brand A (a major US retailer) and Brand B (a global travel platform). The table illustrates how the same brand can be perceived very differently across sensors, highlighting the need for a tool like the Brand Perception Atlas, which brings these diverse perspectives together into a single, contextualised view.

    Sample brand perception reports
    | Source | Brand A | Brand B |
    | --- | --- | --- |
    | Survey (BAV) (long-term equity) | “Consumers perceive this brand as a highly accessible and dependable choice, offering excellent value for money. It consistently earns praise for its reliable, high-quality, and original offerings.” | “Perceived as a highly original, daring, and progressive leader in the travel space. It is seen as a ‘cool’ and ‘friendly’ brand that offers unique, high-quality experiences.” |
    | Wikipedia (public narrative) | “Positioned as offering upscale products at below-average costs, appealing to a younger, more educated, and higher-income demographic.” | “A global travel platform. However, its narrative is often complicated.” |
    | Gemini LLM (digital discourse) | “Perceived as a clean, organised, and pleasant one-stop-shop that blends everyday necessities with trendy, affordable finds.” | “Widely viewed as a pioneer of authentic travel, praised for design and convenience.” |
    | Instagram (brand official account) | “An accessible and enjoyable retail destination… The playful and organised presentation of shopping reinforces a positive, discovery-driven customer experience.” | “A visually stunning showcase of bucket-list stays and architectural marvels. The brand projects an aspirational yet community-focused vibe.” |
    | TikTok (brand official account) | “Widely seen as a trendy, accessible retailer offering stylish, curated products and collaborations.” | “Highly energetic and trend-focused. The vibe is one of discovery and adventure, making global travel feel personal and attainable.” |

    Table 1: Sample brand perception reports, for Brand A and Brand B, from the Brand Perception Atlas dataset

    Core brand identity

    When a focus brand is selected, the Atlas reveals its core brand identity: a breakdown of the top perceptual themes by channel. This provides an at-a-glance view of what each sensor is saying about the brand, and where those signals align or diverge.

    For example, selecting Brand C (a global technology company) surfaces a clear identity profile. On BAV, the brand is perceived as Reliable, Innovative, and Intelligent. On Wikipedia, the dominant themes shift to Innovative, Dominant, and Controversial. Each sensor contributes a different angle, but together they reveal the full perceptual picture.

    Figure 1: Screenshot of the selected brand’s core brand identity

    Mapping brand perception

    At the centre of the Atlas is the Perception Map, a visual representation of how brands relate to one another in terms of public perception.

    Each perception signal is converted into a numerical representation of semantic meaning, known as an embedding. A helpful way to think about semantic meaning is as the “essence” of a word, rather than the word itself. For example, although “Luxury” and “High-end” are lexically different, they convey very similar meanings. The Atlas uses semantic Gemini embeddings to capture these relationships and understand the semantic similarity between terms such as “Luxury,” “Premium,” “Prestigious,” and “High-End.” Because their meanings are closely related, the system places brands characterised by these words in the same neighbourhood on the map.
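    The neighbourhood idea above can be illustrated with a toy example. Cosine similarity is a common way to compare embeddings; the Atlas itself uses high-dimensional Gemini embeddings, whereas the 4-dimensional vectors below are invented purely for illustration:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy embeddings: "Luxury" and "High-end" point the same way; "Budget" does not.
luxury   = np.array([0.90, 0.80, 0.10, 0.00])
high_end = np.array([0.85, 0.75, 0.15, 0.05])
budget   = np.array([0.10, 0.20, 0.90, 0.80])

print(cosine_similarity(luxury, high_end))  # near 1.0: same neighbourhood on the map
print(cosine_similarity(luxury, budget))    # much lower: a distant neighbourhood
```

    Brands described by semantically close terms end up with similar embeddings, which is what places them in the same region of the Perception Map.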

    The resulting map resembles a landscape of brand meaning, allowing brand leaders to identify clusters of brands that share common associations and spot outliers that occupy distinctive perceptual positions.

    Omnichannel consistency

    One of the most useful insights derived from the Atlas is a metric that we refer to as omnichannel consistency. This measure evaluates how closely aligned a brand’s perception is across different information sources. If the signals derived from social media, surveys, and public narratives cluster tightly together, the brand is communicating a consistent identity. Conversely, if these signals are widely dispersed, the brand is expressing different identities across different channels.
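    One plausible way to compute such a metric is the mean pairwise cosine similarity between a brand’s per-sensor embeddings. The sketch below is illustrative only (the sensor names and vectors are hypothetical, and this is not necessarily the Atlas’s exact formula):

```python
import numpy as np
from itertools import combinations

def omnichannel_consistency(sensor_embeddings: dict) -> float:
    """Mean pairwise cosine similarity across a brand's sensor embeddings.
    Values near 1.0 mean every sensor is telling the same story."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = [cos(a, b) for a, b in combinations(sensor_embeddings.values(), 2)]
    return float(np.mean(sims))

# Hypothetical brand whose sensors all point in roughly the same direction:
aligned = {
    "bav":       np.array([0.90, 0.10, 0.00]),
    "instagram": np.array([0.85, 0.15, 0.05]),
    "tiktok":    np.array([0.88, 0.12, 0.02]),
}
print(f"{omnichannel_consistency(aligned):.3f}")  # close to 1.0
```

    A tightly clustered brand such as the Brand D example below would score near 1.0 on this kind of measure, while a dispersed brand would score markedly lower.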

    Analysis of the dataset identified several brands with exceptionally strong consistency across channels, including Brand D (an industrial equipment manufacturer), Brand E (a heavy equipment manufacturer), and Brand F (a health insurance provider), each showing more than 99% omnichannel consistency. Brand D is a clear example: its core perception of rugged reliability remains stable whether measured through a 30-year longitudinal BAV survey or reflected in viral TikTok videos. All Brand D signals appear tightly clustered on the perception map across every sensor.

    Brand D’s digital content bridges historical heritage and modern utility. This alignment is not merely aesthetic but is deeply rooted in a consistent cross-platform narrative.

    • On BAV, the brand is perceived as a “formidable and dependable leader,” successfully blending a “rugged” and “traditional” foundation with a “distinctively cool and stylish” upper-class appeal. It is consistently rated as a “best-in-class” choice that is “worth more,” signalling its status as a premium investment.
    • On Gemini, the narrative reinforces this by projecting the “vibe of a legacy American icon built on reliability.” The discourse centres on the “durability and longevity” of the equipment, where the brand’s well-known slogan continues to underpin a reputation for high performance.
    • On social sensors, the brand’s perception is characterised by “practical innovation”. Reports highlight a “strong brand affinity” built through “real-life customer stories centred around dedication and essential work.” While the content celebrates the “versatile and dependable” nature of the machinery for tasks like snow removal, it also reflects a modern tension: the high cost of entry and proprietary technology, which mirrors the BAV finding that the brand is perceived as a “significant investment.”

    At the other end of the spectrum, Brand G (a global hospitality company) occupies distinct perceptual territories across platforms, with its social media presence emphasising aspirational luxury and curated experiences, while its long-term equity centres on reliability and traditional prestige. As shown in Figure 2, the Atlas surfaces this as a clear pattern, one that may reflect a deliberate strategy to engage different audiences through different channels.

    • The BAV sensor (bottom right in Figure 2) positions the brand within a territory of reliability, superior quality, and prestigious appeal, with BAV reports describing it as “a beacon of established excellence and prestige… commanding a perception of leadership and trustworthiness firmly rooted in tradition.”
    • The social sensors (Instagram and TikTok, upper right corner in Figure 2) position the brand in a complementary territory defined by aspirational luxury and curated escapism. In these digital spaces, Brand G’s presence is human-centric and inclusive, with official reports describing an organisation that “champions individuals” and demonstrates “deep cultural understanding.”

    Figure 2: Screenshot of the Brand Perception Atlas showing the dispersed points of Brand G

    The variance in the data speaks to a brand successfully managing a legacy reputation while aggressively chasing a modern, inclusive digital identity (see Table 2).

    | Sensor | Key perceptual snippet | Core narrative |
    | --- | --- | --- |
    | BAV | “This brand is primarily perceived as highly reliable… a beacon of established excellence and prestige. It commands a perception of leadership and trustworthiness firmly rooted in tradition.” | The anchor: focuses on dependability, classic excellence, and proven heritage. |
    | Instagram | “Brand G… projects a vibe of inclusivity, reliability, and cultural understanding. Praised for its global reach, diverse workforce, and initiatives focused on social responsibility.” | The bridge: humanises the giant; focuses on “people first” and cultural connectivity. |
    | TikTok | “Cultivates an aspirational and exclusive vibe… promoting global travel and luxury experiences… purveyor of curated escapism and wellness.” | The future: targets the frequent traveller seeking sophisticated, unique retreats and indulgence. |

    Table 2: Sample brand perception text, for Brand G, from the Brand Perception Atlas dataset

    Interestingly, multiple travel-related brands exhibited high dispersion across the semantic perception space. The high volume of online discussion surrounding travel experiences, ranging from positive stories to customer complaints, may contribute to a more fragmented perception environment for brands in this sector.

    Omnichannel consistency is not inherently good or bad. Some brands benefit from a tightly aligned identity across platforms, while others thrive by expressing different facets of themselves in different contexts. In categories such as entertainment, fashion, and travel, more fragmented perception may reflect adaptability and cultural relevance rather than weakness.

    For this reason, the consistency metric is diagnostic, not prescriptive. It shows where a brand sits on the spectrum between a unified and multi-faceted perception, helping leaders assess whether that pattern aligns with their strategy.

    Shared equity, different vibe (close on BAV, far on socials)

    The Atlas also reveals unexpected relationships between brands in completely different industries. One of the clearest patterns appears when two brands share a similar equity foundation but project very different identities on social media.

    A good example is Brand H (an industrial conglomerate) and Brand G. At first glance, they are not intuitive neighbours. Brand H is associated with science, engineering, and industrial innovation, while Brand G is associated with hospitality, travel, and aspirational leisure. On social media, they occupy very different cultural spaces, and in the Atlas’s social sensors they sit far apart.

    | Source | Brand H | Brand G |
    | --- | --- | --- |
    | BAV | “Projects a vibe of rugged, energetic reliability combined with a visionary, original spirit… praised for its trustworthiness and distinctively high quality.” | “Perceived as superior quality and high-performing… described as unique, stylish, and authentically simple. Appeals to a sophisticated, aspirational lifestyle.” |
    | Gemini | “Deeply divided: seen as a legacy American innovator of household staples, but its reputation is tarnished by high-profile legal and environmental controversies.” | “A vast corporate giant in hospitality. Reputation has a dual identity: a provider of aspirational luxury vs. an impersonal entity with inconsistent service.” |
    | Instagram | “Science and technology powerhouse, praised for its problem-solving capabilities. Emphasises STEM education and its role in enabling future technologies.” | “Cultivates a multifaceted image as a global provider known for community, sustainability, and inclusivity. Projects a vibe of reliability and cultural understanding.” |

    Table 3: Sample brand perception reports, for Brand H and Brand G, from the Brand Perception Atlas dataset

    Content such as Brand H’s Instagram posts around a youth science initiative creates a perception centred on innovation, education, and community. The brand appears inspiring, responsible, and forward-looking. This is very different from Brand G, whose social presence is shaped by luxury, travel, and experience.

    However, the BAV sensor tells a different story. At the level of deeper brand equity, Brand H and Brand G emerge as close neighbours because both are anchored by reliability and leadership. In the consumer mind, Brand H functions as an innovation backbone, while Brand G functions as a service backbone. Their social expressions differ, but their underlying equity plays a similar emotional role: both are seen as dependable institutions.

    A similar pattern appears with Brand I (a membership retail chain) and Brand J (a US airline), as shown in Table 4. These brands belong to very different categories, yet on the BAV sensor they appear close together within a shared “consumer champion” territory. Both brands are anchored by associations such as friendliness and reliability. Brand I is strongly linked to simplicity, while Brand J is associated with value. At a foundational level, both occupy a similar emotional space: trusted brands that provide essential services without the friction consumers often expect from their industries.

    | Source | Brand I | Brand J |
    | --- | --- | --- |
    | BAV | “Predominantly perceived as high-value, reliable, and authentic… offers helpful and intelligent solutions while demonstrating a commitment to equality.” | “Largely perceived as fun, cool, and friendly… valued for its distinctiveness and high quality, contributing to a perception of being trendy and energetic.” |
    | Wikipedia | “A highly successful, global membership-only club known for its value and its strong private-label brand.” | “A hybrid low-cost carrier that disrupted the airline industry with premium amenities. Recently faced scrutiny over alliances and operational reliability.” |
    | Gemini | “Widely perceived as a members-only ‘treasure hunt,’ fostering a cult-like loyalty. Praised for unbeatable value, but criticised for a chaotic in-store experience.” | “A trendy, modern airline struggling to live up to its reputation. While the in-flight experience is praised, operational reliability is a significant pain point.” |
    | Instagram | “Beloved, value-driven club offering a unique shopping experience. Transformed mundane shopping into a leisure activity and shareable ‘haul’ content.” | “Projects a customer-centric and approachable vibe. Highlights above-average amenities and playful brand interactions compared to typical budget carriers.” |

    Table 4: Sample brand perception reports, for Brand I and Brand J, from the Brand Perception Atlas dataset

    This pattern is diagnostically useful, as it shows that brands can share the same equity backbone while expressing themselves very differently across platforms. Brand I doubles down on functional value, while Brand J leans into aspiration and lifestyle.

    Despite operating in different categories and producing very different content, these brands serve the same emotional role for consumers.

    Different equity, shared vibe (far on BAV, close on socials)

    The Atlas also reveals brands that follow the opposite pattern: brands from very different industries that share little underlying equity, yet converge into a similar “vibe” on social media. In these cases, the social layer acts as a cultural blender, pulling very different brands into the same perceptual neighbourhood.

    A strong example is Brand K (a packaged food company) and Brand L (a food and beverage manufacturer). At the level of long-term brand equity, these brands occupy distinct positions. Brand K is perceived as a “highly reliable, high-performance leader” whose vibe blends “traditional prestige” with a “glamorous and daring appeal.” Brand L, by contrast, is seen as an “original and unique leader” that fuses a “traditional foundation with a trendy, cool, and dynamic aesthetic,” praised for its “rugged appeal and fun character.”

    According to the BAV sensor, while both are established food brands, they occupy meaningfully different perceptual territories: Brand K anchored in prestige and reliability, Brand L in originality and rugged charm.

    Yet on social media, the distinction between them fades. On Instagram and TikTok, both brands converge into a shared neighbourhood of nostalgic comfort and family-friendly Americana.

    According to these social sensors, Brand K “cultivates a vibe of reliable convenience and family-friendly nostalgia,” engaging audiences with creative recipe ideas rooted in American food culture. Brand L similarly “evokes a strong sense of nostalgic comfort and reliable quality,” projecting tradition, community, and corporate social responsibility. On social platforms, both brands occupy the same emotional space: trusted pantry staples working to stay relevant with modern, health-conscious consumers.

    | Source | Brand K | Brand L |
    | --- | --- | --- |
    | BAV | “Perceived as a highly reliable, high-performance leader offering exceptional quality. Its vibe is one of traditional prestige blended with a glamorous and daring appeal, reflecting an intelligent and customer-caring image.” | “Widely perceived as an original and unique leader, skilfully blending a traditional foundation with a trendy, cool, and dynamic aesthetic. Praised for its high quality, rugged appeal, and fun character.” |
    | Wikipedia | “A long-standing American multinational food manufacturer with a diverse portfolio. Focused on expanding health-conscious offerings and sustainability, though it has faced scrutiny over legal terms and health claims.” | “A long-established and highly diversified American food and beverage manufacturer. Known for an aggressive acquisition strategy that has transformed it into a Fortune 500 company with a broad portfolio.” |
    | Gemini LLM | “Widely perceived as a dependable, nostalgic American staple increasingly viewed through a lens of corporate scrutiny. Discussions often focus on high sugar content.” | “Quintessential American heritage brand evoking nostalgia and comfort. While loyalty is built on consistent taste, the brand’s reputation has been challenged.” |
    | Instagram | “Cultivates a vibe of reliable convenience and family-friendly nostalgia. Engages a wide audience with creative recipe ideas while facing criticism regarding the nutritional content of its processed foods.” | “Projects a vibe of a reputable company with an emphasis on tradition and community. Praised for its deep-rooted history and significant corporate social responsibility initiatives, including sustainability advancements.” |
    | TikTok | “Presents a comfortable, family-friendly image rooted in American food culture. Struggles to fully resonate with health-conscious consumers due to a perception of limited innovation in its core processed staples.” | “Evokes a strong sense of nostalgic comfort and reliable quality. Working to engage younger audiences by promoting modern, sustainable practices and creative usage of its familiar pantry staples.” |

    Table 5: Sample brand perception reports, for Brand K and Brand L, from the Brand Perception Atlas dataset

    In this digital layer, Brand K’s recipe-driven content and Brand L’s heritage-focused storytelling occupy the same neighbourhood. They are no longer distinguished by their different equity profiles; instead, they are unified as nostalgic American pantry brands that use family-friendly content and tradition to build emotional resonance with their audiences.

    Conclusion

    Understanding brand perception has always been a central challenge in marketing and brand strategy. In the digital era, the challenge is no longer a lack of information, but an excess of it. Organisations now face a paradox: more data than ever, yet less clarity about what it actually means.

    In a world where social media amplifies attention without always reshaping perception, the real strategic advantage lies in understanding the gap between visibility and meaning. The Brand Perception Atlas makes that gap visible. It shows where brands cluster, where they drift, and where surface-level conversation either reinforces or obscures deeper brand equity. In doing so, it helps brand leaders understand not just what people are saying today, but how those conversations connect to the deeper beliefs that shape brand meaning over time.

    The real challenge is not simply tracking what a brand did yesterday. It is understanding what that brand means, what territory it occupies in people’s minds, and how difficult that territory is to shift.

    Ready to explore the specifics? Read our full technical deep dive into the Brand Perception Atlas Pod for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of WPP Research.

  • Training together, sharing nothing: The promise of Federated Learning

    Why Federated Learning now?

    In marketing, data is a competitive edge. The more audience signals, campaign performance data, and consumer behaviour a Machine Learning (ML) model can learn from, the sharper its predictions and the greater its business impact. Across the marketing ecosystem, spanning agencies, brands, and technology partners, organisations individually hold rich and valuable datasets. The potential to learn shared patterns across these assets, without ever exposing proprietary information, could unlock new capabilities: better audience targeting, smarter media spend, and faster creative optimisation at a global scale.

    But here’s the challenge. Across the marketing industry, the most impactful data is inherently distributed. Each organisation, whether agency, brand, or technology partner, holds a unique piece of the data puzzle. Client contracts, privacy regulations like GDPR, and the sheer sensitivity of consumer-level data mean this data has to stay within each organisation’s walls. This is a structural reality of the industry, not a limitation of any single company. The question is whether there is a way to learn collectively from this distributed knowledge without compromising the privacy boundaries that exist for good reason.

    The traditional solution, centralised ML, pools raw data from multiple sources into a single cloud to train a global model. But uploading terabytes of sensitive data to a central server creates severe network latency and exposes collaborators to data breaches and potential violations of privacy regulations.

    Distributed ML methods attempted to address this by splitting training across local worker nodes. Whilst this reduces latency and avoids centralising raw data, these architectures were designed for internal computing clusters, not secure collaboration between independent companies. Without cross-organisation coordination, each organisation’s models are limited to what their own data can teach them, with no mechanism to benefit from shared learning.

    Problem: The collaboration vs. privacy bottleneck

    Organisations face a fundamental tension: gaining the benefits of shared learning typically requires centralising sensitive data, which privacy and contractual obligations rightly prevent. Yet without a way to learn collectively, each organisation’s models are limited to their own data alone. Neither traditional architecture offers a path to collaborative model improvement whilst keeping private data strictly where it belongs.

    Federated Learning (FL) offers a way out of this dilemma by bringing the model to the data, rather than the other way around. To understand why this shift matters, let’s look at how FL actually works under the bonnet.


    How Federated Learning works

    Figure 1: Overview of the Federated Learning communication cycle between a central node and distributed client nodes.

    Federated Learning (FL) enables multiple organisations to collaboratively train a shared model without ever centralising raw data. Instead of moving data to the model, FL brings the model to the data. Training proceeds through iterative rounds:

    1. A central server sends the current global model to all participating nodes (Blue arrow).
    2. Each node trains the model locally on its own private data (Green arrow).
    3. Nodes send back only their model updates, never the underlying data (Pink arrow).
    4. The server aggregates these updates into an improved global model and starts the next round.

    Throughout this process, raw data never leaves its source. Only learned model representations are exchanged across the network.
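    The aggregation step in round 4 is typically the FedAvg rule: a weighted average of client updates, with each client weighted by its local dataset size. A minimal sketch (illustrative only, not WPP’s production setup):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """FedAvg aggregation: average each layer's weights across clients,
    weighted by how much local data each client trained on.
    Only these weight arrays cross the network, never the raw data."""
    total = sum(client_sizes)
    return [
        sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
        for layer in range(len(client_weights[0]))
    ]

# Two clients with a toy one-layer model: client A trained on 100 samples,
# client B on 300, so B's update counts three times as much.
client_a = [np.array([1.0, 1.0])]
client_b = [np.array([3.0, 3.0])]
new_global = fed_avg([client_a, client_b], client_sizes=[100, 300])
print(new_global[0])  # [2.5 2.5]
```

    The server would then broadcast `new_global` back to all nodes to start the next round.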

    The multimodal challenge

    Whilst the above privacy-preserving framework is valuable in its own right, modern marketing data adds another layer of complexity. Organisations do not just work with spreadsheets and numbers. They work with images, video, text, audio, and structured data, often all at once. A single campaign might involve visual brand assets, ad copy, audience segments, and performance metrics across channels. Training models that can reason across these different data types, known as multimodal learning, is already one of the most demanding challenges in ML.

    Now combine that with the constraints of federated learning. Each client may hold different combinations of modalities, in different formats and volumes. One partner might contribute rich visual data, another mostly text and tabular records. Coordinating a single global model that learns effectively from this fragmented, heterogeneous landscape, without ever seeing the raw data, pushes the problem to a new level of complexity.

    This is precisely what makes the intersection of FL and multimodal learning so important, and so hard. If it can be made to work, it unlocks collaborative intelligence across organisations at a scale that neither approach could achieve alone.


    Our objective: can Federated Learning deliver?

    The promise of FL is compelling, but before investing in real-world deployment, we need to answer a fundamental question:

    Does federated learning actually work well enough on multimodal marketing data to justify the tradeoff?

    Centralised training will always have an inherent advantage: it sees all the data at once. The question is not whether FL can beat centralised performance, but whether it can get close enough to make the privacy and collaboration benefits worthwhile. And beyond raw performance, we need to understand how FL behaves under realistic stress conditions: more partners joining, noisy data, and complex cross-modal relationships.

    To answer this, we designed a series of experiments around four key questions:

    Experiment 1 — Centralised vs. federated performance

    • How close can FL get to centralised performance? In a centralised setup, the model sees all the data at once, the ideal scenario for learning. FL, by design, fragments this data across nodes. The first question is whether this tradeoff costs us meaningful accuracy, or whether FL can match centralised results despite never accessing the full dataset.
    • What happens as more clients join? In practice, a federated network might involve a handful of partners or dozens. As the number of participants grows, each node holds a smaller, potentially less representative slice of the overall data. We tested how model performance scales as we increase the number of nodes.

    Experiment 2 — Resilience to noisy data

    • How robust are centralised and federated models to noisy data? Real-world datasets are messy: labels can be wrongly assigned, and data quality varies across partners. We deliberately introduced noise into the multimodal dataset to simulate these imperfections and measure how much degradation the model can tolerate before performance breaks down.
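    One common way to simulate this kind of imperfection is label flipping, where a fraction of samples is reassigned to a wrong class. The sketch below is illustrative of the idea, not the actual experiment code:

```python
import numpy as np

def inject_label_noise(labels, noise_rate, classes=("Positive", "Average", "Negative"), seed=0):
    """Reassign roughly `noise_rate` of the labels to a different class,
    simulating mislabelled campaign outcomes."""
    rng = np.random.default_rng(seed)
    noisy = labels.copy()
    flip = rng.random(len(labels)) < noise_rate
    for i in np.where(flip)[0]:
        others = [c for c in classes if c != labels[i]]  # any class except the true one
        noisy[i] = rng.choice(others)
    return noisy

labels = np.array(["Positive"] * 1000)
noisy = inject_label_noise(labels, noise_rate=0.2)
print((noisy != labels).mean())  # fraction flipped, close to noise_rate
```

    Sweeping `noise_rate` upward then shows at what point each training setup (centralised vs. federated) starts to break down.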

    Experiment 3 — Cross-modal relationships

    • How sensitive are centralised and federated models to underlying cross-modal patterns? Multimodal models learn by finding connections between different types of data. For example, a luxury brand might target a high-income audience through a premium creative tone on a specific platform. Some of these connections appear frequently in the data, whilst others are rare. We tested whether emphasising the most frequent cross-modal patterns in our synthetic data improves performance compared to emphasising the least frequent ones, helping us understand how much the model benefits from common, naturally occurring relationships versus rare, atypical ones.

    The data

    For our experiments, we used a multimodal synthetic dataset generated by our own well-tested synthetic data generator, designed to mirror real-world marketing dynamics. The generator allows us to customise various elements of the data and design targeted datasets that stress-test our model architecture under controlled conditions, giving us full visibility into the factors that drive campaign performance.

    Each campaign in the dataset is described using five key modalities:

    • Audience – the consumer segment being targeted
    • Brand – the positioning and perception of the brand
    • Creative – the tone and message of the campaign
    • Platform – where the campaign runs
    • Geography – the markets being targeted

    Each sample in the dataset is assigned a target label — Positive (over-performing), Negative (under-performing), or Average — indicating whether that particular combination of modalities would lead to a successful, underperforming, or average campaign outcome.

    Experimental results

    All federated experiments are implemented using Flower, a widely adopted open-source framework for federated learning research and deployment. Flower allows us to simulate multi-client federated setups in a controlled environment, making it possible to rigorously test different configurations before moving to a fully distributed architecture.

    To ensure a fair comparison between centralised and federated setups, we kept the playing field level. Both setups use the exact same model architecture, so any performance differences come from how the model is trained, not what is being trained. In the federated setup, data is split equally across nodes, so that each partner sees a representative sample. This way, when we increase the number of nodes, any change in performance can be attributed to the scaling itself, not to differences in what each node’s data looks like.
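    The equal, representative split across nodes can be sketched in a few lines of NumPy. This is an illustrative helper (the function name and shapes are our own, not part of the Flower API): shuffle once, then hand each node an equal IID shard.

    ```python
    import numpy as np

    def partition_equally(X, y, num_nodes, seed=0):
        """Shuffle the dataset and split it into equal IID shards,
        one per federated node (hypothetical helper, not the Flower API)."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(X))
        shards = np.array_split(idx, num_nodes)
        return [(X[s], y[s]) for s in shards]

    # 600 samples split across 5 nodes -> 120 samples per node
    X = np.arange(600).reshape(600, 1)
    y = np.arange(600) % 3
    parts = partition_equally(X, y, num_nodes=5)
    ```

    Because every shard is drawn from the same shuffled pool, differences observed when scaling the node count can be attributed to fragmentation itself rather than to skewed local distributions.
    
    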

    Experiment 1: How much accuracy do we trade for privacy?

    Figure 2: Impact of increasing node fragmentation on Federated Learning performance. Performance clearly degrades as the number of nodes increases from 5 to 15, compared to the baseline centralised model version.

    The centralised model sets the performance ceiling at 79.67%. This is expected: when a single model has direct access to all the data at once, it has the best possible conditions to learn. No information is lost to partitioning, and no coordination overhead is introduced. It’s the ideal scenario, and the benchmark everything else is measured against.

    The federated results tell a clear story: as we add more nodes, performance gradually declines. With 5 nodes, the model reaches 76.23%, a modest drop from the centralised baseline. But as we scale to 10 and then 15 nodes, scores fall to 70.29% and 67.65% respectively. The same pattern holds across all metrics, with the sharpest drops in the model’s ability to correctly identify both positive and negative cases.

    Why does this happen? As more nodes join, the total dataset gets divided into smaller slices. Each node sees less data, which means each node’s local training produces a less reliable picture of the overall patterns. When the server combines these local updates, the differences between them make it harder to converge on a strong global model, an effect we call the “aggregation penalty.”
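    The server-side combination step behind this penalty is, in the standard FedAvg strategy, a sample-weighted average of each client's parameters. A minimal sketch, with illustrative array shapes:

    ```python
    import numpy as np

    def fedavg(client_weights, client_sizes):
        """Weighted average of per-client model parameters (FedAvg).
        client_weights: one list of np.ndarray layers per client.
        client_sizes: local sample counts, used as averaging weights."""
        total = sum(client_sizes)
        num_layers = len(client_weights[0])
        return [
            sum(w[layer] * (n / total) for w, n in zip(client_weights, client_sizes))
            for layer in range(num_layers)
        ]

    # Two clients, one layer each: client 2 holds 3x the data, so it dominates.
    w1 = [np.array([0.0, 0.0])]
    w2 = [np.array([4.0, 8.0])]
    avg = fedavg([w1, w2], client_sizes=[1, 3])  # -> [array([3., 6.])]
    ```

    As the data is split across more clients, each `w` is fitted on a smaller shard, so the averaged updates pull in more divergent directions — the aggregation penalty in miniature.
    
    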

    Lesson learned: FL with 5 nodes comes remarkably close to centralised performance, showing that federated collaboration is viable with minimal accuracy loss. However, as the number of nodes grows, it becomes progressively harder for the global model to match centralised results.

    Experiment 2: How robust are centralised and federated models to noisy data? 

    In practice, marketing data is never perfectly clean. Campaign outcomes don’t fall neatly into “this worked” or “this didn’t.” Was a campaign that slightly exceeded expectations truly a success, or just average? Was a modest underperformance a failure, or noise in the measurement? Different teams may label the same outcome differently, tracking systems introduce inconsistencies, and the line between a “positive” and “average” campaign is often blurry.

    To simulate this reality, we deliberately introduced noise into our synthetic dataset by blurring the boundaries between performance classes. With no noise, the labels are clean — positive, negative, and neutral outcomes are clearly separated. As we increase the noise level from low, to medium, and then to high, the boundaries between these classes increasingly overlap, making it harder for the model to tell them apart. Think of it like gradually turning up the fog: the underlying patterns are still there, but they become harder to see. The federated learning simulation for this experiment was configured with 5 participating clients, consistent with the best-performing federated setup identified in Experiment 1.
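    One simple way to realise this boundary-blurring — a sketch of the idea, not our generator's actual noise mechanism — is to re-assign a fraction of samples to an adjacent performance class:

    ```python
    import numpy as np

    def blur_labels(y, noise_level, rng=None):
        """Simulate noisy outcome labels by shifting a fraction of samples
        to an adjacent class (0 = negative, 1 = average, 2 = positive).
        Hypothetical stand-in for the generator's noise parameter."""
        rng = rng or np.random.default_rng(0)
        y = y.copy()
        flip = rng.random(len(y)) < noise_level          # which samples to corrupt
        step = rng.choice([-1, 1], size=len(y))          # drift up or down one class
        y[flip] = np.clip(y[flip] + step[flip], 0, 2)    # stay in the valid range
        return y

    y = np.ones(1000, dtype=int)             # all "average" outcomes
    noisy = blur_labels(y, noise_level=0.2)  # roughly 20% drift to positive/negative
    ```

    Raising `noise_level` from low to medium to high corresponds to turning up the fog: the class structure survives, but the boundaries overlap more and more.
    
    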

    Figure 3: Performance comparison of centralised (left) and federated learning (right) configurations across increasing noise levels. Both paradigms degrade gradually, with Positive F1 and Negative F1 most affected, whilst the performance gap between the two remains approximately constant across all conditions.

    As expected, both models perform best on clean data and gradually decline as noise increases. At high noise:

    • The centralised model’s score drops from 82.05% to 78.11%
    • The FL model’s score drops from 80.74% to 76.09%

    The good news: neither model collapses. Even at the highest noise level both models still perform reasonably well. The overall accuracy dips, and the models struggle most with distinguishing clearly positive or clearly negative campaigns, which makes sense, since those are exactly the boundaries we blurred. However, their ability to capture general patterns across the dataset remains stable throughout.

    As in Experiment 1, the centralised model maintains a consistent edge over the federated setup at every noise level, but the gap between them stays roughly the same. This means that FL doesn’t become more fragile in noisy conditions; it handles data messiness about as well as its centralised counterpart.

    Lesson learned: Real-world data is inherently noisy, and any viable model must be able to handle that. Both centralised and FL models show strong resilience — performance declines gradually rather than breaking down, even when the data is heavily corrupted. Importantly, FL’s relative performance holds steady across noise levels, suggesting it is no more vulnerable to messy data than centralised training.

    Experiment 3: How sensitive are centralised & federated models to underlying cross-modal patterns?

    Our synthetic data generator creates campaign data based on a graph of relationships between five key factors: Audience, Brand, Creative, Platform, and Geography. Each relationship captures whether a particular combination of these factors tends to drive strong or weak campaign performance. Some of these relationships are common and obvious — they show up frequently and reflect well-known marketing dynamics. Others are rare and subtle — unusual combinations that don’t appear often but may carry uniquely valuable signal about what makes a campaign succeed or fail.

    Understanding how these different types of patterns affect learning is important for both training paradigms. If the nature of the underlying data patterns matters, we need to know whether centralised and federated models respond to them in the same way — or whether one setup handles certain patterns better than the other. To investigate this, we generated three versions of our dataset, keeping everything else the same:

    • Common-first: The generator focuses on the most frequently occurring combinations and downplays the rarest ones. This gives us a dataset dominated by typical, familiar marketing patterns.
    • Rare-first: The opposite — the generator prioritises the rarest combinations and downplays the most common. This fills the dataset with unusual, less obvious patterns.
    • Middle-ground: The generator focuses on combinations that fall in the middle of the frequency spectrum, neither the most common nor the rarest.
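    The three dataset versions differ only in how cross-modal combinations are weighted by their observed frequency. The sketch below uses invented weighting rules and names (our generator's real knobs differ), but captures the idea:

    ```python
    import numpy as np

    def sampling_weights(freqs, strategy):
        """Turn observed pattern frequencies into sampling weights
        (illustrative). 'common_first' favours frequent patterns,
        'rare_first' inverts that, 'middle_ground' peaks mid-spectrum."""
        f = np.asarray(freqs, dtype=float)
        if strategy == "common_first":
            w = f
        elif strategy == "rare_first":
            w = 1.0 / f
        elif strategy == "middle_ground":
            w = 1.0 / (1.0 + np.abs(f - np.median(f)))
        else:
            raise ValueError(strategy)
        return w / w.sum()

    freqs = [100, 50, 10, 2]  # how often each cross-modal pattern occurs
    w_rare = sampling_weights(freqs, "rare_first")  # rarest pattern weighted highest
    ```

    Everything else — modalities, label logic, dataset size — is held constant, so any performance difference between the three versions is attributable to the frequency emphasis alone.
    
    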

    As in Experiment 2, the federated learning simulation was run with 5 participating nodes, and performance was compared against the centralised baseline across all three dataset versions.

    Figure 4: Impact of cross-modal relationships on model performance. Prioritising rare feature combinations (Rare-first) substantially improves accuracy compared with focusing on common patterns, showing that atypical relationships provide a stronger learning signal for both centralised and federated learning paradigms.

    The results were striking. The Rare-first configuration dramatically outperformed the other two, achieving peak scores of 94.41% (Centralised) and 93.43% (FL), compared to scores in the 86–88% range for the Common-first and Middle-ground setups.

    This tells us something counterintuitive: the model learns far more from unusual feature combinations than from common ones. The typical, frequently seen patterns are in some sense “easy”, they don’t give the model much new information. But rare combinations force the model to learn more nuanced and distinctive boundaries between what makes a campaign succeed or fail.

    As in previous experiments, the centralised model maintains a small edge over FL, but the ranking between dataset strategies stays the same in both setups. Whether training centrally or federally, prioritising rare patterns is the winning strategy.

    Lesson learned: Not all data is equally valuable. Prioritising rare, atypical feature combinations produces significantly better models than focusing mostly on common patterns. This has direct implications for how we design synthetic datasets: rather than mimicking the most typical marketing dynamics, we should deliberately include uncommon combinations to give the model a richer and more discriminative learning signal.

    The impact and looking ahead

    This work is just the initial spark for our federated learning efforts. Verifying that the performance of our centralised ML models degrades only slightly under a reasonable number of participating nodes opens the discussion about delivering ML solutions to clients who are reluctant to share data yet face a common industry problem. The FL approach allows companies to securely train a shared global model on their own datasets without the risk of data leakage throughout the training process.

    Although Federated Learning has been an established collaborative learning method since 2017, it remains a highly active research domain in academia and a strategic priority for industrial implementation.

    Our next objective is to stress-test and further expand our federated learning (FL) infrastructure by enabling learning across nodes that hold highly heterogeneous data, with substantially different shapes, feature spaces, and underlying distributions. This introduces a number of technical challenges, including how to align representations, aggregate knowledge effectively, and maintain stable performance when local data varies significantly from node to node. Overcoming these challenges will unlock deeper insights into the robustness and scalability of our FL framework, and will allow our models to learn more effectively in realistic, decentralised settings where data heterogeneity is the norm rather than the exception.

    Ready to explore the specifics? Read our full technical deep dive into Multimodal Federated Learning for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.

  • The adventures of AI agents on social media: from core interests to rabbit holes



    Team Members: Andreas Stavrou, Nikos Gkikizas

    Introduction

    What if an AI agent could experience the internet the way a person does – scroll through feeds, react to content, develop tastes, get influenced and evolve?

    That is the question behind the SocialAgents research pod. We are building an autonomous agent that browses the internet the way a real human would: it sees content, forms opinions based on its personality/background, decides whether to engage and over time develops new interests shaped by what the algorithm chooses to show it. This blog post documents the first phase of that effort, focused on social platforms as the initial source of information.

    The work tries to answer the following question:

    How do online platforms shape what different users see, engage with, and eventually come to think or believe?

    Content-recommendation algorithms do not just match interests – they introduce new content, test engagement, and then reinforce it. By running controlled agents with known starting profiles, we can track how exposure differs across user types, how it changes over time, and what drives those shifts. This gives us a precise way to study algorithmic influence that is impossible with real users.

    Each agent is defined by a rich profile (including age, occupation, cultural background, content affinities, aversions etc.) and interacts with platform content through the same actions available to any user: scrolling, liking, saving, commenting, following and sharing. Engagement decisions are made by a multimodal AI model that reasons over the agent’s personality and the content it encounters. Every session is designed to produce realistic behavioral patterns, with timing and warm-up progressions that mirror how a genuine new user explores a platform.

    The sections that follow detail the methodology, early experimental results and the infrastructure required to run these simulations at scale. The early findings show that within a single session, the algorithm accurately identified each agent’s interests and then began expanding them into adjacent territories.


    The mechanics of human navigation of social media

    When you scroll through a feed, your brain runs a rapid filtering process, forming relevance judgments in under 50 milliseconds and pausing when it detects something novel, emotionally charged or personally relevant. Surprise, humor, curiosity and outrage are the strongest scroll-stoppers, because they trigger emotional circuits faster than conscious thought.

    What keeps you scrolling isn’t satisfaction but anticipation: the infinite scroll removes natural stopping points, feeding a dopamine loop where the next post might be the rewarding one, a variable-ratio reinforcement pattern, the most compulsive reward schedule in behavioral psychology.

    Figure 1 – The single-creative feed – each creative takes up the whole vertical space

    Different platforms leverage this in different ways. On some platforms, attention is measured in watch time and rewatches. The algorithm auto-serves content and hooks you within the first second of a video, so seconds of hovering are captured passively and automatically. Others earn attention more deliberately: you lean in, judge a visual aesthetically and decide to tap. Key signals are saves and swipes rather than raw watch time. This is why engagement rates can vary dramatically across platforms – some report averages of ~4.64%, while others sit closer to ~0.43% [TikTok vs. Instagram: A Deep Dive into Engagement Rates and Content Performance].

    Figure 2 – The continuous “For You”-like home feed – creatives are being fed in a vertical feed

    The interest of people browsing social media is incredibly short-lived. The average attention span for a video is 1.7 seconds on mobile and 2.5 on desktop [Capturing Attention in Feed: The Science Behind Effective Video Creative].

    A Fors Marsh Group study found that as little as 0.25 seconds of exposure is enough for people to recall mobile feed content at a statistically significant level, meaning the brain is processing and encoding content far faster than conscious attention suggests [Facebook video ad viewability rates are as low as 20%]. This makes simulating human content browsing on social media with generative AI particularly tricky: the response time of multimodal transformer-based APIs ranges from roughly 4 to 8 seconds for 200 tokens [LLM Latency Benchmark by Use Cases in 2026], far above the average attention span, so an agent would erroneously signal interest to the platform on every creative simply by taking the time to consider it.


    Simulating human behavior on social media

    Our framework decomposes human browsing into three layers – persona construction, perception and judgment and behavioral execution – each calibrated against real-world engagement distributions. But the framework serves a deeper purpose than creative testing: it is how we test a foundational question – can AI personas reliably stand in for real humans in the eyes of a recommendation algorithm?

    Every simulation begins with a synthetic persona – not a shallow archetype but a deeply specified psychological and demographic profile. Each persona encodes age, gender, location, occupation, education, income bracket, cultural background, daily routines, content affinities and content aversions. These are the digital equivalents of the implicit biases and taste structures that real users carry into every scroll session. A 34-year-old veterinary nurse in Manchester with a dry sense of humor and a distaste for influencer culture will engage with content in measurably different ways from a 22-year-old design student in Brooklyn who follows streetwear accounts.

    For every social post, our agent estimates probabilities for each possible action – scroll away, like, save, comment, follow – accompanied by a reasoning trace explaining why this persona would or would not engage with this specific piece of content. That trace is essential for auditing whether the agent is genuinely responding to the persona’s specific traits.

    Raw model outputs are not behaviors. A 16% “Like” probability and an 8% “Comment” probability mean nothing without calibration against platform-specific base rates. We apply a smoothing layer that adjusts per-post probabilities to known engagement benchmarks. The calibrated probabilities are then sampled to produce a single action.
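    A minimal sketch of that calibrate-then-sample step is below. The base rates, smoothing rule, and all names are illustrative assumptions, not our production values or any platform's official figures:

    ```python
    import random

    # Hypothetical per-impression base rates (must sum to 1) -- illustrative only.
    BASE_RATES = {"scroll": 0.93, "like": 0.045, "save": 0.015,
                  "comment": 0.005, "follow": 0.005}

    def calibrate(model_probs, base_rates, strength=0.5):
        """Shrink raw model probabilities towards platform base rates,
        then renormalise so the result is a valid distribution."""
        mixed = {a: (1 - strength) * p + strength * base_rates[a]
                 for a, p in model_probs.items()}
        total = sum(mixed.values())
        return {a: p / total for a, p in mixed.items()}

    def sample_action(probs, rng=None):
        """Draw a single action from the calibrated distribution."""
        rng = rng or random.Random(0)
        actions, weights = zip(*probs.items())
        return rng.choices(actions, weights=weights, k=1)[0]

    raw = {"scroll": 0.70, "like": 0.16, "save": 0.05,
           "comment": 0.08, "follow": 0.01}
    calibrated = calibrate(raw, BASE_RATES)  # like: 0.16 -> ~0.10, pulled down
    action = sample_action(calibrated)
    ```

    The smoothing keeps the model's relative preferences while anchoring absolute rates to what real users actually do, so the agent neither over- nor under-engages relative to the platform norm.
    
    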


    What each simulation produces

    Each simulation produces two outputs:

    • An interaction log: a record of every post the agent saw, what it did (scrolled past, liked, saved, commented), the probability behind that decision, and the reasoning.
    • A feed report: a snapshot of the content the platform served at different points in the session, showing how the feed changed over time.

    Imagine an agent built to mirror a 28-year-old personal finance enthusiast. Over a one-hour social media session it encounters 500 posts. The interaction log records that it liked 12, saved 3, commented on 1, and scrolled past the rest – along with why (e.g., “liked because the budgeting tip matched the agent’s stated interest in saving strategies”).

    The feed report then shows that by minute 40, the social media platform had started mixing in mental-health and self-improvement clips alongside the finance content – a shift the agent didn’t ask for, but that the algorithm introduced on its own.

    Running multiple distinct agents through the same platform for hours doesn’t just produce engagement metrics – it produces a controlled experiment on the algorithm itself. We observe what content the algorithm pushes to each agent, how that mix shifts over time, and what happens when the algorithm starts exposing the agent to novel or trending types of content.

    By logging the agent’s reasoning at every step, we can identify exactly which creative attributes – visual tone, emotional register, narrative hook – made that unexpected content compelling enough to earn a like or a save.


    Analysis of interactions based on persona characteristics

    We ran two agents through extended sessions on a social media platform. Before diving into results, here’s who they are.



    George is a 36-year-old senior finance analyst based in Athens. He follows investment strategies, personal finance, fitness, and business leadership content. He values data-driven advice, skips past crypto hype and hustle culture, and engages most with content that offers practical, actionable takeaways. He scrolls deliberately - slowing down for charts and analysis, skipping memes in under two seconds.


    Sofia is a 25-year-old social media coordinator, also in Athens, who creates content around fashion, travel, and fitness. She engages with styling tips, travel itineraries, workout routines, and creator growth strategies. She scrolls fast past ads but lingers on vibrant visuals and aesthetic content. Her feed time is high - she checks social media five times a day.

    Within the first session, the platform identified each agent’s core interests accurately. George’s feed was dominated by stock analysis, personal finance tips, and fitness content. Sofia’s feed filled with recipe tutorials, fitness routines, and travel vlogs. Roughly 60–80% of the content served matched their declared interests – measured by whether the content category aligned with the agent’s stated affinities.

    But the remaining 20–40% is where the story gets interesting.

    The off-topic content was not random. George was shown mental health clips, motivational content, and street food showcases – adjacent emotional territories that share the aspirational tone of self-improvement media. Sofia received tech gadget unboxings, entrepreneurship stories, and macro-economic forecasts – probing whether her preference for short-form, personality-driven content would transfer to informational topics. The algorithm wasn’t guessing. It was testing the edges of each agent’s taste profile.

    And the agents followed. George developed sustained engagement with psychology content and food showcases, reaching interaction rates comparable to his core finance interests. Sofia adopted tech gadgets and entrepreneurship narratives — topics that traditional demographic targeting would never have surfaced to a 25-year-old fashion content creator. By session five, these weren’t exploratory recommendations anymore. They were part of each agent’s regular content diet.

    Figures 3 and 4 below visualize this shift. Each chart tracks the proportion of content categories served to the agent over time, showing how the feed gradually expanded beyond the original interest profile.

    Figure 3 – George’s Content Ecosystem Evolution


    Figure 4 – Sofia’s Content Ecosystem Evolution

    What these results suggest is that the algorithm doesn’t just confirm existing tastes – it actively expands them. It found the edges of each agent’s interest profile and pushed content into those gaps, widening what each agent consumed over time.


    Persona adaptation to trends and suggestions

    The previous section showed that the algorithm quickly identifies what each agent cares about – and then starts pushing content beyond those boundaries. The natural follow-up question is: what happens if the agent actually adopts those new interests?

    To test this, we took the content categories that the algorithm surfaced and that each agent consistently engaged with during the first round of experiments, and folded them into the agent’s profile as declared interests. In other words, we let the first round of browsing reshape who the agent claims to be.

    For George, the enrichment added five categories that emerged from his initial sessions: player performance clips, quick recipe tutorials, media bias and propaganda breakdowns, music performances and concerts, and travel destination vlogs. None of these were part of his original finance-and-fitness profile – they were interests the algorithm introduced and George chose to engage with.

    For Sofia, the enrichment was broader – nine new categories: motivational speeches and quotes, day-in-the-life vlogs, mental health and psychology clips, personal finance hacks, home and furniture, music performances and concerts, tech gadget unboxings, workout tutorials, and geopolitical conflict updates. Some of these, like tech gadgets and personal finance, were far outside the fashion-travel-fitness profile she started with.

    We then re-ran the full simulation with these enriched agents. Same platform, same session structure, same interaction approach – but with agents whose declared interests now reflected the expanded taste profiles earned in the first round.
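    The enrichment step itself can be sketched as a simple filter over the previous round's engagement; the threshold, field names, and helper are our own invention:

    ```python
    def enrich_profile(profile, engagement_log, threshold=0.1):
        """Fold algorithm-introduced categories that the agent consistently
        engaged with back into its declared interests (hypothetical sketch).
        engagement_log maps content category -> engagement rate last round."""
        new_interests = [
            category for category, rate in engagement_log.items()
            if rate >= threshold and category not in profile["interests"]
        ]
        return {**profile, "interests": profile["interests"] + sorted(new_interests)}

    george = {"name": "George", "interests": ["personal finance", "fitness"]}
    log = {"personal finance": 0.30, "recipe tutorials": 0.18,
           "travel vlogs": 0.12, "crypto hype": 0.02}
    # Sustained engagement promotes two new categories; "crypto hype" stays out.
    george_v2 = enrich_profile(george, log)
    ```

    Running the simulation again with `george_v2` closes the observe-enrich-rerun loop: each cycle widens the declared profile only by interests the agent demonstrably adopted.
    
    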

    The results confirmed that the cycle continues. With a richer interest profile to work from, the algorithm pushed even further. George, who originally cared about finance and fitness and had since adopted recipe content and travel vlogs, was now being served bodybuilding content, tech gadget reviews, and podcast highlight reels – and engaging with them. Sofia’s feed expanded in similar ways. Each round of enrichment gave the algorithm more surface area to explore, and it used that surface area aggressively.

    Figures 5 and 6 below show the content mix evolution for George and Sofia’s enriched profiles, following the same format as Figures 3 and 4. The key difference is the starting point: the agents entered this round with a wider interest profile, and the algorithm expanded it further still.

    Figure 5 – George’s Content Ecosystem Evolution (Enriched Profile)


    Figure 6 – Sofia’s Content Ecosystem Evolution (Enriched Profile)

    This observe-enrich-rerun approach turns a single experiment into an iterative process. Each cycle produces agents whose interests more closely resemble how real users evolve on a platform over time – not just what they start with, but what they become after sustained exposure to algorithmic recommendations.

    Conclusion

    AI Agents give us a controlled way to observe something we couldn’t observe before: how algorithms reshape what people care about. George started as a finance-and-fitness person. After two rounds of interaction, he was engaging with bodybuilding content, recipe tutorials, and podcast highlight reels – none of which he would have sought out on his own. Sofia went from fashion and travel to tech gadgets and geopolitical updates. These shifts weren’t random. They followed a clear pattern: the algorithm identified adjacent emotional territories, tested them, and when the agent responded, it pushed further.

    The next step is to give our agents access to more sources of information beyond social media – news, trends, search – making their online experience even closer to that of a real person browsing the web. The closer the agent gets to a full human browsing experience, the more we learn about how the digital world shapes what people see, think, and ultimately believe.

    Future Work

    Topics that deserve more focus over the next months are:

    1. Expansion to other sources of dynamic information (News, Trends etc.) – Social media platforms are interesting, but specific content types might never surface there, or only after a delay. An interesting question to answer is: how do other sources of dynamic information affect the way personas perceive content and interact with it?
    2. Impact of trends on personas – Determine how social media trends (e.g. viral videos, trending brands) influence the interests of different personas.
    3. Marginal contribution of specific interests to the variability of content – We have already seen that engaging with specific content types can be more influential on what the algorithm serves. More work is needed to understand which personas are more sensitive to adding or removing interests in terms of how their feed evolves.
    4. Understanding content adaptation velocity between slow and fast-paced platforms – Not all algorithms are created equal. Further research is required to measure how quickly content evolves on different social media platforms.
  • Meet Your New Agentic Data Guardian

    1. The high cost of “dirty” data

    In the modern enterprise, data is the raw ingredient behind every strategic decision. Think of it like a premier restaurant: the Data Engineer is the sous-chef, meticulously sourcing and preparing ingredients, while the Data Scientist is the executive chef, transforming them into the predictive models and insights that drive the business forward. If the ingredients are spoiled or mislabelled, the final dish fails, no matter how talented the chef.

    Across several of our AI initiatives at WPP, we uncovered a pattern that was quietly draining velocity from our most ambitious projects. Our “sous-chefs”, skilled data engineers responsible for pipeline integrity, were spending up to one full day per week on tedious, largely manual Quality Assurance (QA) of data flowing into BigQuery. Row by row, column by column, they checked for missing values, logical contradictions, and phantom duplicates, work that was essential but deeply repetitive.

    This wasn’t just an inconvenience. It was a strategic bottleneck: it slowed the delivery of every downstream AI application, consumed senior engineering talent on janitorial tasks, and most dangerously created risk. When a human eye is the only safeguard between raw data and a production model, errors don’t just slip through occasionally. They slip through systematically, at exactly the moments when the data is most complex and the engineer is most fatigued.

    We asked ourselves a different question: What if, instead of building another dashboard or writing another validation script, we built an intelligent agent, one that could reason about data quality the way an experienced engineer does, learn from every audit it performs, and get better over time?

    This article describes how we built that agent, what makes it fundamentally different from traditional automation, and what happened when we put it to the test.


    2. The problem: why data quality demands more than scripts

    The data & the modelling ecosystem

    The agent operates on digital marketing campaign performance data hosted in BigQuery, massive tables that track how advertising campaigns perform on a daily basis across major ad networks like Meta (Facebook and Instagram). Each row represents a highly granular intersection of a specific campaign, audience segment, platform, device, and creative asset. This data captures everything from broad identifiers (like the parent brand and geographical targeting) down to precise performance metrics, including impressions, clicks, daily spend, conversions, leads, and app installs.

    This foundational data is the lifeblood of two critical machine learning systems:

    1. The Prediction Model: A classification system designed to predict whether a planned campaign will yield a negative, neutral, or positive outcome.
    2. The Recommendation System: A highly flexible advisory engine capable of handling any combination of “missing modalities.” For example, if a media planner inputs a specific Brand, Target Audience, and Location, the system dynamically recommends the optimal missing parameters, such as the best platform to use and the most effective creative asset to deploy.

    For more background on the broader modelling context, see From Guesswork to Glimpse: How AI is Predicting the Future of Marketing Campaigns.

    The silent threat of data corruption

    Because these models directly inform real-world media spend and strategic campaign planning, their accuracy is paramount. The underlying data is regularly refreshed directly from the advertising platforms to keep the models up to date. However, this automated refresh process frequently introduces subtle corruption and systemic inconsistencies.

    For instance, while metrics like engagement and clicks generally remain stable, downstream pipeline issues frequently render conversions and awareness metrics unreliable (“not high quality”). At the individual row level, these anomalies are often entirely invisible. But at scale, they are devastating. If left unchecked, these untrustworthy data points bleed into the training sets, silently degrading the prediction model’s accuracy and causing the recommendation engine to suggest sub-optimal, expensive campaign configurations. This makes rigorous, automated data quality validation not just a nice-to-have, but an absolute necessity for the ecosystem to function.

    The failure modes

    The scale and velocity of data flowing into BigQuery mean that errors don’t announce themselves. They hide. Through our manual QA process, we catalogued six prevalent failure modes, each one capable of silently degrading every model built on top of the data:

    | Failure Mode | What Happens | Why It Matters |
    | --- | --- | --- |
    | Missing Values | Fields arrive empty: sometimes 5% of a column, sometimes 40% | Models trained on incomplete data learn incomplete patterns. Forecasts drift silently. |
    | Outliers | A metric reads 200,000 clicks when the true value is 500 | A single extreme value can skew an entire model’s calibration, distorting spend recommendations. |
    | Duplicate Rows | Identical records appear multiple times | Inflated counts cascade into inflated budgets. Campaigns appear to outperform reality. |
    | Categorical Corruption | A brand name like "Nike" is replaced with "zX9pQ" | Segmentation breaks. Reports attribute performance to entities that don’t exist. |
    | Logical Inconsistencies | More clicks than impressions. Spend recorded against zero impressions. | These are the most insidious. Each value looks valid in isolation, but the relationships between them violate business reality. |
    | Missing Columns | An entire field disappears from a refresh | Downstream pipelines fail or, worse, silently fall back to defaults. |

    A static validation script can catch some of these: the easy ones, the ones you’ve already seen. But scripts are brittle: they encode yesterday’s assumptions and break on tomorrow’s edge case. They cannot reason about why a pattern looks wrong, weigh it against historical context, or decide whether a recurring anomaly is a genuine error or a known artifact of a data source.
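    To make the brittleness concrete, here is a minimal sketch of the kind of static check described above (our own illustration, not the production pipeline; the column names are assumptions). It catches exactly the rules it encodes and nothing else:

```python
# Static validation sketch: hard-coded rules over campaign rows.
# Illustrative only; column names are assumptions, not the real schema.

def validate_row(row: dict) -> list[str]:
    """Return the rule violations found in a single campaign row."""
    issues = []
    if row.get("clicks") is None or row.get("impressions") is None:
        issues.append("missing value")
    elif row["clicks"] > row["impressions"]:
        issues.append("clicks exceed impressions")
    if row.get("spend", 0) > 0 and row.get("impressions", 0) == 0:
        issues.append("spend with zero impressions")
    return issues

rows = [
    {"impressions": 1000, "clicks": 40, "spend": 12.5},
    {"impressions": 0, "clicks": 0, "spend": 9.9},      # spend, no delivery
    {"impressions": 500, "clicks": 800, "spend": 3.0},  # clicks > impressions
]
flagged = {}
for i, row in enumerate(rows):
    issues = validate_row(row)
    if issues:
        flagged[i] = issues
# The script flags rows 1 and 2, but a novel failure mode (say, a corrupted
# brand string) passes silently, because no rule encodes it.
```

    A script like this must be rewritten every time a new failure mode appears; the agent, by contrast, decides what to check.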

    That requires judgment. And judgment is what we built the agent to provide.


    3. Our approach: an agent that reasons, remembers, and improves

    We designed the Data Quality Assurance Agent as a reasoning entity capable of planning an audit strategy, querying data, forming hypotheses about its health, testing those hypotheses, and learning from the results. The distinction matters. A script checks what you tell it to check. An agent decides what to check, based on what it knows, and has the tools to act on that decision end-to-end.

    Architecture: one agent, specialised tools

    The agent is powered by a single reasoning core that plans, decides, and acts. What gives it breadth is its toolkit, a set of specialised capabilities it can invoke as needed, selecting the right tool for each step of the audit:

    Data Agent Architecture Diagram
    • Database Tool: enables the agent to query BigQuery directly, fetching schemas, row counts, column statistics, and raw data samples.
    • Auditing Tool: the agent’s analytical engine. It formulates hypotheses about potential quality issues, runs targeted checks, and compiles structured findings. This tool reads from and writes to the Memory Bank.
    • Analytics Tool: generates visualisations using Python, including charts, distributions, and plots that make audit findings immediately legible to stakeholders.
    • Artifact Tool: packages the final audit report, charts, and evidence into downloadable artifacts stored in Google Cloud.

    The agent orchestrates these tools autonomously. When a user asks it to audit a table, the agent formulates a plan, queries the data, runs its checks, generates visualisations where useful, and compiles a structured report, all without the user needing to specify which tool to use or in what order.
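    In simplified form, that plan-query-audit-package flow might look like the following sketch. Plain Python functions stand in for the real tools; every name here is hypothetical, and this is not the production ADK code:

```python
# Simplified sketch of the agent's tool orchestration.
# All function names and return values are illustrative stand-ins.

def database_tool(table: str) -> dict:
    # Stand-in for a BigQuery query: returns schema and basic stats.
    return {"table": table, "rows": 1000, "columns": ["impressions", "clicks"]}

def auditing_tool(stats: dict) -> list[str]:
    # Stand-in for hypothesis-driven checks against the fetched stats.
    return [f"checked {col} for nulls/outliers" for col in stats["columns"]]

def artifact_tool(findings: list[str]) -> str:
    # Stand-in for packaging findings into a downloadable artifact.
    return "audit_report.pdf with " + str(len(findings)) + " findings"

def run_audit(table: str) -> str:
    """The agent's plan: query -> audit -> package, chosen autonomously."""
    stats = database_tool(table)
    findings = auditing_tool(stats)
    return artifact_tool(findings)

report = run_audit("campaign_performance_table_1")
```

    The user only asks for an audit; the sequencing of tools is the agent’s decision, not the user’s.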

    The key innovation: long-term memory

    Most AI tools are stateless. When the session ends, everything the system learned disappears. The next audit starts from zero. This is the fundamental limitation we set out to break. The agent maintains a persistent Memory Bank, a long-term knowledge store that survives across sessions and accumulates institutional intelligence over time. This memory captures three categories of knowledge:

    1. Historical Explanations: When a data engineer confirms that a recurring anomaly is caused by a known tracking limitation or data source quirk, the agent records that explanation. The next time it encounters the same pattern, it doesn’t waste time flagging it as a new issue; it references the known cause, notes it in the report, and moves on to genuinely novel problems.
    2. Business Context: Over successive audits, the agent absorbs the specific rhythms and patterns of our marketing data: seasonal spikes, platform-specific reporting delays, expected variance ranges for different campaign types. This contextual awareness allows it to distinguish between a real anomaly and normal business variation.
    3. Evolutionary Learning: With every audit, the agent’s knowledge base deepens. Instead of repeating the same blind checks, it refines its hypotheses based on what it has seen before, including which columns tend to have issues, which tables are most prone to duplication, and which logical inconsistencies recur. The agent doesn’t just run. It compounds.

    This is what separates an agent from a script. A script executes the same logic every time, regardless of history. The agent carries forward everything it has learned and every audit it performs makes the next one sharper.
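    A minimal sketch of the idea, assuming a simple JSON file as the store (the production system persists memory via Vertex AI Agent Engine, not this code; the memory categories mirror the three above):

```python
import json
import os
import tempfile

class MemoryBank:
    """Toy persistent memory store: survives across 'sessions' via a file."""

    def __init__(self, path: str):
        self.path = path
        self.memories = []
        if os.path.exists(path):
            with open(path) as f:
                self.memories = json.load(f)  # reload prior sessions' knowledge

    def remember(self, kind: str, note: str) -> None:
        # kind: "explanation", "business_context", or "finding"
        self.memories.append({"kind": kind, "note": note})
        with open(self.path, "w") as f:
            json.dump(self.memories, f)

    def recall(self, kind: str) -> list[str]:
        return [m["note"] for m in self.memories if m["kind"] == kind]

path = os.path.join(tempfile.mkdtemp(), "memory.json")
MemoryBank(path).remember("explanation", "table_2 duplicates are a known quirk")
# A later session loads the same file and recalls the explanation:
later_session = MemoryBank(path)
known = later_session.recall("explanation")
```

    The essential property is that the second `MemoryBank` instance starts with the first one’s knowledge already loaded, which is exactly what a stateless tool cannot do.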

    The tech stack

    To ensure the agent was enterprise-grade, we built on the full Google Cloud AI ecosystem:

    | Component | Role |
    | --- | --- |
    | Vertex AI Agent Engine | Manages the agent’s long-term memory persistence and saves chat sessions |
    | BigQuery | The single source of truth, where the agent performs direct, in-place auditing against production tables |
    | Agent Development Kit (ADK) | The framework used to define the agent’s tools, constraints, and interaction boundaries |
    | Google Cloud Storage | Persistent storage for audit trails, PDF reports, and visual evidence |
    | Cloud Run | Deploys the A2A Agent API and the ADK Web UI for demo purposes |
    | A2A | The protocol used to expose the agent as a headless API |

    4. Proving it works: synthetic error injection

    We didn’t hope the agent worked. We proved it using a controlled methodology we call Synthetic Error Injection. The premise is straightforward: take a perfectly clean dataset, intentionally corrupt it in specific, measurable ways, and then challenge the agent to find every error we planted. If the agent can detect artificially injected errors, whose exact type, location, and severity we control, we can be confident it will handle real-world data corruption, which is typically far less extreme.

    Step 1: Preparing the test data

    Before injecting errors, we prepare the data for safe, controlled experimentation:

    • Anonymisation: Real brand and advertiser names are replaced with generic identifiers ("Brand 1", "Company A"). Sensitive business information never enters the test environment.
    • Corruption: The dataset is then corrupted at varying severity levels. This allows us to map the agent’s detection accuracy as a function of error density, from subtle (5%) to extreme (40%).

    Step 2: Injecting controlled errors

    Using purpose-built scripts, we introduce precisely calibrated corruptions into a table: 4 types of Structural and 7 types of Logical errors:

    | Category | Error | Description |
    | --- | --- | --- |
    | Structural | Missing Values (Nulls) | X% of cells set to NULL |
    | Structural | Duplicate Rows | X% exact row copies |
    | Structural | Dropped Columns | X% of columns removed |
    | Structural | Categorical Errors | Random alphanumeric strings in category fields |
    | Logical | Clicks > Impressions | Can’t click what wasn’t shown |
    | Logical | Conversions > Clicks | Can’t convert without clicking |
    | Logical | Spend with 0 Impressions | Paying for zero ad delivery |
    | Logical | Video Completions > Plays | Can’t finish without starting |
    | Logical | Purchases without Add-to-Cart | Funnel step skipped |
    | Logical | Landing Page Views > Clicks | More landings than clicks |
    | Logical | Negative Metric Values | Performance metrics can’t be negative |
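    A minimal sketch of one such injection script, assuming in-memory rows rather than a live BigQuery table (the column names and the 5% rate are illustrative):

```python
import random

def inject_errors(rows: list[dict], rate: float, seed: int = 0) -> list[dict]:
    """Corrupt a fraction `rate` of rows with a logical violation.

    Here the violation is 'clicks exceed impressions'; the real scripts
    cover 4 structural and 7 logical error types.
    """
    rng = random.Random(seed)  # seeded so the injection is reproducible
    corrupted = [dict(r) for r in rows]  # leave the clean data untouched
    n = max(1, int(len(rows) * rate))
    for i in rng.sample(range(len(corrupted)), n):
        corrupted[i]["clicks"] = corrupted[i]["impressions"] + 100
    return corrupted

clean = [{"impressions": 1000, "clicks": 50} for _ in range(100)]
dirty = inject_errors(clean, rate=0.05)  # 5% severity
bad = sum(1 for r in dirty if r["clicks"] > r["impressions"])
```

    Because we control the seed, rate, and error type, we know exactly which rows the agent should flag, which is what makes the ground truth below possible.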

    Step 3: Synthetic ground truth dataset

    We record every error we introduce in a tracking table, producing a ground truth dataset that looks like this:

    | Table_name | number_of_injected_logical_errors | type_of_logical_error | number_of_injected_structural_errors | type_of_structural_error |
    | --- | --- | --- | --- | --- |
    | table_01 | 0 |  | 1 | categorical errors |
    | table_02 | 0 |  | 1 | dropped columns |
    | table_03 | 1 | clicks_exceed_impressions | 0 |  |
    | table_04 | 1 | spend_with_zero_impressions | 0 |  |

    5. Evaluation pipeline, experiments and results

    To evaluate our agent, we follow the pipeline below:

    Evaluation pipeline flow diagram

    The 4 experiments and results

    Each experiment isolates a single variable to understand what affects the auditor agent’s detection quality.

    Experiment 1: Prompt comparison

    Question: Does giving the agent a more detailed prompt improve error detection?

    We run the agent 3 times on the same table, each time with a different user query style:

    | Prompt Level | What the user asks | Agent’s system instruction |
    | --- | --- | --- |
    | Simple | “Check if there are any errors for table X” | Stays constant (forensic mode) |
    | Medium | “Perform a structured assessment checking physical integrity, numerical sanity, categorical validity” | Stays constant |
    | Complex | “Conduct a forensic audit checking for 11 specific error types with detailed cross-column logical checks” | Stays constant |

    Key insight from results: Only the complex prompt successfully detected the injected spend_with_zero_impressions error (139 rows, 1.82%); both the simple and medium prompts missed it entirely. This confirms that detailed, forensic-style instructions are critical for the agent to test nuanced logical relationships rather than perform only surface-level checks.


    Experiment 2: Table sweep

    Question: How well does the agent detect different types of errors?

    Experiment 2 stress-tests the Auditor agent (using the complex prompt) on 11 synthetic BigQuery tables with progressively stacked error combinations, ranging from a single logical violation to the full set of 7 logical plus 4 structural error types (11 total). The agent achieved perfect detection on 8 of 11 tables (72.7% of tables with F1 = 1.0) and delivered strong overall performance (precision 0.905, recall 0.864, F1 0.883).
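    Scoring a table reduces to comparing the set of error types the agent detected against the set we injected. A minimal sketch of that computation (the error-type names are illustrative):

```python
def score(detected: set[str], injected: set[str]) -> dict:
    """Precision, recall, and F1 for one table's audit vs. ground truth."""
    tp = len(detected & injected)   # injected errors the agent found
    fp = len(detected - injected)   # false alarms
    fn = len(injected - detected)   # injected errors the agent missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

injected = {"clicks_exceed_impressions", "duplicate_rows", "missing_values"}
detected = {"clicks_exceed_impressions", "duplicate_rows", "negative_metrics"}
metrics = score(detected, injected)  # 2 hits, 1 false alarm, 1 miss
```

    Averaging these per-table scores across the 11-table sweep yields the aggregate precision, recall, and F1 reported above.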

    Experiment 3: Consistency check

    Question: If we run the exact same audit N times, do we get the same results?

    We run the identical configuration multiple times (default: 3).

    The Consistency Experiment demonstrated that the auditor agent achieves 100% reproducibility, consistently detecting both injected errors and real data quality issues with the exact same metrics and explanations across all three runs. This deterministic performance confirms that the agent’s complex prompt configuration is highly reliable, effectively eliminating the need for redundant audits.


    Experiment 4: Memory ablation

    Question: Does loading past audit memories help the agent perform better?

    Runs the agent twice on the same table with the complex prompt:

    | Condition | What happens |
    | --- | --- |
    | No Memory | Cold start; agent has no context from prior audits |
    | With Memory | Agent loads past findings from Vertex AI memory before auditing |

    Key insight from results: Memory doesn’t unlock new capabilities when tables contain simple structural errors, or when the user prompt already requests a very detailed audit; instead, it acts as a critical consistency enforcer, stabilising the audit scope to guarantee a 100% success rate. However, when a table contains many different logical errors and the prompt is simple, loading past memories into the agent’s context makes performance optimal.


    6. The agent in action

    Below is an actual interaction with the deployed agent, demonstrating its reasoning in practice. The agent discovers available tables, loads its accumulated memory from prior audits, and then, guided by that memory, conducts a targeted audit.

    Discovery

    Human: “What tables do you have access to?”

    Agent: “I have access to the following tables in the dataset:

    • campaign_performance_table_1
    • campaign_performance_table_2
    • campaign_performance_table_3
    • campaign_performance_table_4
    • campaign_performance_table_5”

    Memory recall

    Human: “Load your memories.”

    Agent: “I have loaded the following memories:

    1. A general instruction to always check for null values in tables.
    2. A specific data quality issue in campaign_performance_table_2 where 188 rows (18.8%) were duplicates based on event_date, campaign_id, and ad_creative_id.
    3. A preference for reports to be formatted in Markdown and include an executive summary.”

    The agent doesn’t start from scratch. It arrives with institutional knowledge: prior findings, user preferences, and standing instructions, all retrieved from its persistent Memory Bank.

    Hypothesis-driven audit

    Human: “Based on these memories, audit campaign_performance_table_3.”

    The agent formulates a plan, queries the table using its database tool, and, informed by its memories, produces a structured audit report:

    Executive Summary: This audit revealed a consistent pattern of missing values across key columns, affecting exactly 5% of total rows. Specifically, event_date, campaign_id, impressions, clicks, and conversions each had 50 null entries out of 1,000 rows. No exact duplicate rows were found. This suggests a systemic issue with data completeness rather than random omissions.

    The agent then provides detailed findings per column, quantifying impact, assessing severity, and recommending next steps, before offering to proceed with outlier detection, categorical validation, and logical consistency checks.

    Persistent learning

    Human: “Save the results for this table.”

    Agent: “The audit results have been saved to memory.”

    The next time the agent audits any table, it will carry forward what it learned here: the 5% missing-value pattern, the columns affected, the systemic nature of the issue. Every audit makes the next one faster, more targeted, and more informed.


    7. Impact and what comes next

    Operational impact

    The deployment of the Data Quality Assurance Agent has delivered measurable results:

    • Reclaimed engineering capacity. By automating the bulk of manual QA, data engineers have shifted from repetitive inspection work to higher-value activities.
    • Earlier error detection. Quality issues are now caught at the point of ingestion, before they propagate into downstream models. This “shift-left” approach reduces the blast radius of bad data from hours to minutes.
    • Higher model reliability. Marketing agents, analytics pipelines, and machine learning models now operate on data that has been systematically validated, reducing the risk of predictions and recommendations built on flawed foundations.

    The bigger picture

    This agent is more than a tool. It is a blueprint for autonomous data governance, a pattern that can be replicated across any data pipeline where quality, scale, and velocity collide.

    We are currently extending the agent along three axes:

    • Cross-table auditing: enabling the agent to detect inconsistencies across related datasets, not just within a single table. Many of the most damaging data quality issues manifest as contradictions between tables that individually look clean.
    • Event-driven execution: triggering the agent automatically whenever a BigQuery table is updated, transforming data quality monitoring from a scheduled chore into a continuous, always-on safeguard.
    • Adversarial stress-testing: today, our synthetic error injection is script-based and manually configured. We are building a dedicated adversarial agent whose sole purpose is to generate increasingly complex, realistic data corruptions, subtle logical contradictions, plausible-looking outliers, correlated missing-value patterns, specifically designed to challenge the QA agent’s detection capabilities. By putting one agent against the other in a continuous red-team / blue-team loop, both improve: the adversarial agent learns to craft harder-to-detect errors, and the QA agent learns to catch them, driving each other toward sharper, more robust performance over time.

    Together, these extensions move us toward a future where data quality monitoring is not a task that consumes an engineer’s day. It is a capability the agent handles continuously and intelligently, surfacing only the issues that require human judgment and decision-making.

    Ready to explore the specifics? Read our full technical deep dive into Data Quality Agent Pod for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.

  • A self-improving AI agent for optimising and explaining media performance

    Introduction

    Influencer marketing has matured into a multi-billion-dollar channel, and the industry’s investment in measurement, audience intelligence, and brand-safety tooling has grown alongside it. Even so, one specific challenge remains difficult to solve at scale: forecasting the engagement of an individual post before it is published. The cost of an underperforming placement can lead to missed momentum and a creative team restarting a cycle that better foresight could have shortened.

    Existing predictive approaches have made meaningful progress by leveraging structured signals such as hashtag usage, visual composition, and posting cadence, and they perform well for the decisions those features can inform. Where they reach a ceiling is in capturing the cross-modal, contextual qualities that separate adequate content from high-performing content: why a particular tone lands with a particular audience, or why a product placement feels native rather than intrusive. These are judgments that experienced strategists make fluently but that conventional feature pipelines were not designed to encode.

    The remaining gap is less about computational power than about representational depth. What makes a post resonate is something a skilled strategist can often articulate after the fact: the influencer’s tone felt effortless, the product placement didn’t interrupt the narrative, the caption hit a cultural nerve. These judgments integrate context, intent, and audience understanding in ways that go beyond pixels and metadata alone.

    This is the problem the Prediction Optimisation Agent was designed to address. Building on the structured signals that current pipelines already capture, the agent adds a layer of contextual interpretation: it examines the image, the caption, and the influencer’s history, then writes a structured natural-language description of the factors most likely to drive the post’s performance. A creative director can read this description, challenge it, and act on it, complementing quantitative scores with the kind of reasoning that makes those scores actionable.

    The intuition is simple: the description that best predicts performance is, by definition, the description that best explains it. The agent iteratively refines this description by diagnosing its own errors, identifying what the previous descriptions failed to capture, and rewriting its own instructions to self-improve. Over successive iterations it converges on the specific qualities that actually drive engagement, derived not from hand-crafted rules but from the systematic minimisation of its own predictive error, and surfaces those qualities in a form that teams can inspect, debate, and build strategy around.

    The anatomy of a viral post

    Predicting the performance of an ad or a social media post before publishing remains a primary objective for marketers and influencers alike. But how do you distill something as complex and subjective as a social media post into a single prediction?

    Consider a typical Instagram post. It is never just a picture. It’s a complex combination of different data types working together simultaneously. Take the influencer post shown in Figure 1. To truly understand why this post succeeds or fails, you need to consider:

    • The image itself — composition, lighting, color palette, subjects, products, and setting.
    • The caption — where the influencer might share a discount code, crack a joke, or strike an emotional chord.
    • The influencer’s identity — their bio, follower count, niche credibility, and historical performance.
    • The metadata — the time of day, geographic location, hashtags, and platform-specific context.

    Each of these dimensions carries signal. None of them tells the full story alone. The magic, and the difficulty, lies in how they interact.

    Figure 1: A typical influencer post. Traditional analytics struggle to measure the combined impact of the visual aesthetic, the caption’s tone, and the underlying metadata. To accurately predict engagement for a post like this, our system analyses the image, caption and influencer statistics together as a single cohesive unit. All persons, brands and products depicted in this image are AI generated.

    Where current approaches reach a ceiling

    The prevailing approach to content-level prediction decomposes a post into its component signals and processes each through a dedicated model. This modular architecture has clear engineering advantages. Each component can be developed, validated, and updated independently and it performs well for the decisions those individual signals can inform:

    • Computer Vision Models: Isolated image-recognition algorithms scan the visual to detect objects, people, or products. Separate models handle face detection and emotion recognition. The output is a list of labels: “person detected,” “beverage detected,” “outdoor setting.”
    • Text Analysers & OCR: NLP tools parse the caption, counting hashtags, flagging emojis, scoring sentiment. Meanwhile, optical character recognition (OCR) software reads any text visible within the image itself.
    • Tabular Metadata Algorithms: An algorithm ingests structured fields like follower count, posting time, engagement history, and produces its own independent prediction.

    Engineers then attempt to fuse these outputs into a single forecast. Each module performs its own task well, but because the signals are extracted independently, the fused representation inherits a structural limitation: it has difficulty capturing meaning that emerges from the interaction between modalities, qualities that exist not in the image or the caption alone, but in the relationship between them.

    Consider a concrete example. Imagine a fitness influencer posts a photo of herself laughing mid-sip from an energy drink, with the caption: “My face when someone says they don’t need pre-workout 😂.”

    A computer vision model would tag this as: “person detected,” “beverage detected,” “outdoor setting,” “positive facial expression.” A text analyser would count the hashtags and flag the emoji. But the joke (the caption reframing the laugh as a reaction shot, turning a standard product image into a relatable meme) lives in the interplay between image and text. It is not a property of either signal individually, and a pipeline that processes them separately has no natural place to represent it.

    Similarly, the fact that this influencer is a certified nutritionist (meaning her credentials paired with an energy drink carry implicit credibility that a fashion influencer holding the same product would not) is a cross-modal inference that requires linking metadata (professional background) with visual content (product in hand) and audience expectation. This is the kind of contextual reasoning that falls outside the scope of independently trained modules.

    Humour, irony and credibility through context are the cross-modal qualities that often separate high-performing content from competent content. They are also the qualities that a modular, signal-by-signal architecture was not designed to represent. Closing this gap requires a fundamentally different representational strategy, one that reasons over all modalities jointly from the outset.

    Our approach: Unifying multimodal data through semantic translation

    To address the above, we developed the Prediction Optimisation Agent, a self-improving AI agent that unifies all available data into a single format it can reason about: natural language.

    The agent’s core mechanism is straightforward. It takes complex, multimodal data (numerical metrics, images, video, and text captions) and converts everything into a single natural-language paragraph that holistically describes the post’s content, aesthetic, tone, and context. By projecting all of these distinct formats into readable text, heterogeneous data is normalised into a structure that a language model can process as a unified whole.

    Instead of treating image and text as separate inputs, the agent uses a single prompt to digest all available information at once. Multimodal LLMs serve as one of the agent’s tools, acting as universal feature extractors that capture the abstract, human-centric concepts that traditional pipelines structurally cannot.

    But the agent does not simply produce any description and hope it is useful. It is driven by a feedback loop grounded in predictive error: the descriptions it generates are used to forecast engagement, those forecasts are compared against real outcomes, and the resulting errors tell the agent exactly how much predictive value its current descriptions are capturing and how much they are missing. Through successive rounds of this loop, the agent autonomously rewrites the instructions that govern how descriptions are composed, converging on the paragraph structure that maximises predictive accuracy.

    This error-driven process has a profound consequence for explainability. The description the agent converges on is not a generic summary. It is the description that the agent has discovered, through empirical optimisation, to be the most predictive of real engagement outcomes. In other words, the features highlighted in the final description are there because they matter, because including them reduced prediction error. When the optimised description of a high-performing post calls out “candid humour,” “golden-hour lighting,” and “influencer credibility,” those aren’t arbitrary observations. They are the factors the agent learned to pay attention to because they measurably improved its ability to predict what performs well.

    How the Prediction Optimisation Agent works

    The Prediction Optimisation Agent orchestrates three internal stages in a continuous feedback loop: it observes a post, describes it, predicts its performance, measures how far off it was, and then rewrites its own instructions to produce better descriptions next time — closing the loop and getting measurably better with every iteration, without any human intervention.

    Figure 2: The Prediction Optimisation Agent architecture. Raw media, metadata, and an initial prompt are fed into Stage 1 (Semantic Translation), which produces a natural-language description of the post. Stage 2 (Engagement Predictor) reads that description and predicts engagement. Prediction errors are then passed to Stage 3 (Self-Optimiser), which autonomously analyses what went wrong and rewrites the Stage 1 prompt, closing the feedback loop and improving the system’s accuracy with every iteration.

    Stage 1: Semantic Translation

    The agent begins by ingesting the raw post — the image or video file, the caption text, and all available metadata (follower count, posting time, influencer bio, etc.). Using a multimodal LLM as its translation tool and guided by a detailed set of internal instructions (its prompt), it produces a single, rich natural-language paragraph that captures not just what is in the post, but what the post means: the visual mood, the emotional tone, the relationship between caption and image, and the brand alignment.

    The quality and focus of this description is entirely governed by the prompt and as we will see, it is the prompt that the agent learns to optimise.

    Stage 2: Engagement Predictor

    The agent passes the semantic paragraph to its prediction tool, a model that evaluates the post’s potential performance based entirely on the natural-language description from Stage 1.

    The predictor can be any machine learning model with the ability to understand text paragraphs. It can be based on trees, deep learning, or any other compatible architecture. It can even be a fine-tuned LLM, upskilled for predictions in a specific domain. Our Agent is compatible with all these options.

    In all cases, the quality of the prediction is only as good as the quality of the semantic description. A richer, more targeted paragraph gives the predictor more signal to work with, which is precisely why improving the description is the single most important action the agent can take.
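    As a toy illustration of the principle, here is a stand-in predictor that scores a description by keyword presence. The keywords and weights are invented for this sketch; any real predictor would be a trained model (trees, deep learning, or a fine-tuned LLM, as noted above):

```python
# Toy stand-in for the Stage 2 predictor: maps a natural-language
# description to an engagement score. Keywords and weights are invented
# for illustration; a real predictor is a trained model.

WEIGHTS = {
    "humour": 0.30,
    "credibility": 0.25,
    "cohesion": 0.25,
    "golden-hour": 0.20,
}

def predict_engagement(description: str) -> float:
    text = description.lower()
    return round(sum(w for kw, w in WEIGHTS.items() if kw in text), 2)

rich = "Candid humour and influencer credibility, shot in golden-hour light."
flat = "A person holds a beverage outdoors."
```

    The point of the toy is visible immediately: the richer description scores far higher than the flat one, because it carries the signal the predictor needs. That is why optimising the description is the highest-leverage action the agent can take.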

    Stage 3: Self-Optimiser

    This is the core of the Prediction Optimisation Agent’s intelligence — and what makes it a true agent rather than a static model.

    After predictions are generated for a batch of posts and compared against actual engagement outcomes, the agent enters a self-evaluation cycle. The agent is presented with:

    1. The prompt used to generate descriptions in the current round.
    2. The per-post prediction errors. For each post, the predicted engagement versus the actual engagement, and the magnitude and direction of the miss.
    3. The aggregate error metrics. Overall accuracy statistics such as mean absolute error, broken down by content category, audience size, or post type, revealing systematic blind spots.
    4. The complete history of prior prompts and their corresponding error metrics, so the agent can assess how each previous prompt version performed, which changes improved accuracy, and which did not.

    Armed with this structured feedback, the agent does not guess at improvements. It reasons about them. It asks: Which types of posts am I consistently getting wrong? What do those posts have in common? What dimensions of the content is my current prompt failing to capture?

    Based on this analysis, the agent autonomously rewrites the prompt instructions used in Stage 1. The rewritten prompt is not a minor tweak. It can add entirely new analytical dimensions, remove instructions that proved unhelpful, or restructure the description format to foreground the features most correlated with engagement.

    The critical insight is that the predictive error is the signal that drives the entire system. Every improvement the agent makes to its prompt is motivated by a concrete, quantifiable gap between what it predicted and what actually happened. The agent isn’t optimising descriptions to be “better” in some abstract sense, it is optimising them to be more predictive, and the error metrics tell it exactly where it is falling short.
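    The loop can be sketched in miniature as follows. A toy error function stands in for the real predictor, and a fixed list of candidate instructions stands in for the LLM’s rewriting step; everything here is illustrative:

```python
# Toy sketch of the error-driven prompt-optimisation loop (Stage 3).
# The "rewrite" step is a stand-in for the LLM reasoning over its errors;
# here we simply append a candidate instruction each round.

CANDIDATE_INSTRUCTIONS = [
    "Describe visual elements, style, and mood.",
    "Assess cohesion between talent styling and brand identity.",
    "Evaluate compositional choices such as camera angle and framing.",
]

def mean_abs_error(prompt: list[str]) -> float:
    # Stand-in predictor evaluation: in this toy, richer prompts
    # yield lower error. In reality this comes from comparing
    # predictions against actual engagement outcomes.
    return 1.0 / len(prompt)

history = []  # full history of (prompt, error), kept for the optimiser
prompt = [CANDIDATE_INSTRUCTIONS[0]]
for round_num in range(3):
    error = mean_abs_error(prompt)
    history.append((list(prompt), error))
    if round_num < len(CANDIDATE_INSTRUCTIONS) - 1:
        prompt.append(CANDIDATE_INSTRUCTIONS[round_num + 1])  # "rewrite"

best_prompt, best_error = min(history, key=lambda entry: entry[1])
```

    The structure mirrors the real system: each round’s error is recorded alongside the prompt that produced it, and the prompt with the lowest error wins. Only the rewrite step is trivialised here; in the agent it is a reasoning process over the four inputs listed above.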

    The agent in action: Watching the prompt evolve

    To make this concrete, let’s follow the agent through three optimisation rounds on a single post: the Aura Energy ad featuring two young models in blue athletic wear, photographed from a dramatic low angle against a vivid blue sky lined with palm trees.

    Figure 3: Example Instagram post used for semantic translation. All persons, brands and products depicted in this image are AI generated.

    Round 1: The baseline

    The agent starts with a deliberately simple, generic prompt:

    Prompt (Round 1): “Analyse this Instagram post and the provided metadata. Describe the visual elements of the image, the style, and the mood. Then, explain how the image relates to the brand or influencer described in the metadata.”

    The agent’s description:

    “This Instagram post is a high-energy lifestyle advertisement for Aura Energy, utilizing a vibrant, ‘streetwear-meets-summer’ aesthetic to target a young, active demographic. The image uses a low-angle shot, which makes the subjects appear dominant and ‘larger than life.’ Two young models are featured in athletic/leisure wear. The female subject in the foreground holds the product directly toward the lens and wears reflective ‘matrix-style’ sunglasses. The male subject in the background wears a blue Jordan hoodie. The post aligns perfectly with the brand’s identity as described in the metadata. The visual style is polished and fashionable, positioning the drink as a lifestyle accessory rather than just a functional caffeine source. The caption emphasises ‘Zero Sugar’ and staying ‘sharp’ and ‘unstoppable.’”

    This is already more nuanced than what a single-signal visual model would typically produce. However, the predictor, working from this description, significantly underestimates the post’s engagement.

    Round 2: The agent adapts

    The self-optimiser reviews the Round 1 errors and identifies the gap: the prompt never asked the agent to evaluate visual cohesion between talent styling and brand identity, or the psychological impact of compositional techniques like camera angle. It rewrites the prompt:

    Prompt (Round 2): “Analyse this Instagram post and the provided metadata. Describe the visual elements, style, mood, and composition. Assess the visual cohesion between the talent’s styling (wardrobe, accessories, posture) and the brand’s visual identity. Evaluate how compositional choices — such as camera angle, framing, and subject placement — contribute to the aspirational or emotional tone of the post. Explain how the image, caption, and influencer profile work together to create a unified message.”

    The agent’s description now reads:

    “This Instagram post is a high-energy lifestyle advertisement for Aura Energy that demonstrates strong visual-brand cohesion. The coordinated blue wardrobe across both models — the female subject’s athletic top and the male subject’s Jordan hoodie — creates a seamless extension of the brand’s blue-and-white visual identity, embedding the product within the scene rather than presenting it as a foreign element. The dramatic low-angle shot elevates the subjects into an aspirational, dominant position, making the viewer look up at them — a compositional choice that reinforces the caption’s message of being ‘unstoppable.’ The female subject holds the product at the camera’s focal point with deliberate confidence, with her sunglasses only adding to this effect, while the male subject’s relaxed posture adds an effortless credibility. Together, the image, caption, and influencer positioning frame the energy drink not as a beverage but as a lifestyle symbol for a young, fashion-conscious audience.”

    The prediction for this post improves noticeably. Across the batch, the error on brand-partnership posts drops. But the agent notices it is still missing on posts where influencer credibility plays a key role — posts where who the influencer is matters as much as what’s in the frame.
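
    The gap the agent notices here can be surfaced by breaking a round’s errors down by post segment and ranking the segments worst-first. A minimal sketch, assuming per-post error records and a hypothetical `segment_of` labelling function (the segment names are illustrative, not from the real system):

```python
from collections import defaultdict
from statistics import mean

def error_by_segment(errors, segment_of):
    """Group per-post relative errors by a segment label.

    `errors` is a list of {"post_id", "rel_error"} records; `segment_of`
    maps a post id to a label such as "influencer_credibility". Both names
    are hypothetical stand-ins for illustration.
    """
    buckets = defaultdict(list)
    for e in errors:
        buckets[segment_of(e["post_id"])].append(e["rel_error"])
    # Worst-first, so the self-optimiser sees where it misses most
    return sorted(((seg, mean(v)) for seg, v in buckets.items()),
                  key=lambda kv: kv[1], reverse=True)
```

    A segment that stays at the top of this ranking across rounds is exactly the kind of blind spot the next prompt rewrite targets.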

    Round 3: The agent sharpens further

    The self-optimiser reviews Rounds 1 and 2 side by side, seeing that the Round 2 prompt improved brand-cohesion analysis but still lacks attention to influencer authority and audience trust signals. It rewrites again:

    Prompt (Round 3): “Analyse this Instagram post and the provided metadata. Describe the visual elements, style, mood, and composition. Assess the visual cohesion between the talent’s styling and the brand’s visual identity. Evaluate how compositional choices contribute to the aspirational or emotional tone. Analyse the influencer’s credibility and authority relative to the product category — consider their professional background, niche expertise, audience demographics, and how these factors amplify or undermine the authenticity of the endorsement. Assess the emotional resonance of the caption: does it use humour, aspiration, relatability, or urgency, and how does this tone interact with the visual content? Explain how all elements — image, caption, influencer identity, and audience context — work together to create a unified, persuasive message.”

    The agent’s description:

    “This Instagram post is a high-energy lifestyle advertisement for Aura Energy that achieves strong visual-brand cohesion and leverages influencer-audience alignment to maximise credibility. The coordinated blue wardrobe across both models creates a seamless extension of the brand’s visual identity. The dramatic low-angle shot places the subjects in an aspirational, dominant position, reinforcing the caption’s message of being ‘unstoppable.’ The female subject’s direct-to-camera product hold at the image’s focal point conveys deliberate confidence, along with her reflective sunglasses, while the male subject’s relaxed posture adds authenticity. Critically, the influencer’s established presence in the fitness and lifestyle niche — with a predominantly young, health-conscious following — makes the ‘Zero Sugar’ energy drink endorsement feel credible rather than transactional. The caption’s tone (‘Focus your energy. Unlock your potential’) uses aspirational brevity that mirrors the visual intensity, creating a tight loop between what the viewer sees and what they read. The overall effect positions the product as an identity marker for an active, fashion-forward audience, rather than a commodity — a framing that this influencer’s audience is primed to respond to.”

    This description now captures visual strategy, brand cohesion, compositional psychology, influencer credibility, audience alignment, and caption-image interaction, all in a single readable paragraph. The prediction error drops significantly. A marketing team reading this description can immediately see why the post works: the credibility is authentic, the visual identity is seamless, and the caption reinforces the image rather than competing with it.

    The agent arrived at this level of descriptive richness not because someone hand-engineered the prompt, but because the predictive errors from Rounds 1 and 2 told it exactly what it was missing.

    Real-world application: Lessons and implications

    To validate the Prediction Optimisation Agent in a real-world scenario, it was tested on the dataset compiled by Kim et al. as part of their WSDM 2021 paper (“Discovering Undisclosed Paid Partnership on Social Media via Aspect-Attentive Sponsored Post Learning”), which is available for research purposes upon request from the authors. The dataset contains approximately 10.18 million posts spanning a diverse range of content categories and audience sizes. The results revealed key insights about both the agent’s learning dynamics and the practical implications for marketing teams.

    The agent learns what matters autonomously

    By processing its own historical error rates, the Prediction Optimisation Agent autonomously learned to rewrite its prompts, producing richer, more targeted post descriptions with every iteration, which in turn drove increasingly accurate predictions.

    Figure 4: Autonomous Learning: The chart tracks the agent’s predictive performance (y-axis) across successive optimisation rounds (x-axis). Each point represents a full cycle of the agent’s loop: describe → predict → evaluate → rewrite. The trend demonstrates that as the agent iteratively refined its own prompt, guided by quantitative error metrics from prior rounds, forecast accuracy improved consistently and autonomously, without any human prompt engineering.

    The agent’s optimisation works by feeding it the complete history of prior prompts alongside rigorous, quantitative error breakdowns from every previous round. Armed with this granular self-knowledge, the agent identifies precisely which content dimensions it has been under-analysing (e.g. production quality, humour style, credibility signals, visual-brand cohesion) and meticulously updates its prompt to capture those features in subsequent iterations. Each round’s prompt is built on the empirical lessons of every round before it, producing a compounding improvement curve where the descriptions become progressively more predictive and, as a direct consequence, more explanatory.
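
    Mechanically, this amounts to assembling a rewrite request from the full prompt and error history and handing it to the rewriting LLM. The exact format the system uses is not public; the sketch below assumes per-segment error tuples and a hypothetical message layout:

```python
def build_rewrite_request(prompt_history, error_history):
    """Assemble the self-optimiser's input: every prior prompt paired with
    its quantitative error breakdown, plus a rewrite instruction.

    `error_history[i]` is assumed to be a list of (segment, mean_error)
    tuples for round i+1; the overall layout is an illustrative guess.
    """
    rounds = []
    for i, (prompt, errors) in enumerate(zip(prompt_history, error_history), 1):
        breakdown = "\n".join(f"  - {seg}: mean relative error {err:.2f}"
                              for seg, err in errors)
        rounds.append(f"Round {i} prompt:\n{prompt}\n"
                      f"Round {i} errors:\n{breakdown}")
    instruction = ("You are optimising a description prompt for engagement "
                   "prediction. Review every prior round below, identify "
                   "which content dimensions the descriptions "
                   "under-analysed, and write an improved prompt.")
    return instruction + "\n\n" + "\n\n".join(rounds)
```

    The returned text is what a rewriting LLM would consume to produce the next round’s prompt, which is how each round builds on the empirical lessons of every round before it.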

    Conclusions

    The Prediction Optimisation Agent demonstrates something that extends well beyond social media: natural-language prompts can be treated as tunable parameters, optimised autonomously by the AI itself. By allowing the agent to refine its own instructions through predictive error, the system progressively discovers what drives human engagement and expresses that knowledge in plain language.

    For marketing teams, this represents a meaningful step beyond predictive tools that surface a score without exposing the reasoning behind it. When a team wants to understand why one campaign outperforms another, they do not need to interpret abstract model coefficients. They can compare the text profiles of a high-performing post and a low-performing one, side by side, and immediately see the differences the AI picked up on: one might highlight “authentic, candid composition with humour-driven caption and strong influencer-niche alignment,” while the other notes “generic studio shot with formulaic promotional language and weak audience-brand fit.” The patterns reveal themselves in plain English, and they are the right patterns, because the agent discovered them by optimising for predictive accuracy.

    In practice, this means teams can run draft campaign concepts through the system before committing production and media budgets, getting a readable assessment of how the AI interprets the creative. Designers and copywriters can test variations of a post and compare descriptions side by side to see, in their own language, which direction resonates more strongly. And by normalising visual and written media into a unified, readable format, brands can pair creative intuition with precise forecasting, treating creative assets as predictable drivers of revenue.

    The same architectural pattern of semantic translation, error-driven prediction, and autonomous self-optimisation is not limited to social media. Any domain where success depends on understanding the interplay of qualitative and quantitative signals, from political messaging to product design to entertainment, stands to benefit from systems that can read, reason, reflect, and improve on their own. The question is no longer whether AI can predict what resonates with people. It is how effectively we can build systems that refine that understanding autonomously, with human judgment guiding the outcome.

    Ready to explore the specifics? Read our full technical deep dive into Self-Improving Performance Agent Pod for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of WPP Research.