Cracking the Code of Campaign Success with Google’s AlphaEvolve Agent

In the fast-paced world of digital marketing, one deceptively simple question keeps resurfacing: “What knowledge can we extract from successful past campaigns to make better future marketing decisions?”

Every brand sits on a goldmine of historical campaign data: thousands of images, videos, and overall campaign configurations that either soared or sank. The challenge isn’t a lack of information; it’s injecting that knowledge at the precise moment the next decision is being made. How do we operationalise lessons learned to answer questions like:

  • Prediction
    “Given Brand A, a target region of São Paulo, a set of creatives featuring outdoor sports imagery, and an audience group of millennials aged 25–34, how well is the campaign expected to perform?”
  • Recommendation
    “Given Brand B and a target region of Milan, what should the creatives (videos/images) look like to maximise engagement among environmentally conscious consumers aged 15–18?”

A common suggestion is to simply “ask an AI.” While modern Large Language Models (LLMs) are remarkably capable and encode broad real-world knowledge, they lack the tribal knowledge embedded in your proprietary data. They don’t know your specific brand voice, your audience’s unique quirks, or the subtle patterns behind your past failures. To truly win, you need a system that learns from your history—the hits, the misses, and everything in between.

To address this, the WPP Research team invests significant effort in developing prediction and recommendation models trained on large and diverse volumes of historical campaign data. These models are highly competitive and continuously improving. However, at some point during development, progress inevitably hits a plateau: even incremental gains—rarely exceeding 1%—demand extensive bibliographic research, days or even weeks of trial-and-error experimentation, and painstaking fine-tuning.

With time at a premium, a vast space of possible improvements to explore (architectural changes, hyperparameter tuning), and experiments that are inherently slow to run, we turned to Google’s AlphaEvolve (AE) [1]: a Gemini-powered agentic framework that reframes model development as an evolutionary search problem. Rather than relying on manual experimentation, AlphaEvolve autonomously proposes, evaluates, and refines candidate model architectures in an iterative loop, guided by the expertise of our Data Science team and grounded in objective performance metrics.

The results are striking: what weeks of manual experimentation struggled to improve by a single percentage point, AlphaEvolve achieved in a fraction of the time, delivering prediction accuracy gains of up to 10% on both synthetic and real datasets, while simultaneously lifting downstream recommendation scores up to 7%.

Our access to AlphaEvolve came through Google’s Early Access Program (EAP), within the context of the ongoing partnership between Google and WPP Research. Throughout our work with AlphaEvolve, we have collaborated closely with the Google Research team, providing and receiving feedback. This collaboration has been invaluable to the project’s success.

The AlphaEvolve Advantage

Building a good AI model is painfully slow. A team of experts reads through mountains of research papers, rewrites code by hand, and runs experiments that can take days, only to find the improvement is tiny, or worse, a dead end. This research → code → test → repeat cycle creates a huge gap between having data and actually getting value from it.

And even after you pick a model architecture, you still have to tune it. Think of it like adjusting the equalizer on a stereo: dozens of sliders, each affecting the sound, and you’re trying to find the perfect combination by ear. Techniques like grid search and Bayesian optimization help, but they’re still limited by what the human designer guesses might work, not what the data actually needs. Trying every possible combination? Far too expensive and slow.

The honest truth is that the search space is simply too vast for human intuition and trial-and-error to navigate. This is exactly where AlphaEvolve (AE) changes the game.

Instead of a person manually tweaking one model at a time, AE treats the entire development process as an evolutionary search. Much like natural selection, but for code. It generates candidate models as functional programs, runs them, and scores each one against a target metric. It doesn’t just tune models. It designs them from scratch.

Under the hood, AE is powered by Google’s state-of-the-art Gemini model, working hand-in-hand with a curated program database from Google DeepMind. Together, they explore millions of possible code configurations, zeroing in on the most accurate solution that meets our constraints. A search of this breadth would take a human team months. AlphaEvolve does it in a fraction of the time.

By shifting from manual experimentation to this autonomous framework, we don’t just speed things up. We uncover strategies and architectures that human intuition alone would never find. Figure 3 illustrates this iterative loop in action.

Figure 3 \ AlphaEvolve is a Gemini-powered coding agent from Google that automatically improves algorithms through a “generate, test, and refine” loop. The user provides three inputs: a description of the problem, a way to score candidate solutions, and a starting program to build from. AlphaEvolve then proposes many code variations using Gemini, scores each one automatically, and keeps the best-performing ideas—recombining and evolving them over multiple rounds, much like natural selection. With each cycle, the solutions get sharper, often surpassing what the original starting point could achieve.
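
For intuition, the “generate, test, and refine” loop described in the caption can be sketched as a toy evolutionary search in Python. This is a deliberate simplification, not AlphaEvolve’s actual implementation: candidates here are plain parameter dictionaries, and `propose_variant` stands in for the Gemini-powered code-mutation step.

```python
import random

def evolve(seed, score_fn, propose_variant, generations=20, population_size=8):
    """Toy evolutionary loop: propose variants, score them, keep the winners."""
    population = [seed]
    for _ in range(generations):
        # "Mutation": derive new candidates from randomly chosen survivors.
        children = [propose_variant(random.choice(population))
                    for _ in range(population_size)]
        # "Selection": score everything, keep only the top performers.
        population = sorted(population + children, key=score_fn, reverse=True)
        population = population[:population_size]
    return population[0]
```

With candidates of the form `{"x": value}`, a score that peaks at x = 3, and a mutation that nudges x by up to ±0.5, the loop steadily drifts the population toward the optimum: exactly the “keep the winners, mutate, repeat” dynamic the figure describes.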

Guided Evolution: The Human in the Loop

AlphaEvolve is autonomous, but it is not unsupervised. Think of it as digital evolution: the AI proposes ideas and keeps only the winners to build upon in the next generation. This process still requires careful navigation by our Data Scientists, who provide clear system instructions and constraints to guide the search through an infinite landscape of potential improvements, while inspecting for deviations introduced by the stochastic nature of LLMs. The result is a search that stays focused on logical, high-quality architectures and respects the real-world boundaries of the problem we are addressing.

In the example below, we illustrate the inputs that AE expects from the human in the loop, as well as the output that it produces.

Input 1: A System Prompt describing the problem and steering evolution towards promising search directions.

An example system prompt is: “Evolve a training model for a neural network 3-class classifier that achieves high accuracy on a provided dataset. The model must consist of a loss function that… . Focus on the multi-objective optimization of the following scores… Consider changing the model architecture to include…”

Think of the System Prompt as the instruction manual you hand to AlphaEvolve before it starts work. Imagine hiring a highly skilled but very literal engineer. They’re brilliant, but they need a clear, written brief to work from — they won’t assume anything. The System Prompt is that brief. It channels AlphaEvolve’s enormous computational power toward the right problem, in the right direction. It covers:

  • What the job is — e.g., “Build a model that can classify campaign outcomes into three categories.”
  • What the rules are — constraints it must respect, such as how the input data is structured or what the model architecture must look like.
  • Where to focus — specific areas to explore and improve, for example: “Try changing the loss function”.
  • What success looks like — the specific performance goals it should be optimising for (e.g., accuracy scores). This is also why the human expertise of the Data Science team remains critical.

Input 2: A Seed Program with an initial solution that you hand to AlphaEvolve to improve.

Rather than asking AlphaEvolve to build something from scratch, you give it a model that already works — and ask it to make it better. The team deliberately marks which parts AlphaEvolve is permitted to experiment with (using special labels in the code), and which parts must remain untouched. The Seed Program represents the accumulated expertise and investment already put into your AI models. AlphaEvolve doesn’t throw that away — it builds on top of it. It’s the difference between renovating a solid building versus demolishing it and starting over.
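
As an illustration of what those “special labels” can look like, here is a minimal seed-program sketch. The `EVOLVE-BLOCK` comment markers follow the convention described in the AlphaEvolve paper [1]; the linear placeholder model and function names are our own illustrative choices.

```python
# Fixed scaffolding: data handling and evaluation stay outside the evolved region.
def load_features(rows):
    return [float(v) for v in rows]

# EVOLVE-BLOCK-START
# AlphaEvolve is only permitted to rewrite the code between these two markers.
def build_model():
    """Placeholder scorer: a linear model with hand-picked weights."""
    weights = [0.5, -0.2, 0.1]
    def predict(x):
        return sum(w * v for w, v in zip(weights, x))
    return predict
# EVOLVE-BLOCK-END
```

Everything outside the markers is guaranteed to survive evolution unchanged, which is how the accumulated engineering investment in the seed program is preserved.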

Input 3: The Target metric that AE will attempt to maximise in order to achieve our objective.

The Target Metric is essentially how the business defines “better.” This is a critical decision made by the Data Science team — not the AI. If the metric is well-chosen, AlphaEvolve will find solutions that genuinely deliver business value. If it’s poorly defined, the AI could optimize for the wrong thing entirely. Imagine you’re running a sales team and you’ve set a clear goal: maximise the conversion rate. Every change your team tries — new pitch, new pricing, new outreach method — gets evaluated against that one number. If a change improves the conversion rate, you keep it. The Target Metric works exactly the same way for AlphaEvolve. It might be something like “predict campaign performance as accurately as possible” — expressed as a single numerical score. AlphaEvolve runs each candidate model, checks the score, and keeps only the ones that do better. So the Target Metric is the objective, measurable definition of what winning looks like.
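
In code, the Target Metric boils down to a single function that runs a candidate on held-out data and returns one number to maximise. A minimal sketch, using plain accuracy as the illustrative score (the function name and signature are our own):

```python
def evaluate(candidate_model, X_val, y_val):
    """Target metric: run the candidate on held-out data and return one
    scalar, higher is better (plain accuracy here, for illustration)."""
    correct = sum(1 for x, y in zip(X_val, y_val) if candidate_model(x) == y)
    return correct / len(y_val)
```

AlphaEvolve applies a function like this to every candidate and keeps only the higher scorers, so a poorly chosen metric silently steers the whole search in the wrong direction.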

Input 4: The Stopping criteria.

The Stopping Criteria is simply the pre-agreed rule for when to call it done. Since AlphaEvolve could theoretically keep running and experimenting forever, the team sets clear boundaries upfront for when the experiment should end. A maximum number of rounds — e.g., “Run up to 500 iterations, then stop.” A performance threshold — e.g., “Stop as soon as the model reaches 90% accuracy.” This is like saying: “Once we’ve hit our goal, there’s no need to keep going.”
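
Expressed as code, the stopping rule is just a predicate checked after every round; the 500-iteration budget and 90% threshold below mirror the examples above and are otherwise arbitrary:

```python
def should_stop(iteration, best_score, max_iterations=500, target_score=0.90):
    """Stop once the iteration budget is spent or the goal is reached."""
    return iteration >= max_iterations or best_score >= target_score
```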

Output: a ranked list of improved AI models.

Figure 4 shows a ‘before’ (left) and ‘after’ (right) comparison of a section of a seed program that AlphaEvolve was asked to improve. Changes are highlighted in green. We observe several changes:

  • Training parameters were upgraded. For example, the number of training cycles (EPOCHS) was increased, the model’s internal size (PROJ_DIM) grew and a regularisation setting (WEIGHT_DECAY) was adjusted. These are the kind of fine-tuning decisions that would normally take a data scientist considerable time and experimentation to arrive at.
  • The model’s internal logic was redesigned. The component responsible for processing data (the “encoder”) was restructured and even renamed to better reflect its purpose. AlphaEvolve didn’t just tweak numbers. It proposed a more sophisticated architecture. New techniques were introduced.
Figure 4 \ Example of an evolved block of code, where AE is permitted to modify the contents of this segment. The function contents are modified, and renamed identifiers are updated consistently across the other code blocks. Note that new training parameter values are suggested as well, indicating that AE combines architectural changes with hyperparameter tuning.
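
To make the parameter changes described above concrete, here is a hypothetical before/after of the evolved configuration. The numbers are invented for illustration and are not the values AlphaEvolve actually chose:

```python
# Seed values, as a data scientist might have set them by hand (invented numbers).
seed_config = {"EPOCHS": 20, "PROJ_DIM": 64, "WEIGHT_DECAY": 1e-4}

# AE-evolved values: longer training, a wider internal representation,
# and adjusted regularisation to match the larger model (also invented).
evolved_config = {"EPOCHS": 60, "PROJ_DIM": 128, "WEIGHT_DECAY": 5e-4}
```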

Results: does it actually work?

AlphaEvolve was applied to two core problems:

  • Performance Prediction, which estimates a campaign’s performance based on its configuration.
  • Performance-aware recommendation, which suggests the optimal way to complete/update a campaign’s configuration, in order to maximise its performance.

Both models had already reached a highly competitive baseline, with further manual improvements stalling below 1%.

Datasets

We evaluated all models on a suite of six datasets: five synthetic (generated with an internally developed pipeline; details can be found here) and one real-world. The synthetic datasets span a range of regimes, from easy to medium to hard depending on the noise profile and class balance (classes with fewer samples are characterized as minority): easy and imbalanced (V15), medium and imbalanced (V16), hard and imbalanced (V17), and medium and balanced (V25, V26). The real-world dataset consists of actual historical campaign records and serves as the ultimate validation of whether gains observed on synthetic data transfer to production conditions.

Prediction

Three top-performing AE-evolved variants (Centroid_Loss, Cross_Modal_Attn, Focal_Loss) were identified across multiple experiments. All three consistently outperformed the base model across synthetic and real-world datasets.

In order to assess model performance, we use the industry-standard F1 score, a way of measuring how good a model is at classification (class ‘POS’ is high-performing, class ‘NEG’ is low-performing, class ‘AVG’ is average-performing). It balances two things: Precision — “When the model says something is positive, how often is it right?” — and Recall — “Out of all the actual positives, how many did the model catch?” If the model is good at one but terrible at the other, the F1 score will be low. We calculate the F1 score separately for each class (NEG, AVG, POS), then take the plain average: the avg F1-score.
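
The avg F1-score just described can be computed in a few lines. A pure-Python sketch for reference (in practice a library routine such as scikit-learn’s `f1_score` would be used instead):

```python
def f1_score(y_true, y_pred, label):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred, labels=("NEG", "AVG", "POS")):
    """Plain (macro) average of the per-class F1 scores."""
    return sum(f1_score(y_true, y_pred, label) for label in labels) / len(labels)
```

Averaging per class, rather than per sample, is what makes the metric sensitive to minority-class performance: a model that ignores a rare class is punished even if its overall accuracy looks high.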

  • On easy/medium synthetic data (V15, V16): Cross_Modal_Attn achieved the strongest overall performance, reaching 93.09% avg F1-score on V15 (vs. 90.22% baseline) and a striking +11.6 percentage point improvement on the hardest-to-classify minority class POS on V16 (POS F1: 80.20% vs. 68.61%).
  • On the hardest synthetic dataset (V17): Focal_Loss broke through a performance floor that other variants could not — the base model scored 0% on both minority classes (NEG and POS), while Focal_Loss achieved 15.83% and 25.39% respectively.
  • On real-world data: Centroid_Loss delivered the most practically significant gains — +8pp avg F1 (71% vs. 63%), +11.74pp NEG F1, +8.33pp POS F1, and +5.11pp accuracy — validating that AE’s improvements hold on actual production data.

Across all datasets and variants, gains on minority classes (correctly identifying high-performing and low-performing campaigns) were consistently larger than gains on the majority class — a particularly valuable outcome given that minority-class accuracy is the critical input for the recommendation model.
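
For reference, the Focal_Loss variant’s name points to focal loss, a standard remedy for class imbalance: it down-weights the loss on easy examples so training concentrates on hard, often minority-class, ones. The text above does not specify the exact form AE evolved; below is the standard single-example version:

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one example: -(1 - p)^gamma * log(p), where p is the
    predicted probability of the true class. The (1 - p)^gamma factor shrinks
    the loss on confident (easy) examples, concentrating the training signal
    on hard examples."""
    p = min(max(p_true, 1e-7), 1.0 - 1e-7)  # clamp for numerical safety
    return -((1.0 - p) ** gamma) * math.log(p)
```

Compared to plain cross-entropy, a confidently classified example (p = 0.9) contributes almost nothing, while a badly misclassified one (p = 0.1) still contributes a large loss, which is consistent with the minority-class breakthrough seen on V17.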

Recommendation

The recommendation model, which relies on the prediction model’s outputs, was evaluated both in isolation and in a fully evolved end-to-end pipeline. The recommendation score (higher is better) measures how good the recommendations are by comparing them against a known “ground truth” (applicable to synthetic datasets). It rewards recommendations that correctly identify high-performing campaign configurations, and penalizes two kinds of failures: (i) empty recommendations (the model couldn’t suggest anything) and (ii) low-quality recommendations (the model suggested something, but it performs poorly).
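
A toy version of such a scoring function might look like the sketch below; the actual WPP metric is more involved, and the reward/penalty structure here is illustrative only:

```python
def recommendation_score(recommendations, ground_truth):
    """Toy recommendation score: fraction of suggested configurations that
    are genuinely high-performing; empty output scores zero."""
    if not recommendations:
        return 0.0  # failure (i): nothing was suggested
    hits = sum(1 for r in recommendations if r in ground_truth)
    return hits / len(recommendations)  # failure (ii): low-quality picks dilute the score
```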

  • Swapping in the AE-evolved predictor alone improved recommendation scores meaningfully: +6.5% on easy data (V15), +9.8% on medium data (V16), and lifted the hard dataset (V17) from a score of 0.0 (which essentially means that all recommendations were wrong) to 0.29.
  • Combining the AE-evolved predictor with an AE-evolved recommender produced the strongest results across all datasets, with the fully evolved pipeline achieving scores of 0.5 (V15), 0.4 (V16), and 0.36 (V17) — confirming that the gains from prediction and recommendation evolution are additive.
  • Recommendation improvements of up to 7% were observed when both components were evolved together.

Conclusion

AlphaEvolve works — and it works exceptionally well. It represents a meaningful and measurable step forward in model development. Applied to WPP AI Lab’s campaign prediction and recommendation models, which had already reached a performance plateau through conventional means, AlphaEvolve delivered prediction accuracy gains of up to 10% on both synthetic and real datasets, while simultaneously lifting downstream recommendation scores by up to 7%. It surfaces architectural strategies and configurations that lie beyond the reach of human intuition alone, not by replacing the expertise of our Data Science team, but by amplifying it. The human-in-the-loop dynamic remains essential: our scientists shape the search space, define meaningful constraints, and validate the outputs.

AlphaEvolve does the heavy lifting of exploration. As prediction and recommendation models continue to grow in complexity, AlphaEvolve offers a glimpse of a future where the gap between data collection and model improvement is measured in hours rather than weeks, and where the best-performing systems are not just built by experts, but co-designed with AI.

This project was a collaboration between the WPP Research team, including Anastasios Tsourtis and Theodoros Lappas, and the AI for Science team at Google Cloud, including (but not limited to) Kartik Sanu, Laurynas Tamulevičius, Nicolas Stroppa, Chris Page, Gary Ng, John Semerdjian, Skandar Hannachi, Vishal Agarwal, Anant Nawalgaria, and Gabriela Hernandez Larios, as well as partners at Google DeepMind.

References

  1. Novikov, A., Vũ, N., Eisenberger, M., Dupont, E., Huang, P.-S., Wagner, A. Z., Shirobokov, S., Kozlovskii, B., Ruiz, F. J. R., Mehrabian, A., Kumar, M. P., See, A., Chaudhuri, S., Holland, G., Davies, A., Nowozin, S., Kohli, P., & Balog, M. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131 [cs.AI]. https://arxiv.org/abs/2506.13131

Ready to explore the specifics? Read our full technical report for a closer look at our methodology.

Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.

Authors

  • Anastasios is a Senior Data Scientist at Satalia’s Research Lab. Holding a Ph.D. in Applied Mathematics, his work bridges the gap between deep learning and complex data science problems. He has a strong background in both theoretical research and practical applications, specializing in generative AI tailored for multi-modal data and structured synthetic data.

  • Anant is a Senior Staff Machine Learning Architect and Product Leader at Google. As a co-author of research at top conferences like ICLR and ICML, his work bridges the gap between cutting-edge AI research and scalable production. He has a strong background in both theoretical research and practical applications, specializing in Generative AI, Multi-Agent Systems, and novel architectures for multimodal LLMs.
