Marketing teams struggle to translate past campaign performance data into actionable future decisions, and while neural network models can help, manual experimentation to improve them is slow, costly, and quickly hits a ceiling. To overcome this, we leveraged AlphaEvolve (AE)—Google DeepMind’s Gemini-powered agentic framework—which autonomously proposes, evaluates, and evolves model architectures in an iterative loop, removing the need for manual trial-and-error. The result: up to 10% improvement in prediction accuracy and up to 7% in recommendation scores over our competitive baselines, achieved in a fraction of the time, and positioning WPP with a first-mover advantage through early access to this technology.
If you don’t care about the technical details, read our blog post instead.
Introduction – Scope of this work
Google AlphaEvolve (AE) [1] is a Gemini-powered agentic framework that reframes model development as an evolutionary search problem. Rather than relying on static optimisation or manual experimentation, AE continuously improves existing algorithms by executing an iterative “generate, test, and refine” loop: at each step, candidate programs are proposed by a large language model, evaluated against a user-defined metric, and fed back into an evolving database of solutions. As a result, the system is capable of autonomously exploring a vast space of model configurations in a semi-supervised manner, without requiring explicit enumeration of all candidate designs.
The target of this work is to improve two core components of our marketing campaign intelligence pipeline: a prediction model and a recommendation model. Both models had already reached a highly competitive baseline; however, further progress had stalled.
Despite sustained effort — encompassing bibliographic research, trial-and-error experimentation, and systematic fine-tuning — incremental improvements remained below 1%, and in certain noisy data regimes, accuracy degraded rather than improved. This plateau is particularly consequential in the context of marketing campaign optimisation, where even gains of a few basis points in prediction accuracy can translate into thousands of dollars of savings. Given the scale of the model search space and the cost of manual exploration, AE presents a unique opportunity to dramatically accelerate the discovery of superior model architectures.
The results of these experiments are compelling. The AE-evolved prediction models achieved an improvement in prediction accuracy of up to 10% over the baseline model. Across both neural and non-neural baselines, as well as fine-tuned Gemma variants, AE-evolved prediction models achieved consistently higher accuracy, validating the hypothesis that evolutionary search can identify architectures that manual experimentation fails to reach in a comparable time frame.
This report is structured as follows: Section 2 describes how AlphaEvolve works, covering its inputs, inner workings, and output format. Section 3 details the best practices we identified through extensive experimentation. Section 4 describes the base models used as seed programs to be evolved. Section 5 presents and discusses the quantitative results across synthetic and real-world datasets.
How AE works
AlphaEvolve is a coding agent that orchestrates a fully autonomous, distributed pipeline of computations — including queries to large language models — to produce algorithms that address a user-specified task. Implemented as an asynchronous pipeline using Python’s asyncio library, the system operates as an evolutionary algorithm: at each iteration, it proposes candidate programs, evaluates them against a target metric, and progressively improves its population of solutions. The result is a search procedure that continuously refines programs toward higher scores without requiring manual intervention between iterations.
Figure 1 shows the AlphaEvolve pipeline. Note the bidirectional arrow: the program database both supplies programs for constructing new prompts and is updated with promising solutions as iterations proceed.

Figure 1 \ (Image taken from [1]) AlphaEvolve discovery process. The user provides an initial program (with components to evolve marked), evaluation code, and optional configurations. AlphaEvolve then initiates an evolutionary loop. The Prompt sampler uses programs from the Program database to construct rich prompts. Given these prompts, the LLMs generate code modifications (diffs), which are applied to create new programs. These are then scored by Evaluators, and promising solutions are registered back into the Program database, driving the iterative discovery of better and better programs.
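In highly simplified form, the evolutionary loop of Figure 1 can be sketched as follows (illustrative only: real AE uses LLM-proposed diffs, prompt sampling, islands, and MAP-Elites-style curation rather than a flat elite list):

```python
import random

def evolve(seed_program, mutate, evaluate, iterations=200, elite=10):
    """Minimal sketch of AE's generate-test-refine loop.

    mutate stands in for the LLM-proposed code modification; evaluate is
    the user-provided scoring function; the list plays the role of the
    program database.
    """
    database = [(seed_program, evaluate(seed_program))]
    for _ in range(iterations):
        parent, _ = random.choice(database)       # sample from the program database
        child = mutate(parent)                    # LLM-proposed modification in real AE
        database.append((child, evaluate(child))) # score the new candidate
        database.sort(key=lambda entry: entry[1], reverse=True)
        del database[elite:]                      # keep a bounded, high-scoring population
    return database[0]                            # (best_program, best_score)
```

Here "programs" can be any mutable representation; in AE they are full source files, and the population curation is considerably more sophisticated.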
Core pipeline inputs
Launching an AE experiment requires six key inputs.
The first is a seed program: an existing, functional codebase with appropriately annotated functions and docstrings, which serves as the starting point for the evolutionary search. AE provides flexibility in its abstraction level, allowing it to tackle problems by either directly evolving the solution itself or by evolving a constructor function that builds the solution (which is often more effective for highly symmetric problems). Users can also provide function templates containing only a detailed docstring and a constant return value, prompting AE to generate the full implementation from scratch.
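For illustration, a function template of this kind might look as follows (a hypothetical sketch; the name, signature, and docstring requirements are ours, not part of AE):

```python
def build_classifier(input_dim: int, num_classes: int):
    """Construct a neural classifier mapping input embeddings to class logits.

    Requirements (read by the LLM): use modality-specific encoders, keep the
    parameter count modest, and return an object exposing a fit/predict API.
    """
    return None  # constant placeholder: AE generates the full implementation
```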
The second is a target metric: a scalar value that AE will attempt to maximise iteratively. In our prediction model experiments, we used the macro-average F1-score rather than accuracy, as it proved more effective at steering the search toward better-generalising solutions across imbalanced classes. When multiple metrics are relevant, a weighted linear combination can be used to indirectly optimise them simultaneously.
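Such a weighted combination can be as simple as the following sketch (metric names and weights are illustrative assumptions):

```python
def combined_target(metrics, weights):
    """Collapse several metrics into the single scalar AE maximises.

    metrics and weights are dicts keyed by metric name; metrics that should
    be minimised (e.g. latency) receive negative weights.
    """
    return sum(weights[name] * metrics[name] for name in weights)
```

For example, weighting macro F1 against evaluation latency steers the search toward accurate models that remain cheap to train.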
The third input is an evaluation function that takes a candidate program and returns the target metric value. Because it is called iteratively to check every new candidate, computational efficiency is critical. In our case, this function fully trains the candidate neural architecture and measures its macro F1-score on a held-out set. The runtime of this function is variable: it depends on the architectural complexity of the candidate (e.g., multi-head attention blocks are slower to compute than linear layers), and because batch size and number of epochs are also subject to evolution, training times can fluctuate significantly across candidates. Early stopping criteria are applied to bound evaluation cost.
The fourth input is a system prompt that guides the direction of evolution: it provides explicit suggestions on which aspects of the code should be evolved and how — for instance, descriptions of loss functions or search paths to consider. Together with the seed code, the system prompt forms the initial input to the program database sampling process: in effect, it determines both where the search begins and in which direction it is steered.
The fifth input defines the evolution blocks: contiguous code regions marked with special comments (# EVOLVE-BLOCK-START / # EVOLVE-BLOCK-END) that AE is permitted to modify, while the rest of the codebase remains intact. Carefully scoping these blocks is important — they must be consistent with inter-function dependencies and library imports in the surrounding fixed code. Additional constraints, such as the set of Python libraries available or explicit enumeration of architectural ideas to try, can be incorporated via docstrings or the system prompt.
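A minimal illustration of the marker convention (the loss function here is a hypothetical example, not our actual seed code):

```python
import math

# Fixed skeleton: everything outside the markers is never edited by AE.

# EVOLVE-BLOCK-START
def loss_fn(probs, target_idx):
    """Per-sample classification loss. AE may replace this with any scalar
    loss (focal, centroid-based, label-smoothed, ...) as long as the
    signature stays compatible with the fixed call site below."""
    return -math.log(probs[target_idx])  # plain cross-entropy on probabilities
# EVOLVE-BLOCK-END

def training_step(probs, target_idx):
    # Fixed code: calls into the evolvable block but is itself untouched.
    return loss_fn(probs, target_idx)
```

Note that the fixed call site constrains what AE may change: any evolved loss_fn must keep the same interface, which is exactly the interface-consistency concern discussed later.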
The sixth and final input is a stopping criterion — a maximum number of iterations or a target metric threshold — that bounds the experiment.
Inner workings
At the core of AlphaEvolve is an evolutionary database inspired by the MAP-Elites algorithm [2] and island-based population models [3, 4]. Candidate programs and their scores are stored and curated in this database, which is continuously updated as new solutions are evaluated. The database is designed to balance two competing objectives: exploration, by maintaining diversity across the solution space, and exploitation, by prioritising refinement of the highest-scoring programs.
To generate diverse candidate proposals, AE relies on a mixture of large and small language models — in our experiments, a combination of Gemini 3 Pro and Gemini 3 Flash — which sample from the current program database alongside the seed code and system instructions to produce modified candidate programs. Gemini 3 Flash, with its lower latency, enables a higher rate of candidate generation, increasing the volume of ideas explored per unit of time, while Gemini 3 Pro provides occasional higher-quality suggestions that can significantly advance the search.
When an LLM is asked to modify existing code within a larger codebase, it outputs changes as a sequence of SEARCH/REPLACE diff blocks in a structured format — specifying exactly which code segment to replace and with what — allowing for targeted, composable modifications. In cases where the evolved code is very short or a complete rewrite is appropriate, AE can instead instruct the LLM to output the full code block directly.
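Concretely, such a diff follows a SEARCH/REPLACE convention, roughly of the form below (an illustrative sketch; the code lines are hypothetical, not taken from our pipeline):

```
<<<<<<< SEARCH
loss = cross_entropy(logits, targets)
=======
loss = focal_loss(logits, targets, gamma=2.0)
>>>>>>> REPLACE
```

The SEARCH segment must match an existing span of the candidate program exactly; the REPLACE segment is substituted in its place, which keeps each modification targeted and composable.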
Two additional mechanisms further enhance the quality of the search:
- The first is meta-prompt evolution: AE co-evolves the system prompt itself through a separate LLM step that generates and refines meta-instructions in a dedicated database alongside the candidate programs. This allows the guidance given to the code-generation models to improve over time, potentially surpassing what a human prompter could craft manually — a finding confirmed in ablation experiments reported in the original paper.
- The second is LLM-generated feedback: beyond the user-provided evaluation function, AE can invoke separate LLM calls to assess qualitative properties of candidate programs that are difficult to measure programmatically, such as code simplicity or readability. These soft scores can steer evolution or filter out undesirable candidates. In our experiments, we observed a beneficial side effect of this mechanism: AE spontaneously added informative comments to complex code segments within evolution blocks that it chose not to modify, improving both readability and the quality of subsequent evolutionary steps.
Output
The primary output of an AE experiment is a functional program: the candidate program that achieves the highest target score across all evaluated iterations. Because LLMs occasionally hallucinate, some candidate programs may fail to execute; however, AE handles this gracefully — execution is isolated per candidate, and failed programs are assigned a default score (see next section) while the asynchronous pipeline continues uninterrupted.
In our experiments, we ran each AE session for up to 1,000 evolutionary iterations, with 4 concurrent candidate evaluations at each step. The final evolved programs, together with all intermediate candidates, can be retrieved post-experiment using their session_id, experiment_id, and program_id identifiers, enabling retrospective inspection of the evolutionary trajectory and diff-level comparison against the seed program.

Figure 2\ Example of an evolution block suggestion. Here AE suggested a parametric non-linear combination of the Cross Entropy loss addressing class imbalance in dataset V17.
Best practices when using AE
AlphaEvolve has a low barrier to entry: the user needs only supply a seed program, an evaluation function, and a maximum number of iterations to launch an experiment. In practice, however, the quality of the outcomes depends strongly on a set of design choices that span prompt engineering, code instrumentation, metric design, and infrastructure planning. Based on our own experiments, we distilled the following guidelines for practitioners.
1. Invest in the system prompt
The system prompt is the single most impactful lever available to the user. A prompt that names concrete directions to explore — particular loss families, regularisation strategies, architectural motifs, or hyper-parameter ranges — consistently produces more diverse and higher-scoring candidates than a short, generic instruction. Experimentally, a vague or ambiguous prompt tended to generate overlapping candidate programs with low structural diversity, which hinders the evolutionary search even though AE’s database design is intended to discourage convergence to local minima.
This observation is corroborated by the paper’s ablation experiments: removing explicit context from the prompt caused a significant drop in discovery performance across both the matrix multiplication and kissing number tasks. Furthermore, AE supports meta-prompt evolution, as we mentioned earlier. In practice, this means the system can progressively improve its own guidance — but this process is seeded by the human-authored prompt, so a strong starting prompt remains essential. Beyond free-text instructions, the prompt can be enriched with explicit context in the form of equations, code snippets, or even PDF attachments of relevant literature.
2. Place evolution blocks deliberately
Evolution blocks (delimited by # EVOLVE-BLOCK-START / # EVOLVE-BLOCK-END markers) define the search space (Figure 2). Code outside these markers is treated as a fixed skeleton. Several considerations arise in practice:
- Scope blocks to the right level of abstraction. AE can evolve a single function, several functions together, or an entire codebase. Broader blocks give AE more latitude but also increase the chance of inconsistencies across function boundaries. Narrower blocks constrain search but keep the skeleton stable.
- Ensure interface consistency. If the system prompt permits AE to introduce new functions, the evolution block must cover all call sites; otherwise, the fixed code will reference function names that do not exist in the evolved program. This was one of the most common sources of crashing candidates in our experiments.
- Expose hyper-parameters. Allowing AE to evolve batch size, learning rate, network dimensions, and weight decay turns these into first-class search variables rather than fixed settings. All three of our best-performing variants had hyper-parameters discovered by AE rather than by grid search.
- Consider the abstraction level carefully. The paper notes that for problems with highly symmetric solutions, evolving a constructor function (which builds the solution from scratch) tends to be more effective than evolving the solution directly. For asymmetric problems — such as our neural architecture — evolving the implementation itself is usually more appropriate, so we opted for that.
3. Design the evaluation function with care
The evaluation function is the primary feedback signal for the evolutionary loop. Several properties matter:
- Speed. Each candidate program must be evaluated before its score can propagate back into the database. Slow evaluations directly limit the number of generations that can be completed within a fixed compute budget. In our case, training a neural network on each candidate introduced variable evaluation times depending on architectural complexity — a cost we accepted in exchange for richer feedback.
- Default score for failures. If the target metric is non-negative and higher is better, assign a default score of -1 (or negative infinity) to candidates that crash or produce invalid output. This prevents degenerate programs from polluting the database with artificially neutral scores.
- Metric choice. Use a single scalar as the primary optimisation target. For imbalanced classification, we found that positive-class F1-score led to better evolved solutions than accuracy, because it forced AE to address minority-class performance. When multiple qualities matter simultaneously, a weighted linear combination of scalars can be used. Importantly, the paper notes that optimising multiple metrics often improves single-metric results as well: diverse, high-performing programs across different criteria expose the LLMs to a broader solution space, increasing the chance of discovering novel approaches for the target metric.
- Evaluation cascade. AE supports multi-stage hypothesis testing in which new candidates are first screened on simpler or smaller-scale inputs before being subjected to the full evaluation. This prunes clearly faulty or unpromising programs early and is particularly valuable when full evaluations are expensive. We observed that target scores improved gradually but steadily over hundreds of iterations, with solution complexity increasing as the search matured — a pattern consistent with the cascade’s role in maintaining a healthy exploration–exploitation balance. Accordingly, plan to run for at least a few hundred iterations: AE’s evolutionary database requires time to accumulate a diverse pool of high-quality programs from which future generations can be sampled.
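A cascade of this kind can be sketched as follows (a minimal illustration; the helper names and the gate threshold are our own assumptions, not AE's API):

```python
def evaluate_with_cascade(candidate, score_fn, cheap_data, full_data, gate=0.5):
    """Two-stage evaluation cascade: screen each candidate on a cheap,
    small-scale input first and run the expensive full evaluation only if
    it clears the gate. score_fn is the user's (candidate, data) -> score
    evaluation function."""
    quick_score = score_fn(candidate, cheap_data)
    if quick_score < gate:
        return quick_score                # prune clearly unpromising candidates early
    return score_fn(candidate, full_data) # full-scale evaluation for survivors
```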
4. Manage experiments actively
- Monitor intermediate candidates. During execution, target scores of successful candidates are visible in real time. Periodically inspecting the diff between the seed program and a high-scoring intermediate candidate — for example, using a code editor’s diff view (see Figure 2) — can reveal whether the system prompt is effective, whether evolution blocks are correctly placed, and whether the candidate programs are structurally sound. Early detection of issues saves significant compute.
- Always verify solutions manually. LLMs can hallucinate plausible-looking but subtly incorrect code. In one of our experiments, a candidate that achieved a high target score did so by inadvertently overfitting: the missing model.eval() / model.train() calls meant that dropout and batch normalisation were applied incorrectly at evaluation time. Human review of the top-performing programs before committing to them is essential.
- Seed subsequent experiments from the best prior result. Once an experiment concludes, launching a new run seeded with the best-performing program — while retaining the same system prompt — allows AE to continue refining from a strong starting point rather than restarting from the original seed.
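The overfitting failure mode above can be guarded against by wrapping evaluation so that eval mode is always entered and training mode always restored. The sketch below uses a toy model rather than PyTorch, but the pattern maps directly onto model.eval() / model.train():

```python
class TinyModel:
    """Toy stand-in for a network whose stochastic layers (dropout,
    batch norm) behave differently in training and evaluation mode."""
    def __init__(self):
        self.training = True

    def train(self):
        self.training = True

    def eval(self):
        self.training = False

    def predict(self, x):
        # Crude dropout stand-in: in training mode, outputs are perturbed.
        return x * 0.5 if self.training else x

def evaluate_model(model, x):
    model.eval()       # the call the buggy candidate omitted
    try:
        return model.predict(x)
    finally:
        model.train()  # restore training mode for subsequent epochs
```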
5. Plan infrastructure appropriately
Choose a Virtual Machine (VM) whose hardware matches the demands of the evaluation function. For neural architecture search, a GPU-accelerated machine substantially reduces per-evaluation training time. More critically, evolved programs can introduce architectural changes — wider layers, additional attention heads, larger batch sizes — that increase memory consumption beyond what the seed program required. Out-of-memory errors during evaluation stall the experiment without providing useful signal. Sizing RAM and VRAM conservatively against the maximum plausible candidate complexity, rather than the seed’s footprint, prevents this failure mode.
- The user can invoke the AE API as another Google Cloud Platform (GCP) service. The API can be called flexibly (a user account with appropriate permissions is set up once), provided the user is authenticated via the gcloud CLI. There are two options for running an experiment: 1) create, start, and connect to a virtual machine (VM) in a GCP project, so that cloud resources are utilised (recommended); or 2) run locally, using a proprietary machine.
6. Guard against memory leaks in the fixed code skeleton
AE executes each candidate program by adding it to an asynchronous queue, where it is run inside the user-provided evaluation function. A key responsibility boundary applies here: AE is only in control of the code inside the evolution blocks. Everything outside — the fixed skeleton, data loading, model initialisation, and the evaluation function itself — is entirely the user’s code and runs as-is for every single candidate evaluation across potentially thousands of iterations.
This makes memory management in the fixed code important. In our experiments, we encountered a case where a fixed function outside the evolution blocks was not properly releasing memory objects after each evaluation call. Because this code ran once per candidate — and experiments run for hundreds or thousands of iterations — the leak caused memory objects to accumulate steadily, eventually leading to experiment failure.
To prevent this, we recommend the following hygiene practices for all code outside evolution blocks:
- Explicitly empty GPU/CPU caches after each evaluation call (e.g. torch.cuda.empty_cache() in PyTorch).
- Delete intermediate objects that are no longer needed using del, particularly large tensors or model instances created within the evaluation loop.
- Avoid global variable declarations outside the evaluation function — globals persist across calls and are a common source of unintended object retention.
- Add explicit garbage collection calls (e.g. import gc; gc.collect()) at the end of the evaluation function, especially when dealing with large objects or when GPU memory is constrained.
The broader principle is to treat the evaluation function as a self-contained, stateless unit: everything it allocates should be released before it returns.
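Putting these practices together, a stateless evaluation wrapper might look like this (a sketch under our own naming assumptions; the PyTorch cache call appears only as a comment since it applies to GPU runs):

```python
import gc

def run_evaluation(build_and_train, data):
    """Treat the evaluation as a self-contained, stateless unit: everything
    it allocates is released before it returns. build_and_train is a
    hypothetical user function returning a (model, score) pair; it may
    raise on a faulty candidate."""
    model = None
    try:
        model, score = build_and_train(data)
    except Exception:
        score = -1.0       # default score for crashing candidates
    finally:
        del model          # drop large objects explicitly
        gc.collect()       # collect reference cycles on every call
        # On GPU, also release cached blocks: torch.cuda.empty_cache()
    return score
```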
7. Choose codebase style
The codebase can either be self-contained in a Jupyter notebook or structured as a repository. In the first case, the seed program is provided as a long string, whereas in the latter the user indicates the file(s) to be evolved. As mentioned, evolution blocks are indicated by special comment markers (# EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END). Repository-style codebases are easier to track with version control and make it straightforward to inspect candidate programs using an IDE’s file-diff functionality.
Base models
Prediction model
The base prediction model is a neural classifier that takes pre-computed feature embeddings as input and assigns each sample to one of three campaign performance classes: positive, average, or negative. Architecturally, it consists of modality-specific encoding layers that project the input embeddings into a lower-dimensional latent space, followed by a fusion layer that combines these projections to produce the final class prediction. Through systematic experimentation across a range of loss functions and their combinations, we identified a weighted mixture of classification and alignment losses as the best-performing training objective. The model is optimised using AdamW (Adam with weight decay) and employs early stopping to prevent overfitting. To rigorously assess the strengths and limitations of each model variant, we constructed a suite of synthetic datasets spanning a range of difficulty levels — varying in complexity, class cardinality, dataset size, and noise level.
Recommendation model
The recommendation model is used to ‘fill in the missing puzzle pieces’: a user provides a partial query with missing feature values, and the model completes the remaining fields with appropriate values based on knowledge of past successful campaigns. The recommendation model relies on the prediction model as an input, since this knowledge is distilled in the prediction model’s weights. We compile indices for quick look-up and return the top-K recommendations based on approximate K-NN search in the projection latent space. A deliberate design decision was to run separate, independent optimisation experiments for each model — sequential rather than joint optimisation. For the synthetic datasets, where the ground truth is known and represented as a graph, we devised a metric to score the success of candidate recommendations: it accounts for empty recommendations while also penalising low-performing ones. We refer to the normalised value of this metric as the avg norm score (Table 3).
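As an illustration, top-K retrieval in the latent space can be sketched as follows (a simplified stand-in: function and variable names are ours, and the production system uses approximate rather than exact K-NN):

```python
def top_k_neighbours(query, index, k=3):
    """Exact top-K retrieval by cosine similarity in the latent space.

    index maps item names to latent vectors; query is a latent vector of
    the same dimensionality."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = lambda v: sum(x * x for x in v) ** 0.5
        return dot / (norm(a) * norm(b))

    ranked = sorted(index.items(), key=lambda item: cosine(query, item[1]), reverse=True)
    return [name for name, _ in ranked[:k]]
```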
Results
Inspecting the top performing prediction programs
We present the main points for the three top-performing AE programs across multiple experiments for the prediction model. Table 1 summarises the differences from the base program (seed).
- After launching numerous AE experiments using different input datasets and refining the system prompt accordingly, we chose the three best-performing variants, referred to henceforth as Centroid_Loss, Cross_Modal_Attn, and Focal_Loss (Table 1).
- All three AE-discovered variants abandoned the base model’s shared encoder design, opting instead for separate, identical encoders for each input modality — suggesting the evolutionary search consistently converged on specialised, non-shared representations, whereas the shared design had originally been chosen to balance network capacity with data size.
- Each variant introduced a distinct custom loss function replacing the base contrastive loss: Centroid_Loss uses Centroid Loss, Cross_Modal_Attn uses a Hyper-spherical centroid loss (used for inter-sample and intra-sample alignment), and Focal_Loss a Cosine Centroid Attraction loss (on the positive class) — indicating that loss function innovation was the primary lever AE exploited to improve performance.
- Cross_Modal_Attn introduced the most complex architectural changes: beyond the loss, it added a gating layer on the encoder, label smoothing, and an entanglement-based classifier with cross-modal attention — making it the most structurally different from the base model among the three. By contrast, Centroid_Loss focused on regularisation and normalisation refinements: its changes (layer norms and different activation functions) are relatively conservative compared to Cross_Modal_Attn and Focal_Loss, suggesting that even lightweight, targeted modifications discovered by AE can yield meaningful gains.
- Focal_Loss likewise innovated beyond the loss, incorporating Focal loss (targeting class imbalance) and batch normalisation in the classifier — showing AE’s ability to co-evolve both loss and architecture simultaneously.
- Hyper-parameter tuning was itself part of the evolved solution: all three variants had learning rate and network dimensions evolved by AE, while Centroid_Loss and Focal_Loss additionally evolved weight decay (optimizer) — demonstrating that AE treated hyper-parameters as first-class search variables, not just fixed settings. We should note here that hyper-parameter tuning on our base model was done using grid search.
- The three variants were evolved on different target datasets (Centroid_Loss & Cross_Modal_Attn on V16, Focal_Loss on V17), reflecting a deliberate strategy of running separate AE experiments tailored to datasets of varying difficulty, rather than a single universal optimisation. Nevertheless, Centroid_Loss and Cross_Modal_Attn perform similarly across datasets (Table 2), indicating robustness.
| | base model | Centroid_Loss | Cross_Modal_Attn | Focal_Loss |
|---|---|---|---|---|
| encoder layer | shared + dedicated for GEO | separate identical encoders | separate identical encoders | separate identical encoders |
| extra Loss | contrastive with margin | Centroid Loss | Hyper-spherical centroid loss | Cosine Centroid Attraction loss |
| changes | – | layer norms | gating layer on encoder, label smoothing in Loss | Focal loss (for class imbalance) |
| Classifier | MLP | different activation functions | entanglement-based with cross-modal Attention | added batch normalization |
| Hyper-param tuning | grid search | LR, network dims, weight decay | LR, network dims | LR, network dims, weight decay |
| Target dataset | – | V16 | V16 | V17 |
Results on Prediction
| dataset | metric | base model | Centroid_Loss | Cross_Modal_Attn | Focal_Loss |
|---|---|---|---|---|---|
| V15 | avg F1-score | 90.22% | 92.65% | 93.09% | 89.60% |
| easy, imbalanced | NEG F1-score | 90.45% | 92.20% | 92.12% | 90.70% |
| | AVE F1-score | 97.50% | 98.02% | 98.16% | 96.85% |
| | POS F1-score | 82.89% | 87.74% | 88.99% | 81.25% |
| V16 | avg F1-score | 78.40% | 85.92% | 85.72% | 71.22% |
| med, imbalanced | NEG F1-score | 71.57% | 81.30% | 80.54% | 59.91% |
| | AVE F1-score | 95.00% | 96.53% | 96.41% | 88.79% |
| | POS F1-score | 68.61% | 79.92% | 80.20% | 64.96% |
| V17 | avg F1-score | 30.32% | 30.34% | 33.33% | 34.91% |
| hard, imbalanced | NEG F1-score | 0% | 0% | 0.26% | 15.83% |
| | AVE F1-score | 90.98% | 90.97% | 90.04% | 63.50% |
| | POS F1-score | 0% | 0.06% | 9.70% | 25.39% |
| V25 | avg F1-score | 86.48% | 89.75% | 89.48% | 81.65% |
| balanced | NEG F1-score | 89.37% | 91.30% | 91.30% | 87.42% |
| | AVE F1-score | 82.78% | 86.66% | 86.16% | 71.79% |
| | POS F1-score | 87.30% | 91.28% | 90.98% | 85.73% |
| V26 | avg F1-score | 85.28% | 89.11% | 87.68% | 79.30% |
| balanced | NEG F1-score | 89.75% | 91.11% | 88.88% | 83.49% |
| | AVE F1-score | 80.90% | 86.02% | 84.43% | 67.46% |
| | POS F1-score | 85.18% | 90.22% | 89.71% | 86.95% |
| Real | avg F1-score | 63.00% | 71.00% | 71.12% | 57.44% |
| imbalanced | NEG F1-score | 52.00% | 63.74% | 62.78% | 56.53% |
| | AVE F1-score | 82.00% | 85.21% | 85.37% | 59.70% |
| | POS F1-score | 56.00% | 64.33% | 65.20% | 56.10% |
| | Accuracy | 72.00% | 77.11% | 77.37% | 57.75% |
Key Findings
- AE-evolved variants consistently outperform the already well-performing base model across almost all datasets and metrics: Centroid_Loss and Cross_Modal_Attn show gains on V15, V16, V25, and V26, confirming that AE’s evolutionary search reliably identifies architectures superior to our baseline.
- Cross_Modal_Attn is the best performer on easy/medium synthetic data (V15, V16): It leads on avg F1 for V15 (93.09% vs. base 90.22%) and ties Centroid_Loss on V16 (85.72% vs. 85.92%), while also achieving the highest POS F1 on V16 (80.20% vs. base 68.61%) — a striking +11.6pp improvement on the hardest-to-classify minority class.
- Focal_Loss is uniquely effective on the hardest dataset (V17): While the base model and Centroid_Loss score 0% on both NEG and POS F1 (essentially collapsing to a degenerate classifier), Focal_Loss achieves 15.83% NEG and 25.39% POS F1 — breaking through a performance floor that the other variants could not overcome.
- The minority class (NEG/POS) gains are consistently larger than the average class (AVE) gains: for example on V16, the base AVE F1 is already 95.00% and Centroid_Loss only improves it to 96.53%, whereas its POS F1 jumps from 68.61% to 79.92%. This pattern repeats across datasets, indicating AE’s variants are particularly effective at addressing class imbalance. This is particularly important as correct identification of the positive class (well-performing campaigns) is critical for the recommendation model.
- Centroid_Loss and Cross_Modal_Attn show similar and robust performance across datasets: Despite being evolved independently on V16, both variants score comparably on V15, V16, V25, and V26, suggesting AE converged on a similarly generalizable solution space from different starting conditions — evidence of stability in the evolutionary search.
- Centroid_Loss is the strongest all-round performer on balanced/moderate synthetic data (V25, V26): it leads avg F1 on both V25 (89.75% vs. base 86.48%) and V26 (89.11% vs. base 85.28%), consistently outperforming Cross_Modal_Attn and Focal_Loss on these datasets. By contrast, Focal_Loss underperforms on all datasets except V17, suggesting it was over-specialised for the hard imbalanced regime and does not generalise well to easier distributions.
- Gains on real data are the most practically significant: On the real-world imbalanced dataset, Centroid_Loss achieves a +8pp avg F1 (71% vs. 63%), +11.74pp NEG F1 (63.74% vs. 52%), +8.33pp POS F1 (64.33% vs. 56%), and +5.11pp accuracy (77.11% vs. 72%) — the largest absolute improvements across all tested datasets, validating that AE’s gains hold on actual production data. Cross_Modal_Attn attains a comparable improvement as well.
Results on Recommendation
| dataset | metric | base Predictor/ base Recommender | AE Predictor/ base Recommender | AE Predictor/ AE Recommender |
|---|---|---|---|---|
| V15 easy, imbalanced | avg normalized score | 0.4358 | 0.4641 | 0.5019 |
| V16 med, imbalanced | avg normalized score | 0.3094 | 0.3397 | 0.3973 |
| V17 hard, imbalanced | avg normalized score | 0.0 | 0.2915 | 0.3615 |
| V25 balanced | avg normalized score | 0.5101 | 0.5793 | 0.6055 |
| V26 balanced | avg normalized score | 0.5169 | 0.5535 | 0.5857 |
Key Findings
- Combining AE-evolved components at both stages yields the strongest results across all datasets: The fully AE pipeline (AE Predictor / AE Recommender) achieves the highest average normalised score on all five tested datasets — 0.5019 on V15, 0.3973 on V16, 0.3615 on V17, 0.6055 on V25, and 0.5857 on V26 — confirming that the gains from prediction and recommendation evolution are cumulative.
- The AE Predictor alone delivers meaningful gains over the baseline pipeline: Even when paired with the base Recommender, replacing the base Predictor with its AE-evolved counterpart improves the normalised score by +6.5% on V15 (0.4358 → 0.4641), +9.8% on V16 (0.3094 → 0.3397), +13.6% on V25 (0.5101 → 0.5793), and +7.1% on V26 (0.5169 → 0.5535), and lifts V17 from a score of 0.0 to 0.2915 — demonstrating that prediction quality is a critical upstream bottleneck for recommendation performance.
- The hardest dataset (V17) shows the most dramatic absolute improvement: The base pipeline scores 0.0 on V17, meaning it produces no valid recommendations under the most challenging data regime. The AE Predictor alone breaks this failure floor (0.2915), and the fully AE pipeline extends it further to 0.3615.
- The recommendation model benefits from AE evolution beyond what better predictions alone provide: On every dataset, the jump from (AE Predictor / base Recommender) to (AE Predictor / AE Recommender) is consistent — +0.0378 on V15, +0.0576 on V16, +0.0700 on V17, +0.0262 on V25, and +0.0322 on V26 — indicating that the recommendation model itself harbours independent optimisation headroom that AE successfully exploits.
- Within the imbalanced series (V15 → V16 → V17), gains scale with dataset difficulty: The absolute improvement of the fully AE pipeline over the all-base baseline grows as difficulty increases — +0.0661 on V15, +0.0879 on V16, and +0.3615 on V17 (from zero). This suggests AE’s evolutionary search is particularly valuable in challenging, noisy data regimes where conventional optimisation approaches stall.
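The deltas quoted in the findings above can be recomputed directly from the table, which makes explicit how the gain decomposes into a predictor contribution and a recommender contribution:

```python
# Scores copied from the recommendation results table:
# dataset -> (base Pred/base Rec, AE Pred/base Rec, AE Pred/AE Rec)
scores = {
    "V15": (0.4358, 0.4641, 0.5019),
    "V16": (0.3094, 0.3397, 0.3973),
    "V17": (0.0,    0.2915, 0.3615),
    "V25": (0.5101, 0.5793, 0.6055),
    "V26": (0.5169, 0.5535, 0.5857),
}

for ds, (base, ae_pred, ae_full) in scores.items():
    pred_gain = ae_pred - base       # from evolving the Predictor alone
    rec_gain = ae_full - ae_pred     # additional gain from the AE Recommender
    total = ae_full - base
    rel = f"+{100 * total / base:.1f}%" if base else "n/a (base score is 0)"
    print(f"{ds}: predictor +{pred_gain:.4f}, recommender +{rec_gain:.4f}, "
          f"total +{total:.4f} ({rel} relative)")
```

Running this reproduces the per-stage jumps cited above (e.g. +0.0378 predictor-to-full on V15 and a +0.0700 recommender contribution on V17).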
Conclusion
- The results are compelling: the evolved models achieved an improvement in prediction accuracy of up to 10% over the baseline model.
- Prediction: No single variant dominates across all datasets: Cross_Modal_Attn leads on V15, Centroid_Loss leads on V25/V26/Real, and Focal_Loss is indispensable for V17 — implying that for a production deployment spanning diverse data regimes, an ensemble or dataset-aware model selection strategy would be optimal.
- Recommendation: The AE-evolved prediction model alone delivers meaningful gains over the base recommendation pipeline; combining the AE-evolved prediction and recommendation models improves recommendation scores further still.
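The dataset-aware model selection suggested in the conclusion could be as simple as routing on a cheap data statistic such as class balance. A hypothetical sketch, where the imbalance threshold and the hard-regime flag are illustrative assumptions rather than tuned values, and the routing simply mirrors which variant led on each regime in the results above:

```python
def select_variant(pos_fraction, hard_regime=False):
    """Route a dataset to the AE variant that led on its regime.

    Focal_Loss was the only variant to break the V17 floor (hard,
    imbalanced); Centroid_Loss led on the balanced sets (V25/V26);
    Cross_Modal_Attn led on the easy imbalanced set (V15).
    The 0.35-0.65 balance window is an illustrative threshold.
    """
    if hard_regime:
        return "Focal_Loss"
    if 0.35 <= pos_fraction <= 0.65:  # roughly balanced classes
        return "Centroid_Loss"
    return "Cross_Modal_Attn"
```

In production, `hard_regime` would itself need an estimator (for instance, a validation-set check for degenerate baseline behaviour); an ensemble over the three variants is the natural alternative when such a signal is unreliable.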
This project was a collaboration between the WPP Research team, including Anastasios Tsourtis and Theodoros Lappas, and the AI for Science team at Google Cloud, including (but not limited to) Kartik Sanu, Laurynas Tamulevičius, Nicolas Stroppa, Chris Page, Gary Ng, John Semerdjian, Skandar Hannachi, Vishal Agarwal, Anant Nawalgaria, and Gabriela Hernandez Larios, together with partners at Google DeepMind.
References
- Novikov, A., Vũ, N., Eisenberger, M., Dupont, E., Huang, P.-S., Wagner, A. Z., Shirobokov, S., Kozlovskii, B., Ruiz, F. J. R., Mehrabian, A., Kumar, M. P., See, A., Chaudhuri, S., Holland, G., Davies, A., Nowozin, S., Kohli, P., & Balog, M. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131 [cs.AI]. https://arxiv.org/abs/2506.13131
- Mouret, J.-B., & Clune, J. (2015). Illuminating search spaces by mapping elites. arXiv:1504.04909 [cs.AI]. https://arxiv.org/abs/1504.04909
- Romera-Paredes, B., Barekatain, M., Novikov, A., Balog, M., Kumar, M. P., Dupont, E., Ruiz, F. J. R., Ellenberg, J. S., Wang, P., Fawzi, O., Kohli, P., & Fawzi, A. (2024). Mathematical discoveries from program search with large language models. Nature, 625(7995), 468–475. https://doi.org/10.1038/s41586-023-06924-6
- Tanese, R. (1989). Distributed genetic algorithms for function optimization (Doctoral dissertation). University of Michigan.


