Author: Anastasios Tsourtis

  • AlphaEvolve Pod


    Marketing teams struggle to translate past campaign performance data into actionable future decisions, and while Neural Network models can help, manual experimentation to improve them is slow, costly, and quickly hits a ceiling. To overcome this, we leveraged AlphaEvolve (AE)—Google DeepMind’s Gemini-powered agentic framework—which autonomously proposes, evaluates, and evolves model architectures in an iterative loop, removing the need for manual trial-and-error. The result: up to 10% improvement in prediction accuracy and up to 7% in recommendation scores over our competitive baselines, achieved in a fraction of the time, and positioning WPP with a first-mover advantage through early access to this technology.

    If you don’t care about the technical details, read our blog post instead.

    Introduction – Scope of this work

    Google AlphaEvolve (AE) [1] is a Gemini-powered agentic framework that reframes model development as an evolutionary search problem. Rather than relying on static optimisation or manual experimentation, AE continuously improves existing algorithms by executing an iterative “generate, test, and refine” loop: at each step, candidate programs are proposed by a large language model, evaluated against a user-defined metric, and fed back into an evolving database of solutions. As a result, the system is capable of autonomously exploring a vast space of model configurations in a semi-supervised manner, without requiring explicit enumeration of all candidate designs.

    The target of this work is to improve two core components of our marketing campaign intelligence pipeline: a prediction model and a recommendation model. Both models had already reached a highly competitive baseline; however, further progress had stalled.

    Despite sustained effort — encompassing bibliographic research, trial-and-error experimentation, and systematic fine-tuning — incremental improvements remained below 1%, and in certain noisy data regimes, accuracy degraded rather than improved. This plateau is particularly consequential in the context of marketing campaign optimisation, where even gains of a few basis points in prediction accuracy can translate into thousands of dollars of savings. Given the scale of the model search space and the cost of manual exploration, AE presents a unique opportunity to dramatically accelerate the discovery of superior model architectures.

    The results of these experiments are compelling. The AE-evolved prediction models achieved an improvement in prediction accuracy of up to 10% over the baseline model. Across both neural and non-neural baselines, as well as fine-tuned Gemma variants, AE-evolved prediction models achieved consistently higher accuracy, validating the hypothesis that evolutionary search can identify architectures that manual experimentation fails to reach in a comparable time frame.

    This report is structured as follows: Section 2 describes how AlphaEvolve works, covering its inputs, inner workings, and output format. Section 3 details the best practices we identified through extensive experimentation. Section 4 describes the base models used as seed programs to be evolved. Section 5 presents and discusses the quantitative results across synthetic and real-world datasets.

    How AE works

    AlphaEvolve is a coding agent that orchestrates a fully autonomous, distributed pipeline of computations — including queries to large language models — to produce algorithms that address a user-specified task. Implemented as an asynchronous pipeline using Python’s asyncio library, the system operates as an evolutionary algorithm: at each iteration, it proposes candidate programs, evaluates them against a target metric, and progressively improves its population of solutions. The result is a search procedure that continuously refines programs toward higher scores without requiring manual intervention between iterations.

    Figure 1 shows the AlphaEvolve pipeline. Note the bidirectional arrow: the program database both supplies programs for constructing new prompts and is updated with promising solutions as iterations proceed.

    Figure 1 \ (Image taken from [1]) AlphaEvolve discovery process. The user provides an initial program (with components to evolve marked), evaluation code, and optional configurations. AlphaEvolve then initiates an evolutionary loop. The Prompt sampler uses programs from the Program database to construct rich prompts. Given these prompts, the LLMs generate code modifications (diffs), which are applied to create new programs. These are then scored by Evaluators, and promising solutions are registered back into the Program database, driving the iterative discovery of better and better programs.

    Core pipeline inputs

    Launching an AE experiment requires six key inputs.

    The first is a seed program: an existing, functional codebase with appropriately annotated functions and docstrings, which serves as the starting point for the evolutionary search. AE provides flexibility in its abstraction level, allowing it to tackle problems by either directly evolving the solution itself or by evolving a constructor function that builds the solution (which is often more effective for highly symmetric problems). Users can also provide function templates containing only a detailed docstring and a constant return value, prompting AE to generate the full implementation from scratch.

    The second is a target metric: a scalar value that AE will attempt to maximise iteratively. In our prediction model experiments, we used the macro-average F1-score rather than accuracy, as it proved more effective at steering the search toward better-generalising solutions across imbalanced classes. When multiple metrics are relevant, a weighted linear combination can be used to indirectly optimise them simultaneously.
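    As a concrete sketch of this target-metric design, the following computes macro-average F1 and optionally blends in extra scalars via a weighted linear combination (class labels and weights here are illustrative, not our production values):

```python
def f1_per_class(y_true, y_pred, label):
    """F1-score for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred, labels=("NEG", "AVE", "POS")):
    """Unweighted mean of per-class F1: robust to class imbalance."""
    return sum(f1_per_class(y_true, y_pred, c) for c in labels) / len(labels)

def target_metric(y_true, y_pred, aux_scores=(), weights=()):
    """Scalar AE target: macro F1, optionally blended with other scalars."""
    score = macro_f1(y_true, y_pred)
    for w, s in zip(weights, aux_scores):
        score += w * s   # weighted linear combination of secondary metrics
    return score
```

    Because AE maximises a single scalar, any secondary quality (latency, simplicity) has to be folded in via the weights rather than optimised separately.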

    The third input is an evaluation function that takes a candidate program and returns the target metric value. Because it is called iteratively to check every new candidate, computational efficiency is critical. In our case, this function fully trains the candidate neural architecture and measures its macro F1-score on a held-out set. The runtime of this function is variable: it depends on the architectural complexity of the candidate (e.g., multi-head attention blocks are slower to compute than linear layers), and because batch size and number of epochs are also subject to evolution, training times can fluctuate significantly across candidates. Early stopping criteria are applied to bound evaluation cost.
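    The shape of such an evaluation function can be sketched as follows; `candidate_fit`, the epoch cap, and the patience value are illustrative assumptions standing in for the real per-candidate training code:

```python
import time

def evaluate(candidate_fit, max_seconds=600.0, patience=3):
    """Evaluation-function sketch: train a candidate, return its target metric.

    `candidate_fit` is a hypothetical stand-in for the evolved program's
    training step: called once per epoch, it returns the held-out macro F1.
    Early stopping bounds runtime, since evolved batch sizes and epoch
    counts make per-candidate cost highly variable.
    """
    start, best, bad_epochs = time.monotonic(), 0.0, 0
    for epoch in range(100):                        # hard epoch cap
        score = candidate_fit(epoch)                # one epoch of train + eval
        if score > best:
            best, bad_epochs = score, 0
        else:
            bad_epochs += 1
        if bad_epochs >= patience:                  # early stopping
            break
        if time.monotonic() - start > max_seconds:  # wall-clock bound
            break
    return best                                     # scalar target for AE
```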

    The fourth input is a system prompt that guides the direction of evolution: it provides explicit suggestions on which aspects of the code should be evolved and how, for instance descriptions of loss functions or search paths to consider. Together with the seed code, the system prompt forms the initial input to the program database sampling process: in effect, it determines both where the search begins and in which direction it is steered.

    The fifth input defines the evolution blocks: contiguous code regions marked with special comments (# EVOLVE-BLOCK-START / # EVOLVE-BLOCK-END) that AE is permitted to modify, while the rest of the codebase remains intact. Carefully scoping these blocks is important — they must be consistent with inter-function dependencies and library imports in the surrounding fixed code. Additional constraints, such as the set of Python libraries available or explicit enumeration of architectural ideas to try, can be incorporated via docstrings or the system prompt.
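    For illustration, a seed file might look like this minimal sketch; the function names, hyper-parameter values, and loss are ours, purely to show how the markers scope the search space:

```python
# Fixed skeleton: imports and the training entry point stay untouched.
import math

# EVOLVE-BLOCK-START
# AE may rewrite anything between these markers: the loss, the
# hyper-parameters, even new helper functions (as long as block scope
# keeps every call site from the fixed code consistent).
BATCH_SIZE = 64        # exposed as a search variable, not a constant
LEARNING_RATE = 1e-3

def loss_fn(logits, target):
    """Per-sample cross-entropy (illustrative placeholder)."""
    z = [math.exp(v) for v in logits]
    return -math.log(z[target] / sum(z))
# EVOLVE-BLOCK-END

def train():
    # Fixed code may freely call into the evolved block.
    return loss_fn([2.0, 0.5, 0.1], 0), BATCH_SIZE, LEARNING_RATE
```

    If AE renames or removes `loss_fn` without the block covering `train()`, the fixed skeleton breaks, which is exactly the interface-consistency failure discussed below in the best practices.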

    The sixth and final input is a stopping criterion, either a maximum number of iterations or a target metric threshold, that bounds the experiment.

    Inner workings

    At the core of AlphaEvolve is an evolutionary database inspired by the MAP-elites algorithm [2] and island-based population models [3, 4]. Candidate programs and their scores are stored and curated in this database, which is continuously updated as new solutions are evaluated. The database is designed to balance two competing objectives: exploration, by maintaining diversity across the solution space, and exploitation, by prioritising refinement of the highest-scoring programs.
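    A toy sketch of such an island-based database follows; this is our simplification for intuition only, not AE's actual implementation, and the elite count, migration probability, and tournament sampling are assumptions:

```python
import random

class ProgramDatabase:
    """Toy island-model database: each island keeps its own elites
    (exploitation); sampling occasionally jumps islands (exploration),
    preserving diversity across the population."""

    def __init__(self, n_islands=4, top_k=8, migrate_p=0.1, seed=0):
        self.islands = [[] for _ in range(n_islands)]
        self.top_k, self.migrate_p = top_k, migrate_p
        self.rng = random.Random(seed)

    def add(self, program, score, island):
        pool = self.islands[island]
        pool.append((score, program))
        pool.sort(key=lambda t: t[0], reverse=True)
        del pool[self.top_k:]                     # keep only the elites

    def sample(self, island):
        if self.rng.random() < self.migrate_p:    # explore another island
            island = self.rng.randrange(len(self.islands))
        pool = self.islands[island] or [p for isl in self.islands for p in isl]
        # bias toward high scores: best of two random draws (tournament)
        a, b = self.rng.choice(pool), self.rng.choice(pool)
        return max(a, b, key=lambda t: t[0])[1]
```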

    To generate diverse candidate proposals, AE relies on a mixture of large and small language models, in our experiments a combination of Gemini 3 Pro and Gemini 3 Flash, which sample from the current program database alongside the seed code and system instructions to produce modified candidate programs. Gemini 3 Flash, with its lower latency, enables a higher rate of candidate generation, increasing the volume of ideas explored per unit of time, while Gemini 3 Pro provides occasional higher-quality suggestions that can significantly advance the search.

    When an LLM is asked to modify existing code within a larger codebase, it outputs changes as a sequence of SEARCH/REPLACE diff blocks in a structured format — specifying exactly which code segment to replace and with what — allowing for targeted, composable modifications. In cases where the evolved code is very short or a complete rewrite is appropriate, AE can instead instruct the LLM to output the full code block directly.
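    The diff-application step can be sketched as follows; the exact marker syntax is our assumption, modeled on the SEARCH/REPLACE format described above:

```python
def apply_diff(source: str, diff: str) -> str:
    """Apply SEARCH/REPLACE blocks to source text (illustrative format):

        <<<<<<< SEARCH
        ...exact code to find...
        =======
        ...replacement...
        >>>>>>> REPLACE
    """
    blocks = diff.split("<<<<<<< SEARCH\n")[1:]
    for block in blocks:
        search, rest = block.split("\n=======\n", 1)
        replace = rest.split("\n>>>>>>> REPLACE", 1)[0]
        if search not in source:
            raise ValueError(f"search text not found: {search!r}")
        source = source.replace(search, replace, 1)   # targeted, composable
    return source
```

    Requiring an exact-match search segment is what makes the edits targeted and composable: a stale or hallucinated segment simply fails to apply rather than corrupting unrelated code.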

    Two additional mechanisms further enhance the quality of the search:

    • The first is meta-prompt evolution: AE co-evolves the system prompt itself through a separate LLM step that generates and refines meta-instructions in a dedicated database alongside the candidate programs. This allows the guidance given to the code-generation models to improve over time, potentially surpassing what a human prompter could craft manually — a finding confirmed in ablation experiments reported in the original paper.
    • The second is LLM-generated feedback: beyond the user-provided evaluation function, AE can invoke separate LLM calls to assess qualitative properties of candidate programs that are difficult to measure programmatically, such as code simplicity or readability. These soft scores can steer evolution or filter out undesirable candidates. In our experiments, we observed a beneficial side effect of this mechanism: AE spontaneously added informative comments to complex code segments within evolution blocks that it chose not to modify, improving both readability and the quality of subsequent evolutionary steps.

    Output

    The primary output of an AE experiment is a functional program: the candidate that achieves the highest target score across all evaluated iterations. Because LLMs occasionally hallucinate, some candidate programs may fail to execute; however, AE handles this gracefully: execution is isolated per candidate, and failed programs are assigned a default score (see next section) while the asynchronous pipeline continues uninterrupted.

    In our experiments, we ran each AE session for up to 1,000 evolutionary iterations, with 4 concurrent candidate evaluations at each step. The final evolved programs, together with all intermediate candidates, can be retrieved post-experiment using their session_id, experiment_id, and program_id identifiers, enabling retrospective inspection of the evolutionary trajectory and diff-level comparison against the seed program.

    Figure 2 \ Example of an evolution block suggestion. Here AE suggested a parametric non-linear combination of the Cross Entropy loss addressing class imbalance in dataset V17.

    Best practices when using AE

    AlphaEvolve has a low barrier to entry: the user need only supply a seed program, an evaluation function, and a maximum number of iterations to launch an experiment. In practice, however, the quality of the outcomes depends strongly on a set of design choices spanning prompt engineering, code instrumentation, metric design, and infrastructure planning. Based on our own experiments, we distilled the following guidelines for practitioners.

    1. Invest in the system prompt

    The system prompt is the single most impactful lever available to the user. A prompt that names concrete directions to explore — particular loss families, regularisation strategies, architectural motifs, or hyper-parameter ranges — consistently produces more diverse and higher-scoring candidates than a short, generic instruction. Experimentally, a vague or ambiguous prompt tended to generate overlapping candidate programs with low structural diversity, which hinders the evolutionary search even though AE’s database design is intended to discourage convergence to local minima.

    This observation is corroborated by the paper’s ablation experiments: removing explicit context from the prompt caused a significant drop in discovery performance across both the matrix multiplication and kissing number tasks. Furthermore, AE supports meta-prompt evolution, as we mentioned earlier. In practice, this means the system can progressively improve its own guidance — but this process is seeded by the human-authored prompt, so a strong starting prompt remains essential. Beyond free-text instructions, the prompt can be enriched with explicit context in the form of equations, code snippets, or even PDF attachments of relevant literature.
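    To make this concrete, a system prompt for our setting might look like the following sketch; the wording and level of detail are illustrative assumptions, not the prompt we actually used:

```python
# Illustrative system prompt (hypothetical wording, not our production prompt).
SYSTEM_PROMPT = """
You are evolving a neural classifier for 3-class campaign performance
prediction (NEG / AVE / POS) on imbalanced data.

Directions worth exploring:
- Loss families: focal loss, centroid/alignment losses, label smoothing,
  weighted combinations of classification and contrastive terms.
- Regularisation: layer norm, dropout placement, weight decay schedules.
- Architecture: per-modality encoders, gating, cross-modal attention.
- Hyper-parameters: learning rate in [1e-4, 1e-2], batch size in 32..256.

Only modify code inside the EVOLVE-BLOCK markers. Keep every existing
function signature callable from the fixed skeleton.
"""
```

    Note how the prompt names concrete loss families and ranges rather than saying "improve the model": in our runs this specificity was the difference between diverse candidates and near-duplicates.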

    2. Place evolution blocks deliberately

    Evolution blocks (delimited by # EVOLVE-BLOCK-START / # EVOLVE-BLOCK-END markers) define the search space (Figure 2). Code outside these markers is treated as a fixed skeleton. Several considerations arise in practice:

    • Scope blocks to the right level of abstraction. AE can evolve a single function, several functions together, or an entire codebase. Broader blocks give AE more latitude but also increase the chance of inconsistencies across function boundaries. Narrower blocks constrain search but keep the skeleton stable.
    • Ensure interface consistency. If the system prompt permits AE to introduce new functions, the evolution block must cover all call sites; otherwise, the fixed code will reference function names that do not exist in the evolved program. This was one of the most common sources of crashing candidates in our experiments.
    • Expose hyper-parameters. Allowing AE to evolve batch size, learning rate, network dimensions, and weight decay treats these as first-class search variables rather than fixed settings. All three of our best-performing variants had hyper-parameters discovered by AE rather than by grid search.
    • Consider the abstraction level carefully. The paper notes that for problems with highly symmetric solutions, evolving a constructor function (which builds the solution from scratch) tends to be more effective than evolving the solution directly. For asymmetric problems — such as our neural architecture — evolving the implementation itself is usually more appropriate, so we opted for that.

    3. Design the evaluation function with care

    The evaluation function is the primary feedback signal for the evolutionary loop. Several properties matter:

    • Speed. Each candidate program must be evaluated before its score can propagate back into the database. Slow evaluations directly limit the number of generations that can be completed within a fixed compute budget. In our case, training a neural network on each candidate introduced variable evaluation times depending on architectural complexity — a cost we accepted in exchange for richer feedback.
    • Default score for failures. If higher metric values are better and valid scores are non-negative, assign a default score of −1 (or −∞) to candidates that crash or produce invalid output. This prevents degenerate programs from polluting the database with artificially neutral scores.
    • Metric choice. Use a single scalar as the primary optimisation target. For imbalanced classification, we found that positive-class F1-score led to better evolved solutions than accuracy, because it forced AE to address minority-class performance. When multiple qualities matter simultaneously, a weighted linear combination of scalars can be used. Importantly, the paper notes that optimising multiple metrics often improves single-metric results as well: diverse, high-performing programs across different criteria expose the LLMs to a broader solution space, increasing the chance of discovering novel approaches for the target metric.
    • Evaluation cascade. AE supports multi-stage hypothesis testing in which new candidates are first screened on simpler or smaller-scale inputs before being subjected to the full evaluation. This prunes clearly faulty or unpromising programs early and is particularly valuable when full evaluations are expensive. We observed that target scores improved gradually but steadily over hundreds of iterations, with solution complexity increasing as the search matured, a pattern consistent with the cascade's role in maintaining a healthy exploration–exploitation balance. Accordingly, run for at least a few hundred iterations: AE's evolutionary database needs time to accumulate a diverse pool of high-quality programs from which future generations can be sampled.
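    Combining the default-score and cascade points, an evaluation wrapper might look like this sketch; `quick_eval`, `full_eval`, and the gate threshold are hypothetical stand-ins (e.g. training on a subsample vs. the full dataset):

```python
FAILURE_SCORE = float("-inf")   # default score for crashing candidates

def cascaded_evaluate(candidate, quick_eval, full_eval, gate=0.3):
    """Two-stage evaluation sketch: cheap screen, then full training run.

    A candidate only earns the expensive evaluation if it clears the
    gate on the cheap one; anything that crashes or falls below the
    gate gets the failure score and the async pipeline moves on.
    """
    try:
        if quick_eval(candidate) < gate:   # prune unpromising programs early
            return FAILURE_SCORE
        return full_eval(candidate)
    except Exception:                      # hallucinated code may simply crash
        return FAILURE_SCORE               # pipeline continues uninterrupted
```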

    4. Manage experiments actively

    • Monitor intermediate candidates. During execution, target scores of successful candidates are visible in real time. Periodically inspecting the diff between the seed program and a high-scoring intermediate candidate, using a tool such as a code editor's diff view (see Figure 2), can reveal whether the system prompt is effective, whether evolution blocks are correctly placed, and whether the candidate programs are structurally sound. Early detection of issues saves significant compute.
    • Always verify solutions manually. LLMs can hallucinate plausible-looking but subtly incorrect code. In one of our experiments, a candidate that achieved a high target score did so by inadvertently overfitting: the missing model.eval() / model.train() calls meant that dropout and batch normalisation were applied incorrectly at evaluation time. Human review of the top-performing programs before committing to them is essential.
    • Seed subsequent experiments from the best prior result. Once an experiment concludes, launching a new run seeded with the best-performing program — while retaining the same system prompt — allows AE to continue refining from a strong starting point rather than restarting from the original seed.

    5. Plan infrastructure appropriately

    Choose a Virtual Machine (VM) whose hardware matches the demands of the evaluation function. For neural architecture search, a GPU-accelerated machine substantially reduces per-evaluation training time. More critically, evolved programs can introduce architectural changes — wider layers, additional attention heads, larger batch sizes — that increase memory consumption beyond what the seed program required. Out-of-memory errors during evaluation stall the experiment without providing useful signal. Sizing RAM and VRAM conservatively against the maximum plausible candidate complexity, rather than the seed’s footprint, prevents this failure mode.

    • The AE API can be invoked like any other Google Cloud Platform (GCP) service. Once a user account with the appropriate permissions is set up, any user authenticated via the gcloud CLI can call the API. There are two options for hosting an experiment: 1) create, start, and connect to a virtual machine (VM) on a GCP project so that cloud resources are utilised (recommended), or 2) run locally on a proprietary machine.

    6. Guard against memory leaks in the fixed code skeleton

    AE executes each candidate program by adding it to an asynchronous queue, where it is run inside the user-provided evaluation function. A key responsibility boundary applies here: AE is only in control of the code inside the evolution blocks. Everything outside — the fixed skeleton, data loading, model initialisation, and the evaluation function itself — is entirely the user’s code and runs as-is for every single candidate evaluation across potentially thousands of iterations.

    This makes memory management in the fixed code important. In our experiments, we encountered a case where a fixed function outside the evolution blocks was not properly releasing memory objects after each evaluation call. Because this code ran once per candidate — and experiments run for hundreds or thousands of iterations — the leak caused memory objects to accumulate steadily, eventually leading to experiment failure.

    To prevent this, we recommend the following hygiene practices for all code outside evolution blocks:

    • Explicitly empty GPU/CPU caches after each evaluation call (e.g. torch.cuda.empty_cache() in PyTorch).
    • Delete intermediate objects that are no longer needed using del, particularly large tensors or model instances created within the evaluation loop.
    • Avoid global variable declarations outside the evaluation function — globals persist across calls and are a common source of unintended object retention.
    • Add explicit garbage collection calls (e.g. import gc; gc.collect()) at the end of the evaluation function, especially when dealing with large objects or when GPU memory is constrained.

    The broader principle is to treat the evaluation function as a self-contained, stateless unit: everything it allocates should be released before it returns.
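    Putting these hygiene rules together, a stateless evaluation wrapper could look like the following sketch; `build_and_train` is a hypothetical callable standing in for the real training code, and the guarded torch import keeps the sketch usable on CPU-only machines:

```python
import gc

def evaluate_candidate(build_and_train):
    """Stateless evaluation wrapper: everything allocated here is released.

    `build_and_train` is a hypothetical callable that constructs the model,
    trains it, and returns (model, score). Only the scalar score escapes.
    """
    model, score = build_and_train()
    del model                      # drop large objects explicitly
    try:
        import torch               # optional: free cached GPU memory
        torch.cuda.empty_cache()
    except ImportError:
        pass
    gc.collect()                   # force collection of freed objects
    return score                   # no globals, no retained state
```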

    7. Choose codebase style

    The codebase can be either self-contained in a Jupyter notebook or organised repository-style. In the first case the seed program is provided as a long string, whereas in the latter the user indicates the file(s) to be evolved. As mentioned, evolution blocks are indicated by special comment markers (# EVOLVE-BLOCK-START and # EVOLVE-BLOCK-END). Repository-style codebases are easier to track with version control and allow candidate programs to be inspected via IDE file-diff functionality.

    Base models

    Prediction model

    The base prediction model is a neural classifier that takes pre-computed feature embeddings as input and assigns each sample to one of three campaign performance classes: positive, average, or negative. Architecturally, it consists of modality-specific encoding layers that project the input embeddings into a lower-dimensional latent space, followed by a fusion layer that combines these projections to produce the final class prediction. Through systematic experimentation across a range of loss functions and their combinations, we identified a weighted mixture of classification and alignment losses as the best-performing training objective. The model is optimised using AdamW (Adam with weight decay) and employs early stopping to prevent overfitting. To rigorously assess the strengths and limitations of each model variant, we constructed a suite of synthetic datasets spanning a range of difficulty levels — varying in complexity, class cardinality, dataset size, and noise level.
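    As a rough illustration of the architecture described above: modality-specific projections into a latent space, followed by a fusion layer producing three class logits. The dimensions, random weights, and tanh/argmax choices are ours, purely for shape-checking, not the production model:

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(dim_in, dim_latent):
    """One modality-specific projection (weights random for illustration)."""
    return rng.standard_normal((dim_in, dim_latent)) * 0.1

def predict(embeddings, encoders, w_fuse):
    """Project each modality, fuse, and classify into {NEG, AVE, POS}.

    Assumed shapes: each modality embedding is (batch, dim_in); the
    fusion layer maps the concatenated latents to 3 class logits.
    """
    latents = [np.tanh(x @ W) for x, W in zip(embeddings, encoders)]
    fused = np.concatenate(latents, axis=1)   # (batch, n_modalities * latent)
    logits = fused @ w_fuse                   # (batch, 3)
    return logits.argmax(axis=1)              # class index per sample

# Two modalities of dims 16 and 8, latent dim 4 per modality
encs = [encoder(16, 4), encoder(8, 4)]
w_fuse = rng.standard_normal((8, 3)) * 0.1
batch = [rng.standard_normal((5, 16)), rng.standard_normal((5, 8))]
classes = predict(batch, encs, w_fuse)        # 5 predictions in {0, 1, 2}
```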

    Recommendation model

    The recommendation model is used to 'fill in the missing puzzle pieces'. A user provides a partial query with missing feature values, and the model completes the rest with appropriate values based on knowledge of past successful campaigns. The recommendation model relies on the prediction model as an input: this knowledge is distilled into the prediction model's weights. We compile indices for quick look-up and retrieve the top-K recommendations via approximate K-NN search in the projection latent space. A design decision was to run separate, independent optimisation experiments for each model rather than a joint optimisation, making sequential optimisation the natural approach. For the synthetic datasets, where the ground truth is known and represented as a graph, we devised a metric to score candidate recommendations: it accounts for empty recommendations while penalising low-performing ones. We refer to the normalised value of this metric as the avg norm score (Table 3).
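    The top-K retrieval step can be sketched as follows; we use exact cosine K-NN here for brevity, whereas the pipeline itself uses approximate K-NN over precompiled indices:

```python
import numpy as np

def top_k_neighbours(query, index, k=3):
    """Exact cosine K-NN in the latent space (illustrative stand-in for
    the approximate K-NN search used in the actual pipeline).

    `query`: (d,) latent vector for the partial campaign query;
    `index`: (n, d) matrix of past successful campaigns in the same space.
    """
    # cosine similarity between the query and every indexed campaign
    sims = index @ query / (
        np.linalg.norm(index, axis=1) * np.linalg.norm(query) + 1e-12
    )
    return np.argsort(-sims)[:k]    # indices of the top-K recommendations
```

    The returned indices point back into the campaign store, from which the missing feature values are filled in.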

    Results

    Inspecting the top performing prediction programs

    We present the main points of the three top-performing AE programs for the prediction model, drawn from multiple experiments. Table 1 summarises their differences from the base program (seed).

    • After launching a plethora of AE experiments with different input datasets and refining the system prompt accordingly, we selected the three best-performing variants, henceforth named Centroid_Loss, Cross_Modal_Attn and Focal_Loss (Table 1).
    • All three AE-discovered variants abandoned the base model's shared encoder design, opting instead for separate, identical encoders for each input modality. The shared design had originally been chosen to balance network capacity against dataset size; the evolutionary search consistently converged on specialised, non-shared representations instead.
    • Each variant introduced a distinct custom loss function replacing the base contrastive loss: Centroid_Loss uses Centroid Loss, Cross_Modal_Attn uses a Hyper-spherical centroid loss (used for inter-sample and intra-sample alignment), and Focal_Loss a Cosine Centroid Attraction loss (on the positive class) — indicating that loss function innovation was the primary lever AE exploited to improve performance.
    • Cross_Modal_Attn introduced the most complex architectural changes: beyond the loss, it added a gating layer on the encoder, label smoothing, and an entanglement-based classifier with cross-modal attention, making it the most structurally different from the base model among the three. By contrast, Centroid_Loss focused on regularisation and normalisation refinements: its changes (layer norms and different activation functions) are relatively conservative compared to Cross_Modal_Attn and Focal_Loss, suggesting that even lightweight, targeted modifications discovered by AE can yield meaningful gains.
    • Focal_Loss likewise innovated beyond the loss, incorporating Focal loss (targeting class imbalance) and batch normalisation in the classifier, showing AE's ability to co-evolve both loss and architecture simultaneously.
    • Hyper-parameter tuning was itself part of the evolved solution: all three variants had learning rate and network dimensions evolved by AE, while Centroid_Loss and Focal_Loss additionally evolved weight decay (optimizer) — demonstrating that AE treated hyper-parameters as first-class search variables, not just fixed settings. We should note here that hyper-parameter tuning on our base model was done using grid search.
    • The three variants were evolved on different target datasets (Centroid_Loss and Cross_Modal_Attn on V16, Focal_Loss on V17), reflecting a deliberate strategy of running separate AE experiments tailored to datasets of varying difficulty rather than a single universal optimisation. Nevertheless, Centroid_Loss and Cross_Modal_Attn perform similarly across datasets (Table 2), indicating robustness.
    | | base model | Centroid_Loss | Cross_Modal_Attn | Focal_Loss |
    | encoder layer | shared + dedicated for GEO | separate identical encoders | separate identical encoders | separate identical encoders |
    | extra Loss | contrastive with margin | Centroid Loss | Hyper-spherical centroid loss | Cosine Centroid Attraction loss |
    | changes | | layernorms | gating layer on encoder, label smoothing in Loss | Focal loss (for class imbalance) |
    | Classifier | MLP | different activation functions | entanglement-based with cross-modal Attention | added batch normalization |
    | Hyper-param tuning | grid search | LR, network dims, weight decay | LR, network dims | LR, network dims, weight decay |
    | Target dataset | | V16 | V16 | V17 |
    Table 1 \ Prediction model best performing variants discovered using AlphaEvolve. Base model: our own proprietary model serving as seed model. Centroid_Loss, Cross_Modal_Attn, Focal_Loss belong to separate AE experiments where they reached the highest target score after hundreds of iterations.

    Results on Prediction

    | dataset | metric | base model | Centroid_Loss | Cross_Modal_Attn | Focal_Loss |
    | V15 (easy, imbalanced) | avg F1-score | 90.22% | 92.65% | 93.09% | 89.60% |
    | V15 | NEG F1-score | 90.45% | 92.20% | 92.12% | 90.70% |
    | V15 | AVE F1-score | 97.50% | 98.02% | 98.16% | 96.85% |
    | V15 | POS F1-score | 82.89% | 87.74% | 88.99% | 81.25% |
    | V16 (med, imbalanced) | avg F1-score | 78.40% | 85.92% | 85.72% | 71.22% |
    | V16 | NEG F1-score | 71.57% | 81.30% | 80.54% | 59.91% |
    | V16 | AVE F1-score | 95.00% | 96.53% | 96.41% | 88.79% |
    | V16 | POS F1-score | 68.61% | 79.92% | 80.20% | 64.96% |
    | V17 (hard, imbalanced) | avg F1-score | 30.32% | 30.34% | 33.33% | 34.91% |
    | V17 | NEG F1-score | 0% | 0% | 0.26% | 15.83% |
    | V17 | AVE F1-score | 90.98% | 90.97% | 90.04% | 63.50% |
    | V17 | POS F1-score | 0% | 0.06% | 9.70% | 25.39% |
    | V25 (balanced) | avg F1-score | 86.48% | 89.75% | 89.48% | 81.65% |
    | V25 | NEG F1-score | 89.37% | 91.30% | 91.30% | 87.42% |
    | V25 | AVE F1-score | 82.78% | 86.66% | 86.16% | 71.79% |
    | V25 | POS F1-score | 87.30% | 91.28% | 90.98% | 85.73% |
    | V26 (balanced) | avg F1-score | 85.28% | 89.11% | 87.68% | 79.30% |
    | V26 | NEG F1-score | 89.75% | 91.11% | 88.88% | 83.49% |
    | V26 | AVE F1-score | 80.90% | 86.02% | 84.43% | 67.46% |
    | V26 | POS F1-score | 85.18% | 90.22% | 89.71% | 86.95% |
    | Real (imbalanced) | avg F1-score | 63.00% | 71.00% | 71.12% | 57.44% |
    | Real | NEG F1-score | 52.00% | 63.74% | 62.78% | 56.53% |
    | Real | AVE F1-score | 82.00% | 85.21% | 85.37% | 59.70% |
    | Real | POS F1-score | 56.00% | 64.33% | 65.20% | 56.10% |
    | Real | Accuracy | 72.00% | 77.11% | 77.37% | 57.75% |
    Table 2 \ Prediction model score comparison across datasets and model variants. Base model: the seed model. Results are reported as macro-average F1 (avg F1-score) and per-class F1-score (NEG, AVE, POS). Centroid_Loss and Cross_Modal_Attn perform consistently well across datasets, and every variant outperforms the base model on at least one regime. The standard deviation of reported means is ±0.005%.

    Key Findings

    • AE-evolved variants consistently outperform the already well-performing base model across almost all datasets and metrics: Centroid_Loss and Cross_Modal_Attn show gains on V15, V16, V25, and V26, confirming that AE’s evolutionary search reliably identifies architectures superior to our baseline.
    • Cross_Modal_Attn is the best performer on easy/medium synthetic data (V15, V16): It leads on avg F1 for V15 (93.09% vs. base 90.22%) and ties Centroid_Loss on V16 (85.72% vs. 85.92%), while also achieving the highest POS F1 on V16 (80.20% vs. base 68.61%) — a striking +11.6pp improvement on the hardest-to-classify minority class.
    • Focal_Loss is uniquely effective on the hardest dataset (V17): While the base model and Centroid_Loss score 0% on both NEG and POS F1 (essentially collapsing to a degenerate classifier), Focal_Loss achieves 15.83% NEG and 25.39% POS F1 — breaking through a performance floor that the other variants could not overcome.
    • The minority-class (NEG/POS) gains are consistently larger than the gains on the AVE class: for example on V16, the base AVE F1 is already 95.00% and Centroid_Loss only improves it to 96.53%, whereas its POS F1 jumps from 68.61% to 79.92%. This pattern repeats across datasets, indicating AE's variants are particularly effective at addressing class imbalance. This matters because correct identification of the positive class (well-performing campaigns) is critical for the recommendation model.
    • Centroid_Loss and Cross_Modal_Attn show similar and robust performance across datasets: Despite being evolved independently on V16, both variants score comparably on V15, V16, V25, and V26, suggesting AE converged on a similarly generalizable solution space from different starting conditions — evidence of stability in the evolutionary search.
    • Centroid_Loss is the strongest all-round performer on balanced/moderate synthetic data (V25, V26): It leads avg F1 on both V25 (89.75% vs. base 86.48%) and V26 (89.11% vs. base 85.28%), consistently outperforming Cross_Modal_Attn and Focal_Loss on these datasets. On the contrary Focal_Loss underperforms on all datasets except V17, suggesting it was over-specialised for the hard imbalanced regime and does not generalise well to easier distributions.
    • Gains on real data are the most practically significant: On the real-world imbalanced dataset, Centroid_Loss achieves a +8pp avg F1 (71% vs. 63%), +11.74pp NEG F1 (63.74% vs. 52%), +8.33pp POS F1 (64.33% vs. 56%), and +5.11pp accuracy (77.11% vs. 72%) — the largest absolute improvements across all tested datasets, validating that AE’s gains hold on actual production data. Cross_Modal_Attn attains a comparable improvement as well.

    Results on Recommendation

    | dataset | metric | base Predictor / base Recommender | AE Predictor / base Recommender | AE Predictor / AE Recommender |
    |---|---|---|---|---|
    | V15 (easy, imbalanced) | avg normalized score | 0.4358 | 0.4641 | 0.5019 |
    | V16 (med, imbalanced) | avg normalized score | 0.3094 | 0.3397 | 0.3973 |
    | V17 (hard, imbalanced) | avg normalized score | 0.0 | 0.2915 | 0.3615 |
    | V25 (balanced) | avg normalized score | 0.5101 | 0.5793 | 0.6055 |
    | V26 (balanced) | avg normalized score | 0.5169 | 0.5535 | 0.5857 |
    Table 3 | Recommendation model score comparison across datasets and model variants. The average normalized score is based on the ground-truth graph and penalizes negative/empty recommendations (higher is better). Base Predictor is the base model from Table 1. AE Predictor is the best AE model for each dataset (Table 2). AE Recommender is the best AE-evolved candidate based on our Recommendation model. Combining the AE candidate Prediction and Recommendation models yields incremental improvement.

    Key Findings

    • Combining AE-evolved components at both stages yields the strongest results across all datasets: The fully AE pipeline (AE Predictor / AE Recommender) consistently achieves the highest average normalised score on every tested dataset — 0.5019 on V15, 0.3973 on V16, 0.3615 on V17, 0.6055 on V25, and 0.5857 on V26 — confirming that gains from prediction and recommendation evolution are additive.
    • The AE Predictor alone delivers meaningful gains over the baseline pipeline: Even when paired with the base Recommender, replacing the base Predictor with its AE-evolved counterpart improves the normalised score by +6.5% on V15 (0.4358 → 0.4641), +9.8% on V16 (0.3094 → 0.3397), and lifts V17 from a score of 0.0 to 0.2915 — demonstrating that prediction quality is a critical upstream bottleneck for recommendation performance.
    • The hardest dataset (V17) shows the most dramatic absolute improvement: The base pipeline scores 0.0 on V17, meaning it produces no valid recommendations under the most challenging data regime. The AE Predictor alone breaks this failure floor (0.2915), and the fully AE pipeline extends it further to 0.3615.
    • The recommendation model benefits from AE evolution beyond what better predictions alone provide: On all three imbalanced datasets, the jump from (AE Predictor / base Recommender) to (AE Predictor / AE Recommender) is consistent and substantial — +0.0378 on V15, +0.0576 on V16, +0.0700 on V17 — indicating that the recommendation model itself harbours independent optimisation headroom that AE successfully exploits.
    • Gains scale with dataset difficulty: The absolute improvement of the fully AE pipeline over the all-base baseline increases as difficulty grows — +0.0661 on V15, +0.0879 on V16, and +0.3615 on V17 (from zero). This suggests AE’s evolutionary search is particularly valuable in challenging, noisy data regimes where conventional optimisation approaches stall.

    Conclusion

    • The results are compelling: the evolved models achieved an improvement in prediction accuracy of up to 10% over the baseline model.
    • Prediction: No single variant dominates across all datasets: Cross_Modal_Attn leads on V15, Centroid_Loss leads on V25/V26/Real, and Focal_Loss is indispensable for V17 — implying that for a production deployment spanning diverse data regimes, an ensemble or dataset-aware model selection strategy would be optimal.
    • Recommendation: The evolved prediction model alone delivers meaningful gains over the base recommendation pipeline. Combining the best prediction and recommendation models improves recommendation scores even further.

    This project was a collaboration between the WPP Research team, including Anastasios Tsourtis and Theodoros Lappas, and the AI for Science team at Google Cloud, including (but not limited to) Kartik Sanu, Laurynas Tamulevičius, Nicolas Stroppa, Chris Page, Gary Ng, John Semerdjian, Skandar Hannachi, Vishal Agarwal, Anant Nawalgaria, and Gabriela Hernandez Larios, as well as partners at Google DeepMind.

    References

    1. Novikov, A., Vũ, N., Eisenberger, M., Dupont, E., Huang, P.-S., Wagner, A. Z., Shirobokov, S., Kozlovskii, B., Ruiz, F. J. R., Mehrabian, A., Kumar, M. P., See, A., Chaudhuri, S., Holland, G., Davies, A., Nowozin, S., Kohli, P., & Balog, M. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131 [cs.AI]. https://arxiv.org/abs/2506.13131
    2. Mouret, J.-B., & Clune, J. (2015). Illuminating search spaces by mapping elites. arXiv:1504.04909 [cs.AI]. https://arxiv.org/abs/1504.04909
    3. Romera-Paredes, B., Barekatain, M., Novikov, A., Balog, M., Kumar, M. P., Dupont, E., Ruiz, F. J. R., Ellenberg, J. S., Wang, P., Fawzi, O., Kohli, P., & Fawzi, A. (2024). Mathematical discoveries from program search with large language models. Nature, 625(7995), 468–475. https://doi.org/10.1038/s41586-023-06924-6
    4. Tanese, R. (1989). Distributed genetic algorithms for function optimization. University of Michigan.
  • Cracking the Code of Campaign Success with Google’s AlphaEvolve Agent

    In the fast-paced world of digital marketing, one deceptively simple question keeps resurfacing: “What knowledge can we extract from successful past campaigns to make better future marketing decisions?”

    Every brand sits on a goldmine of historical campaign data: thousands of images, videos, and overall campaign configurations that either soared or sank. The challenge isn’t a lack of information; it’s injecting that knowledge at the precise moment the next decision is being made. How do we operationalise lessons learned to answer questions like:

    • Prediction
      “Given Brand A, the target region of São Paulo, a set of creatives featuring outdoor sports imagery, and an audience group of millennials aged 25–34, how well is the campaign expected to perform?”
    • Recommendation
      “Given Brand B and a target region of Milan, what should the creatives (videos/images) look like to maximise engagement among environmentally conscious consumers aged 15–18?”

    A common suggestion is to simply “ask an AI.” While modern Large Language Models (LLMs) are remarkably capable and encode broad real-world knowledge, they lack the tribal knowledge embedded in your proprietary data. They don’t know your specific brand voice, your audience’s unique quirks, or the subtle patterns behind your past failures. To truly win, you need a system that learns from your history—the hits, the misses, and everything in between.

    To address this, the WPP Research team invests significant effort in developing prediction and recommendation models trained on large and diverse volumes of historical campaign data. These models are highly competitive and continuously improving. However, at some point during development, progress inevitably hits a plateau: even incremental gains—rarely exceeding 1%—demand extensive bibliographic research, days or even weeks of trial-and-error experimentation, and painstaking fine-tuning.

    With time at a premium, a vast space of possible improvements to explore (architectural changes, hyperparameter tuning), and experiments that are inherently slow to run, we turned to Google’s AlphaEvolve (AE) [1]: a Gemini-powered agentic framework that reframes model development as an evolutionary search problem. Rather than relying on manual experimentation, AlphaEvolve autonomously proposes, evaluates, and refines candidate model architectures in an iterative loop, guided by the expertise of our Data Science team and grounded in objective performance metrics.

    The results are striking: what weeks of manual experimentation struggled to improve by a single percentage point, AlphaEvolve achieved in a fraction of the time, delivering prediction accuracy gains of up to 10% on both synthetic and real datasets, while simultaneously lifting downstream recommendation scores up to 7%.

    Our access to AlphaEvolve came through Google’s Early Access Program (EAP), within the context of the ongoing partnership between Google and the WPP Research team. Throughout our adventures with AlphaEvolve, we have been collaborating closely with the Google Research team, providing and receiving feedback. This collaboration has been invaluable to the project’s success.

    The AlphaEvolve Advantage

    Building a good AI model is painfully slow. A team of experts reads through mountains of research papers, rewrites code by hand, and runs experiments that can take days, only to find the improvement is tiny, or worse, a dead end. This research → code → test → repeat cycle creates a huge gap between having data and actually getting value from it.

    And even after you pick a model architecture, you still have to tune it. Think of it like adjusting the equalizer on a stereo: dozens of sliders, each affecting the sound, and you’re trying to find the perfect combination by ear. Techniques like grid search and Bayesian optimization help, but they’re still limited by what the human designer guesses might work, not what the data actually needs. Trying every possible combination? Far too expensive and slow.

    The honest truth is that the search space is simply too vast for human intuition and trial-and-error to navigate. This is exactly where AlphaEvolve (AE) changes the game.

    Instead of a person manually tweaking one model at a time, AE treats the entire development process as an evolutionary search. Much like natural selection, but for code. It generates candidate models as functional programs, runs them, and scores each one against a target metric. It doesn’t just tune models. It designs them from scratch.

    Under the hood, AE is powered by Google’s state-of-the-art Gemini model, working hand-in-hand with a curated program database from Google DeepMind. Together, they explore millions of possible code configurations, zeroing in on the most accurate solution that meets our constraints. A search of this breadth would take a human team months. AlphaEvolve does it in a fraction of the time.

    By shifting from manual experimentation to this autonomous framework, we don’t just speed things up. We uncover strategies and architectures that human intuition alone would never find. Figure 3 illustrates this iterative loop in action.

    Figure 3 | AlphaEvolve is a Gemini-powered coding agent from Google that automatically improves algorithms through a “generate, test, and refine” loop. The user provides three inputs: a description of the problem, a way to score candidate solutions, and a starting program to build from. AlphaEvolve then proposes many code variations using Gemini, scores each one automatically, and keeps the best-performing ideas—recombining and evolving them over multiple rounds, much like natural selection. With each cycle, the solutions get sharper, often surpassing what the original starting point could achieve.
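    The “generate, test, and refine” loop can be sketched in a few lines of Python. This is a toy illustration, not AlphaEvolve’s actual implementation: `propose_variant` stands in for the Gemini call that rewrites a program, and `evaluate` for the user-supplied scorer; both are hypothetical placeholders.

```python
import random

def propose_variant(parent: str) -> str:
    """Stand-in for the LLM call that rewrites the parent program.
    Here we just append a random mutation tag for illustration."""
    return parent + f"+m{random.randint(0, 9)}"

def evaluate(program: str) -> float:
    """Stand-in for the user-supplied scorer (e.g. validation F1).
    Illustrative only: 'programs' with more accepted mutations score higher."""
    return float(len(program))

def evolve(seed: str, generations: int = 20, population: int = 4) -> str:
    """Minimal 'generate, test, refine' loop: propose candidates from the
    current best, score each, and keep the winner for the next round."""
    best, best_score = seed, evaluate(seed)
    for _ in range(generations):
        candidates = [propose_variant(best) for _ in range(population)]
        for cand in candidates:
            score = evaluate(cand)
            if score > best_score:
                best, best_score = cand, score
    return best

best = evolve("seed")
```

    The real system replaces both placeholders with an LLM-driven code mutator and a full training-and-evaluation run, and keeps a whole database of promising programs rather than a single incumbent.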

    Guided Evolution: The Human in the Loop

    AlphaEvolve is autonomous, but it is not unsupervised. Think of it as digital evolution: the AI proposes ideas and keeps only the winners to build upon in the next generation. This process still requires careful navigation by our Data Scientists, who provide clear system instructions and constraints to guide the search through an infinite landscape of potential improvements, while inspecting for deviations introduced by the stochastic nature of LLMs. The result is a search that stays focused on logical, high-quality architectures and respects the real-world boundaries of the problem we are addressing.

    In the example below, we illustrate the inputs that AE expects from the human in the loop, as well as the output that it produces.

    Input 1: A System prompt describing the problem and steering evolution towards promising search directions.

    An example system prompt is: “Evolve a training model for a neural network 3-class classifier that achieves high accuracy on a provided dataset. The model must consist of a loss function that… Focus on the multi-objective optimization of the following scores… Consider changing the model architecture to include…”

    Think of the System Prompt as the instruction manual you hand to AlphaEvolve before it starts work. Imagine hiring a highly skilled but very literal engineer. They’re brilliant, but they need a clear, written brief to work from — they won’t assume anything. The System Prompt is that brief. It channels AlphaEvolve’s enormous computational power toward the right problem, in the right direction. It covers:

    • What the job is — e.g., “Build a model that can classify campaign outcomes into three categories.”
    • What the rules are — constraints it must respect, such as how the input data is structured or what the model architecture must look like.
    • Where to focus — specific areas to explore and improve, for example: “Try changing the loss function”.
    • What success looks like — the specific performance goals it should be optimising for (e.g., accuracy scores). This is also why the human expertise of the Data Science team remains critical.

    Input 2: A Seed Program with an initial solution that you hand to AlphaEvolve to improve.

    Rather than asking AlphaEvolve to build something from scratch, you give it a model that already works — and ask it to make it better. The team deliberately marks which parts AlphaEvolve is permitted to experiment with (using special labels in the code), and which parts must remain untouched. The Seed Program represents the accumulated expertise and investment already put into your AI models. AlphaEvolve doesn’t throw that away — it builds on top of it. It’s the difference between renovating a solid building versus demolishing it and starting over.
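    A minimal sketch of what such a seed program might look like, assuming comment markers in the `EVOLVE-BLOCK` style described in the AlphaEvolve paper (the exact syntax in the hosted tooling may differ). The `build_encoder` helper and its layer-list representation are purely illustrative; a real seed is a full training script.

```python
# Fixed scaffolding: the data contract the evolved code must respect.
INPUT_DIM, NUM_CLASSES = 128, 3

# EVOLVE-BLOCK-START  (marker style follows the AlphaEvolve paper;
# the exact syntax may differ in practice)
PROJ_DIM = 64          # hyperparameters inside the block are fair game
EPOCHS = 30
WEIGHT_DECAY = 1e-4

def build_encoder():
    """Baseline encoder spec that AlphaEvolve may restructure freely.
    Represented here as a simple layer list for illustration."""
    return [("linear", INPUT_DIM, PROJ_DIM),
            ("relu",),
            ("linear", PROJ_DIM, NUM_CLASSES)]
# EVOLVE-BLOCK-END

# Fixed scaffolding again: the evaluation harness outside the block is frozen.
layers = build_encoder()
```

    Everything outside the markers stays untouched, so the evolved code always plugs back into the same training and evaluation harness.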

    Input 3: The Target metric that AE will attempt to maximise in order to achieve our objective.

    The Target Metric is essentially how the business defines “better.” This is a critical decision made by the Data Science team — not the AI. If the metric is well-chosen, AlphaEvolve will find solutions that genuinely deliver business value. If it’s poorly defined, the AI could optimize for the wrong thing entirely. Imagine you’re running a sales team and you’ve set a clear goal: maximise the conversion rate. Every change your team tries — new pitch, new pricing, new outreach method — gets evaluated against that one number. If a change improves the conversion rate, you keep it. The Target Metric works exactly the same way for AlphaEvolve. It might be something like “predict campaign performance as accurately as possible” — expressed as a single numerical score. AlphaEvolve runs each candidate model, checks the score, and keeps only the ones that do better. So the Target Metric is the objective, measurable definition of what winning looks like.
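    The keep-if-better rule can be sketched as follows. The `target_metric` here is plain accuracy purely for brevity; the actual metric in this work is an average F1 score over three classes.

```python
def target_metric(y_true, y_pred):
    """Toy target metric: plain accuracy as a single maximisable score."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# AlphaEvolve keeps a candidate only if its score beats the incumbent.
incumbent_score = target_metric([0, 1, 2, 1], [0, 1, 1, 1])
candidate_score = target_metric([0, 1, 2, 1], [0, 1, 2, 1])
keep_candidate = candidate_score > incumbent_score
```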

    Input 4: The Stopping criteria.

    The Stopping Criteria is simply the pre-agreed rule for when to call it done. Since AlphaEvolve could theoretically keep running and experimenting forever, the team sets clear boundaries upfront for when the experiment should end. A maximum number of rounds — e.g., “Run up to 500 iterations, then stop.” A performance threshold — e.g., “Stop as soon as the model reaches 90% accuracy.” This is like saying: “Once we’ve hit our goal, there’s no need to keep going.”
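    As a sketch, both stopping rules reduce to a single check; the 500-iteration budget and 90% threshold below simply reuse the illustrative figures from the text.

```python
def should_stop(iteration, best_score,
                max_iterations=500, score_threshold=0.90):
    """Pre-agreed stopping rule: halt after a fixed budget of rounds,
    or as soon as the target score is reached, whichever comes first."""
    return iteration >= max_iterations or best_score >= score_threshold

still_searching = not should_stop(10, 0.62)  # neither bound hit yet
```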

    Output: a ranked list of improved AI models.

    Figure 4 shows a ‘before’ (left) and ‘after’ (right) comparison of a section of a seed program that AlphaEvolve was asked to improve. Changes are highlighted in green. We observe several changes:

    • Training parameters were upgraded. For example, the number of training cycles (EPOCHS) was increased, the model’s internal size (PROJ_DIM) grew and a regularisation setting (WEIGHT_DECAY) was adjusted. These are the kind of fine-tuning decisions that would normally take a data scientist considerable time and experimentation to arrive at.
    • The model’s internal logic was redesigned. The component responsible for processing data (the “encoder”) was restructured and even renamed to better reflect its purpose. AlphaEvolve didn’t just tweak numbers: it proposed a more sophisticated architecture and introduced new techniques.
    Figure 4 | Example of an evolved block of code where AE is permitted to modify the contents of the segment. The function contents are modified, and name changes are reflected in other code blocks appropriately. Note that new training parameter values are suggested as well, indicating that architectural changes are paired with compatible hyperparameter tuning.

    Results: does it actually work?

    AlphaEvolve was applied to two core problems:

    • Performance Prediction, which estimates a campaign’s performance based on its configuration.
    • Performance-aware recommendation, which suggests the optimal way to complete/update a campaign’s configuration, in order to maximise its performance.

    Both models had already reached a highly competitive baseline with further manual improvements stalling below 1%.

    Datasets

    We evaluated all models on a suite of six datasets: five synthetic, generated using an internally developed pipeline, and one real-world. The synthetic datasets span a range of regimes (easy, medium, or hard, depending on the noise profile and class balance; classes with fewer samples are characterized as minority): easy and imbalanced (V15), medium and imbalanced (V16), hard and imbalanced (V17), and medium and balanced (V25, V26). The real-world dataset consists of actual historical campaign records and serves as the ultimate validation of whether gains observed on synthetic data transfer to production conditions.

    Prediction

    Three top-performing AE-evolved variants (Centroid_Loss, Cross_Modal_Attn, Focal_Loss) were identified across multiple experiments. All three consistently outperformed the base model across synthetic and real-world datasets.

    To assess model performance we use the industry-standard F1 score, a way of measuring how good a model is at classification (class ‘POS’ is high-performing, class ‘NEG’ is low-performing, class ‘AVE’ is average-performing). It balances two things: precision — “When the model says something is positive, how often is it right?” — and recall — “Out of all the actual positives, how many did the model catch?” If the model is good at one but terrible at the other, the F1 score will be low. We calculate the F1 score separately for each class (NEG, AVE, POS), then take their plain average as the avg F1 score.
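    For concreteness, here is a minimal pure-Python version of that computation: one-vs-rest precision and recall per class, their harmonic mean, then the plain (macro) average. The example labels are invented for illustration.

```python
def f1_per_class(y_true, y_pred, classes=("NEG", "AVE", "POS")):
    """Per-class F1 = harmonic mean of precision and recall, computed
    one-vs-rest; 'avg' is the plain macro mean over the three classes."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[c] = (2 * precision * recall / (precision + recall)
                     if precision + recall else 0.0)
    scores["avg"] = sum(scores[c] for c in classes) / len(classes)
    return scores

y_true = ["AVE", "AVE", "POS", "NEG", "AVE", "POS"]
y_pred = ["AVE", "AVE", "POS", "AVE", "AVE", "NEG"]
scores = f1_per_class(y_true, y_pred)
```

    Note how a class the model never predicts correctly (NEG here) scores 0, which is exactly the degenerate-classifier behaviour observed on V17.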

    • On easy/medium synthetic data (V15, V16): Cross_Modal_Attn achieved the strongest overall performance, reaching 93.09% avg F1-score on V15 (vs. 90.22% baseline) and a striking +11.6 percentage point improvement on the hardest-to-classify minority class POS on V16 (POS F1: 80.20% vs. 68.61%).
    • On the hardest synthetic dataset (V17): Focal_Loss broke through a performance floor that other variants could not — the base model scored 0% on both minority classes (NEG and POS), while Focal_Loss achieved 15.83% and 25.39% respectively.
    • On real-world data: Centroid_Loss delivered the most practically significant gains — +8pp avg F1 (71% vs. 63%), +11.74pp NEG F1, +8.33pp POS F1, and +5.11pp accuracy — validating that AE’s improvements hold on actual production data.

    Across all datasets and variants, gains on the minority classes (correctly identifying high-performing and low-performing campaigns) were consistently larger than gains on the majority class — a particularly valuable outcome given that minority-class accuracy is the critical input for the recommendation model.

    Recommendation

    The recommendation model, which relies on the prediction model’s outputs, was evaluated both in isolation and in a fully evolved end-to-end pipeline. The recommendation score (higher is better) measures how good the recommendations are by comparing them against a known “ground truth” (applicable to synthetic datasets). It rewards recommendations that correctly identify high-performing campaign configurations and penalizes two kinds of failures: i) empty recommendations (the model couldn’t suggest anything) and ii) low-quality recommendations (the model suggested something, but it performs poorly).
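    A heavily simplified sketch of a score with this shape is given below. The production metric is graph-based and more involved; the specific penalty values and the clipping to [0, 1] here are assumptions for illustration only.

```python
def recommendation_score(recommendations, ground_truth_good,
                         empty_penalty=-1.0, bad_penalty=-0.5):
    """Toy normalized recommendation score: +1 for a recommendation that
    matches a known high-performing configuration, a penalty when it is
    empty (None) or low-quality, with the mean clipped to [0, 1] so a
    degenerate recommender bottoms out at 0 (penalty values assumed)."""
    if not recommendations:
        return 0.0
    total = 0.0
    for rec in recommendations:
        if rec is None:                  # empty recommendation
            total += empty_penalty
        elif rec in ground_truth_good:   # correct high performer
            total += 1.0
        else:                            # low-quality suggestion
            total += bad_penalty
    return max(0.0, min(1.0, total / len(recommendations)))

good = {"configA", "configB"}            # hypothetical ground truth
score = recommendation_score(["configA", None, "configZ"], good)
```

    Under this sketch, a recommender whose suggestions are all empty or wrong scores 0.0, mirroring the failure floor the base pipeline hit on V17.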

    • Swapping in the AE-evolved predictor alone improved recommendation scores meaningfully: +6.5% on easy data (V15), +9.8% on medium data (V16), and lifted the hard dataset (V17) from a score of 0.0 (which essentially means that all recommendations were wrong) to 0.29.
    • Combining the AE-evolved predictor with an AE-evolved recommender produced the strongest results across all datasets, with the fully evolved pipeline achieving scores of 0.5 (V15), 0.4 (V16), and 0.36 (V17) — confirming that the gains from prediction and recommendation evolution are additive.
    • Recommendation improvements of up to 7% were observed when both components were evolved together.

    Conclusion

    AlphaEvolve works — and it works exceptionally well. It represents a meaningful and measurable step forward in model development. Applied to WPP AI Lab’s campaign prediction and recommendation models, which had already reached a performance plateau through conventional means, AlphaEvolve delivered prediction accuracy gains of up to 10% on both synthetic and real datasets, while simultaneously lifting downstream recommendation scores by up to 7%. It surfaces architectural strategies and configurations that lie beyond the reach of human intuition alone, not by replacing the expertise of our Data Science team, but by amplifying it. The human-in-the-loop dynamic remains essential: our scientists shape the search space, define meaningful constraints, and validate the outputs.

    AlphaEvolve does the heavy lifting of exploration. As prediction and recommendation models continue to grow in complexity, AlphaEvolve offers a glimpse of a future where the gap between data collection and model improvement is measured in hours rather than weeks, and where the best-performing systems are not just built by experts, but co-designed with AI.

    This project was a collaboration between the WPP Research team, including Anastasios Tsourtis and Theodoros Lappas, and the AI for Science team at Google Cloud, including (but not limited to) Kartik Sanu, Laurynas Tamulevičius, Nicolas Stroppa, Chris Page, Gary Ng, John Semerdjian, Skandar Hannachi, Vishal Agarwal, Anant Nawalgaria, and Gabriela Hernandez Larios, as well as partners at Google DeepMind.

    References

    1. Novikov, A., Vũ, N., Eisenberger, M., Dupont, E., Huang, P.-S., Wagner, A. Z., Shirobokov, S., Kozlovskii, B., Ruiz, F. J. R., Mehrabian, A., Kumar, M. P., See, A., Chaudhuri, S., Holland, G., Davies, A., Nowozin, S., Kohli, P., & Balog, M. (2025). AlphaEvolve: A coding agent for scientific and algorithmic discovery. arXiv:2506.13131 [cs.AI]. https://arxiv.org/abs/2506.13131

    Ready to explore the specifics? Read our full technical report for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.