Author: Eirini Kolimatsi

  • Campaign Performance Modelling Pod

    Predicting digital marketing campaign outcomes is difficult because campaign data is complex and multimodal. To advance this, we developed multimodal fusion models for multiclass campaign outcome classification, rigorously evaluated on controlled datasets produced by a Synthetic Dataset Generator. Benchmarking a LightGBM model against two deep learning architectures, we found that the deep learning models consistently and significantly outperform the baseline, especially in noisy and complex environments. The Robust Modality Network, our top performer, maintained a Macro F1 score above 80% under challenging conditions, confirming the robustness of embedding-based, multimodal deep learning for reliable campaign prediction.

    If you don’t care about the technical details, read our blog post instead.

    From data to growth: Benchmarking modality-aware models for campaign prediction

    Predicting the performance of digital marketing campaigns prior to deployment is an intricate, high-dimensional optimisation problem. Historically, navigating this complex and scattered landscape of decisions has required significant human expertise and iterative adjustments to optimise outcomes. A primary technical hurdle in solving this programmatically is the extreme heterogeneity of the feature space. Campaigns are inherently multimodal, composed of fundamentally different data types: structured demographic data (audiences), unstructured semantic text (brand identity), visual assets (images), and spatial coordinates (geographic locations).

    To address this, we frame campaign performance prediction as a multiclass classification problem, where the objective is to categorise expected outcomes into discrete performance tiers (e.g., Negative, Average, Positive). To process the disparate inputs, we employ multimodal fusion models designed to independently encode these varied data types and project them into a unified, continuous representation space.

    The core question driving our research is whether these advanced machine learning architectures can reliably decode the intricate, cross-channel variables that dictate campaign success.

    Can multimodal fusion models, when rigorously evaluated against controlled synthetic environments, accurately capture and predict the complex, non-linear dependencies that drive marketing performance?

    This documentation details our machine learning strategies designed to transition campaign creation from historical guesswork to data-driven predictive modelling.

    Datasets & encoding

    Dataset

    To evaluate our models effectively, we needed a data source that offered more precision than standard real-world datasets. In many domains, “ground truth” labels can be subjective or noisy, if known at all. To bypass this, we used the Synthetic Dataset Generator, a service that produces labelled datasets with deterministic characteristics.

    By using synthetic data, we can control the difficulty of the task and ensure that the labels (Positive, Average, or Negative) are derived from a known mathematical structure rather than human intuition.

    At the core of our generator is a deterministic graph structure. Each data point is not just a collection of random attributes, but a set of interconnected nodes. The class label for any given entry is determined by the specific edges (relationships) between these nodes.

    The attributes used in our datasets are organized into five distinct modalities:

    • Audience: A composite profile including gender, age groups, and specific interest categories.
    • Brand: A descriptive paragraph outlining the brand’s identity and core values.
    • Creative: A text-based caption describing the visual asset used in the campaign.
    • Geo: A list of specific zip codes representing the target geographic locations.
    • Platform: The delivery channel or platform used for the distribution.

    By adjusting the configuration of the generator, we can manipulate the complexity of the relationships within the graph. The signal strength is defined by the fraction of positive versus negative edges within a data point. By manipulating these thresholds, we can directly tune the separability of the classes, making the generator a flexible tool for benchmarking and stress-testing models under different conditions.

    We generated three distinct dataset versions of increasing complexity:

    • v25 (strong signal): This is our “cleanest” dataset and serves as the baseline. In each data point, more than 45% of the edges belong to the target class, while noise from the opposite class is capped at 25%. With high separation and minimal interference, this version tests the model’s ability to identify clear patterns.
    • v26 (medium signal): This variant introduces significant “label pollution.” While target class edges can reach up to 90% density, the opposite class noise can rise to 35%. This creates a much noisier environment, particularly for the “Average” rows, which consist of a messy, conflicting mix of both positive and negative edges.
    • v28 (low signal / high volume): This is our most challenging stress test. The signal is weaker than in the previous datasets, with only 34% to 44% of edges belonging to the target class, while noise remains comparatively high (12% to 24%). In this set, the “Average” class is characterized by sparsity: rather than being a mix of signals, these rows contain fewer edges overall, most of which connect to unrelated nodes. This forces the model to learn to identify the absence of a strong signal rather than just filtering out noise.

    Detailed parameter configurations for the positive, negative, and average datasets are provided in the table below.

    | Label | Parameter      | V25         | V26         | V28              |
    |-------|----------------|-------------|-------------|------------------|
    | pos   | pos_frac_range | [0.45, 1]   | [0.45, 0.9] | [0.3387, 0.4387] |
    | pos   | neg_frac_range | [0, 0.25]   | [0.1, 0.35] | [0.1192, 0.2444] |
    | pos   | num_samples    | 18,000      | 18,000      | 24,000           |
    | neg   | pos_frac_range | [0, 0.25]   | [0.1, 0.35] | [0.1035, 0.2259] |
    | neg   | neg_frac_range | [0.45, 1]   | [0.45, 0.9] | [0.3833, 0.4833] |
    | neg   | num_samples    | 18,000      | 18,000      | 24,000           |
    | avg   | pos_frac_range | [0.25, 0.5] | [0.25, 0.65]| [0.1035, 0.2259] |
    | avg   | neg_frac_range | [0.25, 0.5] | [0.25, 0.65]| [0.1192, 0.2444] |
    | avg   | num_samples    | 24,000      | 24,000      | 72,000           |

    Parameter configuration for dataset versions.
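
    As a concrete illustration, the thresholds in the table can be written as a configuration fragment. The dictionary keys mirror the table's parameter names, but the actual Synthetic Dataset Generator interface may differ, and the `separability` helper is our own rough difficulty proxy, not part of the service:

```python
# Hypothetical configuration fragment mirroring the table of dataset versions.
DATASET_CONFIGS = {
    "v25": {  # strong signal
        "pos": {"pos_frac_range": (0.45, 1.0), "neg_frac_range": (0.0, 0.25), "num_samples": 18_000},
        "neg": {"pos_frac_range": (0.0, 0.25), "neg_frac_range": (0.45, 1.0), "num_samples": 18_000},
        "avg": {"pos_frac_range": (0.25, 0.5), "neg_frac_range": (0.25, 0.5), "num_samples": 24_000},
    },
    "v26": {  # medium signal
        "pos": {"pos_frac_range": (0.45, 0.9), "neg_frac_range": (0.1, 0.35), "num_samples": 18_000},
        "neg": {"pos_frac_range": (0.1, 0.35), "neg_frac_range": (0.45, 0.9), "num_samples": 18_000},
        "avg": {"pos_frac_range": (0.25, 0.65), "neg_frac_range": (0.25, 0.65), "num_samples": 24_000},
    },
    "v28": {  # low signal / high volume
        "pos": {"pos_frac_range": (0.3387, 0.4387), "neg_frac_range": (0.1192, 0.2444), "num_samples": 24_000},
        "neg": {"pos_frac_range": (0.1035, 0.2259), "neg_frac_range": (0.3833, 0.4833), "num_samples": 24_000},
        "avg": {"pos_frac_range": (0.1035, 0.2259), "neg_frac_range": (0.1192, 0.2444), "num_samples": 72_000},
    },
}

def separability(cfg: dict) -> float:
    """Rough difficulty proxy: gap between the minimum target-edge fraction
    and the maximum opposite-class fraction for 'pos' rows."""
    pos = cfg["pos"]
    return pos["pos_frac_range"][0] - pos["neg_frac_range"][1]
```

    Under this proxy, v25 has a clearly wider signal-to-noise gap than v28, which is exactly what makes the latter the harder stress test.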

    Encoding

    Since our dataset features range from long-form text descriptions to specific geospatial markers, we employ distinct encoding strategies to transform these modalities into a format our machine learning models can process.

    Text embeddings

    For all text-based fields, we use the text-embedding-005 model (part of the Google Gemini family). We chose high-dimensional embeddings over traditional methods like “one-hot encoding” because they capture semantic relationships. For instance, an embedding model allows the system to recognise that “Instagram Stories” and “Facebook Reels” are semantically similar, even if the raw text doesn’t match. This ensures that the nuances of our campaign metadata are preserved, providing the predictive model with a rich, context-aware starting point.

    Depending on the specific model architecture, we configured the embedding parameters to match the downstream task:

    • Deep Learning (Arch 1): We use a dimensionality of 768 with the task type set to CLASSIFICATION.
    • LGBM & Deep Learning (Arch 2): We use a dimensionality of 512 with the task type set to SEMANTIC_SIMILARITY.

    Geospatial encoding

    The Geo modality consists of US zip codes. Rather than treating these as simple text strings or categorical IDs, we use Google’s Population Dynamics Foundation Embeddings (PDFM).

    These are dense vector representations designed to encapsulate the complex, multidimensional interactions between human behavior, environmental factors, and local contexts at specific locations. By using PDFM, the model can capture spatial relationships and demographic overlaps that would be impossible to detect using zip codes alone.

    To aggregate our geospatial data, we use a straightforward but effective pooling strategy. Since each campaign can target a list of multiple US zip codes, we retrieve the individual PDFM embedding for each location in the set. We then calculate the centroid (mean) of these vectors to produce a single, final representation for the entire geographic list.

    By averaging the embeddings, we create a unified “geographical profile” that represents the collective demographic and behavioral characteristics of the target area, ensuring the model receives a fixed-size input regardless of how many zip codes are in the list.
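
    The pooling step can be sketched in a few lines. This is a minimal sketch: the PDFM lookup that produces each per-zip-code vector is assumed to happen upstream, and `pool_geo_embeddings` is an illustrative name:

```python
import numpy as np

def pool_geo_embeddings(zip_embeddings: list[np.ndarray]) -> np.ndarray:
    """Mean-pool per-zip-code PDFM vectors into one fixed-size
    'geographical profile' for the whole target list."""
    if not zip_embeddings:
        raise ValueError("a campaign must target at least one zip code")
    # Stack to (n_zips, dim) and take the centroid along the zip axis.
    return np.mean(np.stack(zip_embeddings), axis=0)
```

    Whether the list holds one zip code or fifty, the output dimensionality is the same, which is the property the downstream model relies on.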

    Methodology

    After extensive experimentation with various machine learning approaches, we converged on three primary models to evaluate predictive performance across our synthetic datasets. We use a high-performing tree-based model as our baseline benchmark, alongside two distinct deep learning architectures designed to capture multi-modal relationships.

    Hierarchical LightGBM

    Our baseline approach uses a custom LightGBM implementation that treats the multi-class problem (Negative, Average, Positive) as a two-stage hierarchical classification. By decomposing the problem into two simpler binary tasks, we can optimise the model to better distinguish the “Positive” class from the others.

    The model consists of two separate LightGBM classifiers that run sequentially during inference.

    Stage 1: Identifying the positive class

    The first classifier focuses on separating positive samples (class 2) from the combined pool of negative and average samples (classes 0 and 1 respectively). It is trained on the entire dataset with balanced class weights. During inference, if the model’s predicted probability for Class 2 exceeds a dynamically tuned threshold (pos_threshold), the sample is definitively assigned to the Positive class.

    Stage 2: Distinguishing negative from average

    Any sample not classified as Positive in Stage 1 is passed to the second classifier. This stage is trained exclusively on a subsample of the data (where the true label is not Class 2) to focus on the nuances between Class 0 (Negative) and Class 1 (Average). If the predicted probability for Class 0 is greater than 0.5, it is assigned to the Negative class; otherwise, it is labelled Average.
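
    A minimal sketch of the two-stage decision rule, assuming both classifiers' probability outputs are already available. For brevity this vectorised version runs Stage 2 everywhere and lets Stage 1 override it, which yields the same labels as the sequential description above:

```python
import numpy as np

def hierarchical_predict(p_stage1_pos, p_stage2_neg, pos_threshold=0.5):
    """Combine the two stages' probabilities into final labels.
    p_stage1_pos: Stage 1 probability of the Positive class (2).
    p_stage2_neg: Stage 2 probability of the Negative class (0).
    pos_threshold is the dynamically tuned cut-off; 0.5 is a placeholder."""
    p_stage1_pos = np.asarray(p_stage1_pos)
    p_stage2_neg = np.asarray(p_stage2_neg)
    labels = np.where(p_stage2_neg > 0.5, 0, 1)               # Negative vs Average
    return np.where(p_stage1_pos > pos_threshold, 2, labels)  # Stage 1 overrides
```

    For example, `hierarchical_predict([0.9, 0.2, 0.1], [0.1, 0.8, 0.3], pos_threshold=0.6)` yields the labels 2 (Positive), 0 (Negative), and 1 (Average).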

    Data handling and class drift

    To ensure the model remains robust against distribution shifts, we implemented two key data-processing steps:

    • Categorical encoding: While most features use high-dimensional embeddings, the Platform modality is treated as a categorical value. It passes through a standard LabelEncoder before being fed into the LightGBM models.
    • Addressing class drift: To prevent the model from overfitting to a majority class, we monitor the target distribution during training. If the majority class cardinality exceeds the minority by more than 1.4x, we apply a selective downsampling strategy. This caps the majority class size and ensures the gradient boosting process isn’t biased toward the most frequent labels.
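
    The downsampling guard can be sketched as follows. Only the 1.4x ratio comes from the text; `cap_majority_class`, the seeding, and the exact cap rule are our assumptions:

```python
import numpy as np

def cap_majority_class(y: np.ndarray, ratio_cap: float = 1.4, seed: int = 0) -> np.ndarray:
    """Return kept row indices. Any class whose cardinality exceeds
    `ratio_cap` x the minority count is randomly downsampled to that cap."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    cap = int(counts.min() * ratio_cap)
    keep = []
    for cls in classes:
        idx = np.flatnonzero(y == cls)
        if idx.size > cap:  # class drifted past the allowed ratio
            idx = rng.choice(idx, size=cap, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))
```
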

    Shared latent space (Deep Learning Arch1)

    The first deep learning architecture is designed to project all five modalities (regardless of their original format) into a unified, shared latent space. The goal is to ensure that related features (like a specific “Brand” and a specific “Geo”) that perform well together are placed close to each other before the model attempts to classify them.

    The projection architecture

    Each modality begins by passing through its own dedicated Projection Block. This stage standardizes the varied input dimensions into a consistent representation using a sequence of:

    • Linear Layer: Mapping the input to a common projection dimension.
    • ReLU Activation: Introducing non-linearity.
    • LayerNorm: Ensuring stable gradients and consistent scaling across different feature types.
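
    A minimal PyTorch sketch of one such Projection Block (the dimensions are illustrative, not the tuned values):

```python
import torch
from torch import nn

def make_projection_block(in_dim: int, proj_dim: int) -> nn.Sequential:
    """One per-modality Projection Block: Linear -> ReLU -> LayerNorm."""
    return nn.Sequential(
        nn.Linear(in_dim, proj_dim),  # map to the common projection dimension
        nn.ReLU(),                    # non-linearity
        nn.LayerNorm(proj_dim),       # stable scaling across feature types
    )

# e.g. project 768-dim Brand embeddings into a 256-dim shared space
brand_projection = make_projection_block(768, 256)
```
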

    Cross-modal alignment (similarity loss)

    To force the model to learn how these modalities interact, we implement a Cosine Embedding Loss. We calculate this loss across every possible pair of modality projections (e.g., Brand vs. Creative, Audience vs. Platform).

    By setting a positive target for these pairs, we mathematically “pull” the vectors closer together in the latent space. This ensures that features belonging to the same campaign are mapped to a similar neighborhood in the vector space, creating a coherent, multi-modal representation.
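
    The pairwise alignment objective can be sketched with PyTorch's `CosineEmbeddingLoss`. The modality names and the final averaging over pairs are our assumptions:

```python
import itertools
import torch
from torch import nn

def pairwise_alignment_loss(projections: dict[str, torch.Tensor]) -> torch.Tensor:
    """Cosine embedding loss over every pair of modality projections.
    A target of +1 'pulls' same-campaign vectors together."""
    criterion = nn.CosineEmbeddingLoss()
    losses = []
    for a, b in itertools.combinations(sorted(projections), 2):
        x, y = projections[a], projections[b]
        target = torch.ones(x.size(0))  # +1 = "these should be similar"
        losses.append(criterion(x, y, target))
    return torch.stack(losses).mean()
```

    In training, this term is added to the classification loss with a weighting constant, which is one of the hyperparameters tuned for this architecture.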

    The classification head

    Once the projections are aligned and concatenated, they are fed into a Multi-Layer Perceptron (MLP) for the final prediction:

    1. Linear Layer: Aggregates the combined, multi-modal features.
    2. ReLU & Dropout (0.3): Provides non-linearity while preventing overfitting by randomly deactivating neurons during training.
    3. Linear Layer (output): Maps the features to the three target classes (Negative, Average, Positive).

    Optimisation

    In addition to the cosine embedding loss, we use CrossEntropyLoss to optimise the classification task, teaching the model which feature combinations lead to specific outcomes. The entire network is trained using the AdamW optimiser, which provides effective weight decay to maintain the integrity of the learned embeddings.

    Representation of the Deep Learning Arch1 model architecture.

    Robust modality network (Deep Learning Arch2)

    While our first architecture focused on aligning different data types into a shared space, our second deep learning model, the Robust Modality Network, is designed for resilience. Its primary goal is to handle data sparsity and prevent the model from becoming overly dependent on a single “strong” feature, such as a specific platform or brand description.

    Deep projection and feature weighting

    Unlike the single-layer projections in Arch 1, this model uses a deeper projection block for each modality. By passing each input through a sequence of Linear -> Dropout -> Linear -> LayerNorm, the model gains enough capacity to “filter” the high-dimensional embeddings. This allows the network to internally re-weight the features, emphasising the specific parts of an embedding (like a particular interest group within the Audience modality) that are most predictive of the final outcome.

    Modality dropout: Forced independence

    One of the unique features of this architecture is the implementation of a Modality Dropout layer. Before the data reaches the final classification head, the model randomly “zeroes out” an entire modality (e.g., completely removing the “Creative” or “Geo” data for a specific training pass).

    This serves two critical purposes:

    1. Robustness: It mimics scenarios where certain data might be missing or “noisy” in production.
    2. Feature importance: It prevents the model from “cheating” by over-relying on one dominant modality. By forcing the model to make accurate predictions even when a key feature is missing, we ensure it learns the subtle interactions between the remaining inputs.
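
    A minimal PyTorch sketch of the idea. We assume each modality is dropped independently with probability `p`; the original implementation may instead drop exactly one modality per pass:

```python
import torch
from torch import nn

class ModalityDropout(nn.Module):
    """Zero out entire modality tensors during training so the head
    cannot over-rely on any single dominant modality."""

    def __init__(self, p: float = 0.15):
        super().__init__()
        self.p = p

    def forward(self, modalities: list[torch.Tensor]) -> list[torch.Tensor]:
        if not self.training:  # no dropout at inference time
            return modalities
        return [
            torch.zeros_like(m) if torch.rand(1).item() < self.p else m
            for m in modalities
        ]
```

    Zeroing the whole tensor (rather than individual neurons, as standard dropout does) is what forces the network to make predictions from the remaining modalities alone.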

    Streamlined MLP head

    After the modalities are projected and filtered through the dropout layer, they are concatenated and fed into a robust, two-layer Multi-Layer Perceptron (MLP). This head uses ReLU activations and standard dropout to process the combined features into the final three-class prediction.

    During development, we experimented with complex attention mechanisms to weigh these modalities. However, we found that this simpler, MLP-based approach, combined with Modality Dropout, yielded equivalent performance with significantly lower computational overhead and better generalization.

    Optimisation

    As with our first architecture, we use CrossEntropyLoss to optimise the parameters and AdamW as the optimiser. This combination effectively teaches the projection layers which feature combinations lead to Positive, Average, or Negative performance while maintaining a stable training process.

    Representation of the Deep Learning Arch2 model architecture.

    Experimentation

    Leaderboard

    To maintain rigorous benchmarking standards, we implemented a centralized Leaderboard platform integrated directly with the Synthetic Dataset Generator. This system acts as our definitive evaluation environment, decoupling model architecture from data preparation and ensuring every experiment is measured against the same constraints.

    We expose pre-defined train/test splits through a dedicated API endpoint. By forcing every model to use the exact same data distribution, we effectively eliminate “local validation bias”, the risk of a researcher inadvertently tuning a model to a specific random split of the data.

    The evaluation process follows a strict protocol:

    1. Researchers query the endpoint for specific versioned splits (e.g., v25 or v28).
    2. Models are trained on the provided local data.
    3. Predictions are submitted back to the platform for server-side computation.
    4. The evaluation metrics are returned and reviewed by the researcher.

    This creates a “blind” evaluation process and a persistent system of record, allowing us to trace how performance evolves across different architectures and signal strengths.

    While accuracy is a common metric, it can be misleading in datasets where classes aren’t perfectly balanced. A model could achieve high accuracy simply by over-predicting the majority class while failing on the others.

    To prevent this, we use the Macro Average F1 score as our primary ranking metric in the Leaderboard. The Macro F1 treats all three classes (Negative, Average, and Positive) as equally significant. This ensures that a model must perform well across the entire spectrum to rank highly, rather than just overfitting to the most frequent label.
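
    To make the metric concrete, here is a from-scratch Macro F1, matching scikit-learn's `average="macro"` behaviour in the non-degenerate case:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes: int = 3) -> float:
    """Unweighted mean of per-class F1 scores: every class counts
    equally, however rare it is."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))
```

    Because each class contributes one third of the final score, a model that only predicts the majority class is heavily penalised, unlike with plain accuracy.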

    To ensure each architecture performed at its peak, we utilized structured optimisation techniques to move beyond default configurations.

    Hierarchical LightGBM configuration

    • Optimisation: Conducted 10–20 Optuna search trials to maximize Leaderboard performance.
    • Efficiency strategy: Hyperparameter searches were performed on a representative subset of the data, with the final model training on the full dataset.
    • Custom evaluation metric: Optimised trials using a weighted F1 score to prioritise the classes most critical to campaign success:
      • Positives: 50%
      • Negatives: 40%
      • Averages: 10%
    • Tuned parameters:
      • pos_threshold (for Stage 1 classification)
      • Stage 1 & 2 individual parameters: num_leaves, n_estimators, learning_rate, min_data_in_leaf (separate parameters for each stage)

    Deep Learning Arch1 configuration

    • Search method: Utilized Grid Search across various architectural and training configurations.
    • Tuned parameters:
      • Network layer dimensions (categorised as Small, Medium, or High).
      • Similarity loss weighting constant (balancing the alignment of modalities vs. classification accuracy).
      • Batch size variations.
    • Training safeguards: Implemented early stopping to prevent overfitting as the latent space aligned.

    Deep Learning Arch2 configuration

    • Optimisation: Utilized Optuna, similar to the LGBM approach, for a more granular search of the neural network’s hyperparameter space.
    • Tuned parameters:
      • Learning Rate and Batch Size.
      • Dimensions for both the deep projection blocks and the MLP head.
      • Dropout rates for both individual neurons and entire modalities.
    • Dynamic training callbacks:
      • Early stopping: Ends training when validation performance plateaus.
      • Learning Rate Scheduler: Automatically reduces the learning rate by 50% if validation accuracy fails to improve, allowing for finer optimisation in the final epochs.
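
    The scheduler behaviour described above maps directly onto PyTorch's `ReduceLROnPlateau` with `factor=0.5`. A toy sketch follows; the model, optimiser settings, and patience value are illustrative, and only the 50% reduction factor comes from the text:

```python
import torch
from torch import nn

model = nn.Linear(8, 3)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# Halve the LR when the monitored validation metric stops improving.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=2
)

# Validation accuracy plateaus after the second epoch, so the scheduler
# eventually halves the learning rate from 1e-3 to 5e-4.
for val_accuracy in [0.70, 0.72, 0.72, 0.72, 0.72]:
    scheduler.step(val_accuracy)
```
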

    Results

    We evaluated our three candidate models across the synthetic datasets to determine how architectural choices impact robustness as we move from high-signal environments to noisy, high-volume scenarios. The following results quantify each architecture’s resilience against increasing data entropy and feature sparsity, using the Macro F1 score as our primary benchmark.

    | Dataset | Model               | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
    |---------|---------------------|------------|--------|--------|--------|
    | V25     | Hierarchical LGBM   | 85.11%     | 86.4%  | 80.21% | 88.73% |
    | V25     | Deep Learning Arch1 | 86.26%     | 89.81% | 82.51% | 86.44% |
    | V25     | Deep Learning Arch2 | 89.47%     | 91.25% | 85.95% | 91.22% |
    | V26     | Hierarchical LGBM   | 85.54%     | 87.48% | 80.92% | 88.21% |
    | V26     | Deep Learning Arch1 | 84.80%     | 89.41% | 80.50% | 84.50% |
    | V26     | Deep Learning Arch2 | 88.84%     | 91.13% | 85.26% | 90.14% |
    | V28     | Hierarchical LGBM   | 77.40%     | 76.40% | 81.47% | 74.34% |
    | V28     | Deep Learning Arch1 | 80.74%     | 80.88% | 86.10% | 75.25% |
    | V28     | Deep Learning Arch2 | 81.29%     | 82.25% | 84.58% | 77.03% |
    F1 score performance of Hierarchical LGBM and Deep Learning Architectures across synthetic datasets (V25, V26, V28).

    The evaluation reveals a clear distinction between traditional gradient boosting and deep learning models when handling complex data structures.

    • The baseline limits: Our LGBM implementation demonstrated strong efficacy in high-signal environments (v25 and v26), consistently maintaining a Macro F1 score above 85%. It proved particularly effective at binary discrimination (Positive vs. Negative), leveraging its ability to find clean decision boundaries. However, we observed a quantifiable performance degradation in the high-noise, high-volume regime (v28), where the score regressed to 77.40%. This indicates a sensitivity to lower signal-to-noise ratios and sparse data.
    • The Deep Learning advantage: Both Deep Learning architectures consistently outperformed the baseline across all dataset variations. This performance delta is largely due to the networks’ ability to learn dense representations within a latent space. Unlike tree-based logic, which relies on hierarchical feature splitting, these models (specifically DL Arch 2) effectively capture non-linear interactions between modalities like Brand, Geo, and Audience.
    • Generalization under pressure: The architectural advantage became most pronounced in the v28 dataset. While the tree-based model struggled to generalize under high sparsity, the Deep Learning models maintained a Macro F1 score above 80%.

    These results confirm that an embedding-based approach, combined with modality-specific projection and dropout, offers superior generalization in high-dimensional spaces. It allows the model to distinguish a weak signal from statistical noise without overfitting to the majority class.

    Limitations

    While the results on our synthetic benchmarks are promising, several technical and methodological limitations remain that will guide the next phase of our research.

    The embedding quality gap

    A primary constraint lies in the nature of “off-the-shelf” embeddings. Even with high-performing models like text-embedding-005, we must consider how well a generic model captures domain-specific nuances.

    For example, for the brand modality, a descriptive paragraph is compressed into a fixed-length vector. It remains an open question whether this representation truly captures the “brand universe” or the subtle differentiators between competing identities. If the embedding space is too crowded or lacks the granularity to distinguish between two similar but distinct brand strategies, the downstream model’s predictive power will inevitably hit a ceiling.

    Synthetic vs. real-world fidelity

    Working with synthetic data allows for a deterministic “ground truth,” but it introduces a gap in realism. While our generator mimics complex relationships, we face a few core uncertainties:

    • Noise consistency: Is the mathematical noise we introduce in datasets like V28 truly representative of the “messy” data found in real campaigns?
    • Modality alignment: Real campaign data often contains unstructured elements that don’t fit perfectly into our five-modality graph.
    • The “switch” risk: There is always the risk that a model optimised for a synthetic environment will struggle with the unpredictable variance of live campaign performance. Transitioning from the laboratory to the wild is the ultimate test of our current architectures.

    The infinite search space

    In machine learning, the architecture search and experimentation lifecycle is never truly “finished.” The sheer volume of potential configurations, from different attention mechanisms to alternative fusion strategies, requires significant time and compute resources to validate.

    Prioritising which ideas to pursue is a constant challenge. Every “definitive answer” we find often opens up several new branches of inquiry, meaning our current “best” model is likely just a stepping stone toward a more refined solution.

    Future work & next steps

    With the synthetic benchmarking phase yielding concrete results, our focus shifts toward validating these findings in production environments and refining our feature engineering pipelines. The transition from a controlled “laboratory” setting to the complexity of live data will be the ultimate test for our current architectures.

    • Real-world validation: The immediate priority is to migrate our top-performing architectures from the controlled synthetic environment to the real world. We will test whether the architectural advantages observed in the Campaign Performance Modelling Pod translate effectively to real-world campaign data (Data Enrichment Pod).
    • Continual learning & data privacy: Develop strategies for continuously updating models with new data without compromising previously acquired knowledge (e.g., mitigating catastrophic forgetting). This is crucial for integrating new clients, predicting their campaign performance, and exploring paradigms like federated learning to ensure data privacy and secure knowledge sharing across different entities (Multimodal Federated Learning Pod).
    • Feature representation enhancement: We are launching dedicated workstreams to improve the fidelity of our embeddings:
      • Creatives: Optimising the captioning pipeline to capture visual semantics and context more accurately (Self Improvement Performance Agent Pod).
      • Brand: Developing richer, more descriptive representations of brand identity beyond simple identifiers (Brand Perception Atlas Pod).
      • Geo: Investigating optimal encoding strategies for real-world geospatial data, moving beyond the synthetic zip code generation used in this benchmark.
    • LLM-centric experiments: We aim to explore Large Language Models beyond just feature encoding. Future iterations will test LLM Fine-Tuning for direct classification tasks, as well as leveraging generative models for advanced oversampling techniques to address class imbalance in sparse datasets (LLM Finetuning Pod and Data Enrichment Pod).
    • Architectural evolution: We will continue to refine our current Deep Learning and Tree-based baselines while actively scouting for novel architectures that may offer better inductive biases for our specific data modalities (experimentation with AlphaEvolve).
    • Recommendation systems: The ultimate objective is to deploy these architectures as the backbone of a recommendation engine. A dedicated exploration phase is required to bridge the gap between prediction and prescription. It is a known paradox that the most accurate predictor is not always the most effective recommender; depending on the optimisation strategy used, a model that captures the dense, underlying relationships between features may be more valuable than one optimised purely for label discrimination.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.

  • From guesswork to foresight: How AI is predicting the future of marketing campaigns

    Ever wonder why some advertisements seem to pop up exactly when you’re thinking about buying something, while others feel completely irrelevant? Or how a brand knows just the right message to share to get your attention? The answer lies in the evolving world of marketing campaigns, and increasingly, in the powerful capabilities of Artificial Intelligence (AI).

    But what is a marketing campaign?

    At its core, a marketing campaign is a carefully planned series of activities designed to achieve a specific goal for a business – whether that’s selling more products, building brand awareness, or encouraging people to sign up for a service. Think of it like launching a rocket: you need to choose the right destination (your objective), design a powerful engine (your creative message), select the perfect crew (your audience), and pick the best launchpad (your platform).

    The process of creating and running these campaigns involves countless decisions, such as:

    • Audience: Who are we trying to reach? What are their interests, demographics, and behaviours?
    • Brand: What message do we want to convey about our brand? How does our brand resonate with the audience?
    • Creative: What kind of ads should we run? (text, images, videos, headlines, calls to action).
    • Objective: What’s the main goal? (e.g., getting clicks, making sales, increasing brand recognition).
    • Platform: Where should we run these ads? (e.g., Facebook, Instagram, Google Search, TV, billboards).

    Campaign design is a complex process shaped by multiple factors, such as creative genius, market insights, and domain expertise. Marketers launch campaigns, closely track their impact, and then adjust their approach in real time, refining messages or recalibrating target audiences. However, given the diverse and dynamic nature of consumer behavior, this iterative adaptation process can be taxing in terms of both budget and time. It’s like setting a rocket’s course: unforeseen atmospheric shifts can require significant mid-flight corrections, each consuming valuable resources.

    This is where the big challenge lies: how do we predict if a campaign will be successful before we invest significant time and money into it?

    Machine Learning: Your marketing crystal ball

    This challenge is precisely where Machine Learning (ML) steps in. Simply put, Machine Learning is a branch of AI that allows computers to “learn” from data without being explicitly programmed. Instead of following a strict set of rules, ML algorithms analyze vast amounts of past information, identify hidden patterns and relationships, and then use those learnings to make predictions or decisions on new, unseen data.

    In the context of marketing campaigns, ML becomes an incredibly powerful tool:

    • Data powerhouse: Imagine collecting every detail from thousands of past marketing campaigns: who saw them, what the ads looked like, where they were shown, how much they cost, and crucially, what the final outcome was (e.g., how many clicks, sales, or sign-ups they generated). ML algorithms can digest this colossal amount of data in seconds.
    • Pattern recognition: These algorithms don’t just store data; they look for correlations. Did campaigns with a specific type of image perform better with a certain age group? Does a particular headline style lead to more conversions on one platform versus another? ML can uncover these subtle yet powerful insights that human analysts might miss.
    • Predictive power: Once trained, an ML model can take the proposed details of a new campaign (e.g., its target audience, creative idea, intended platform) and predict its likely outcome. It can estimate click-through rates, conversion probabilities, or even the potential return on investment (ROI) before a single dollar is spent.

    The benefits are transformative: marketers can make data-driven decisions, allocate budgets more efficiently, target the most receptive audiences with precision, and ultimately, launch campaigns with a much higher probability of success. It’s like having a detailed weather forecast for your rocket launch, helping you choose the perfect day and trajectory.

    The multimodal challenge: Mixing apples, oranges, and billboards

    In reality, and contrary to what many might assume, a campaign isn’t just a neat row of numbers on a spreadsheet; it’s a vibrant, messy mix of text, images, locations, and abstract concepts like brand identity. This presents a fundamental challenge: how do we empower AI to not just process, but truly understand and effectively connect these inherently different types of information to form a holistic view? For instance, how can an AI understand the interplay between the nuanced visual cues of a video ad and the detailed socio-economic data of a specific target audience in a specific location?

    The “secret sauce” is a technology called embeddings. Think of an embedding as a universal translator. It takes complex information, like the “feeling” of a brand or the intent of a sentence, and turns it into a list of numbers that an algorithm can easily digest. However, every piece of the campaign puzzle requires a different translation strategy.

    Translating the campaign puzzle

    To build a complete picture, we process each element through a specialized lens:

    • Audience, Platform, and Objective: We convert these categories into numerical “flavours.” This allows the AI to recognise the distinct profile of, for example, an Instagram awareness campaign versus a search engine lead-generation tactic.
    • Brand identity: We leverage the fact that Large Language Models (LLMs) already possess a wealth of knowledge about established brands. By feeding the AI a rich, descriptive profile of a brand, we create a deep numerical representation of its identity. This task is so nuanced that it led to the birth of our Brand Perception Atlas Pod.
    • Creative (Images): A picture may be worth a thousand words, but our models currently prefer numbers. To bridge this gap, we use AI to extract a highly detailed description of each image, which is then translated into data. We quickly discovered that the quality of these descriptions depends entirely on the instructions given to the AI. This led us to develop the Self-improving AI Agent.
    • Geography: Location is more than just a pin on a map. To capture the true essence of a region, we use advanced models that go beyond coordinates. In detail, Google’s PDFM (Population Dynamics Foundation Model) embeddings are able to capture the social, economic, and demographic fabric of an area, providing the AI with the “soul” of a location rather than just its name.

    Where does the data come from?

    Real-world marketing data is essential, but on its own it is not enough for AI research. At WPP, we combine rich, real-world data with carefully engineered synthetic data to build and evaluate models more effectively. Real data grounds our work in genuine market behaviour, complexity, and business context. Synthetic data adds something equally important: control. It allows us to create the specific conditions we need to properly challenge, probe, and improve our models.

    This matters because many of the scenarios that determine whether a model is truly robust are rare, emerging, or simply absent from historical records until the moment they become a real problem. To prepare for that, we deliberately generate datasets that introduce edge cases, shifting patterns, variable data volumes, heterogeneity, sparsity, and data drift. In other words, we use synthetic data to stress-test models in ways that real data alone cannot support, so they are more resilient, reliable, and ready for the real world.

    To address this, we built a Synthetic Data Generator. Think of this as a high-fidelity flight simulator for marketing. Instead of testing our models only on the limited “flights” we’ve taken in the past, this tool creates realistic, artificial campaign data. This allows us to:

    • Train with precision: We can create scenarios that haven’t happened yet to see how the AI reacts.
    • Test the limits: We can stress-test our models against extreme market conditions without any real-world risk.
    • Ensure safety: We can evaluate performance using high-quality data that carries none of the privacy concerns of personal information.
    • Hold the answer key: Because we generate this artificial data from scratch, we already know the exact outcome (the “ground truth”) of every scenario. It’s like giving our AI a test where we already hold the perfect answer key, allowing us to verify its predictions and recommendations.

    By “conjuring” this artificial data, we ensure our models are battle-tested and ready for the complexities of the live market.

    From data to decisions: Empowering the expert

    We’ve explored the “ingredients” and the “recipe,” but what does this actually look like in the hands of a marketing expert? Our goal isn’t just to crunch numbers; it’s to provide actionable recommendations that make experts more efficient and their campaigns more successful.

    Imagine a strategist coming to the platform with a specific mission:

    I’m launching a campaign for Brand B, targeting Audience A in Location X, with the objective of Increasing Awareness. What is the best platform and creative style to use?

    To answer a question like this, we need more than a search engine; we need a Predictive Engine, our “crystal ball”.

    Before we can offer a recommendation, we must train a Machine Learning (ML) model to understand performance. We teach it to look at millions of historical and synthetic data points to predict an outcome: Is this specific combination of elements likely to be Good, Average, or Bad?

    There isn’t just one way to build this crystal ball. In our research, we explore a spectrum of algorithms, including both traditional models and modern techniques. Each approach offers its own set of advantages: some prioritise speed, while others prioritise pinpoint accuracy. By testing across this variety, we ensure that when an expert asks for a recommendation, the answer is backed by the most robust mathematical thinking available today.

    1. The reliable workhorse: LightGBM

    We started with a classic, high-speed approach called LightGBM. Think of this as a highly efficient logic tree. It’s fast, dependable, and excellent at spotting clear patterns in structured data. It serves as our “baseline”, the standard we aim to beat.

    2. The specialist team: Neural Networks

    Next, we built a more sophisticated system based on Neural Network architectures, one that works like a well-organized corporation. We divided the AI into two stages:

    • Specialized departments: Each type of data (like your brand identity or your creative images) is handled by its own “mini-expert” that decides which details are actually important.
    • The executive board: Once the experts have done their work, a central “manager”, a multi-layer perceptron (MLP), looks at all the reports together to make the final call: Will this campaign succeed?

    In this category, we have experimented with multiple different architectures and techniques. For example, one of our best models mathematically groups elements that “belong” together before making a prediction. If a specific high-energy image consistently drives high success when paired with a young, active audience, the model learns to pull those winning pieces closer. This not only makes the model smarter but also helps us give you much better recommendations for future pairings.
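    To make the two-stage design concrete, here is a minimal numpy sketch of the data flow: per-modality “mini-experts” project each input into a shared space, and an MLP head turns the fused vector into Negative/Average/Positive probabilities. All dimensions and weights here are illustrative stand-ins (random and untrained), not our production architecture.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(0.0, x)

# Stage 1: each modality gets its own small "expert" that projects its raw
# features into a shared 16-dimensional representation space.
# (Input dimensions are illustrative guesses, not real model sizes.)
modality_dims = {"audience": 32, "brand": 768, "creative": 768, "geo": 330}
encoders = {name: rng.normal(0, 0.05, size=(dim, 16)) for name, dim in modality_dims.items()}

def encode_campaign(features):
    """features: dict mapping modality name -> raw feature vector."""
    parts = [relu(features[name] @ W) for name, W in encoders.items()]
    return np.concatenate(parts)  # fused 4 x 16 = 64-dim vector

# Stage 2: an MLP "executive board" maps the fused vector to 3 class scores.
W1 = rng.normal(0, 0.05, size=(64, 32))
W2 = rng.normal(0, 0.05, size=(32, 3))

def predict(features):
    h = relu(encode_campaign(features) @ W1)
    logits = h @ W2
    return np.exp(logits) / np.exp(logits).sum()  # softmax over Neg/Avg/Pos

campaign = {name: rng.normal(size=dim) for name, dim in modality_dims.items()}
probs = predict(campaign)
```

    A trained version of this pattern learns the encoder and MLP weights jointly, which is what lets the fused representation capture cross-modality interactions.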

    3. The language experts: LLMs

    Finally, we tested whether a standard Large Language Model (like the ones used for chatbots) could do the job on its own. Interestingly, we found that “out-of-the-box” AI isn’t naturally great at these specific marketing predictions. However, when we provide specialized training (a process called “fine-tuning”), performance skyrockets, as evidenced by our research: From hype to impact: Predicting campaign performance with fine-tuned LLMs.

    The verdict: Measuring impact

    To evaluate our models and determine how accurately they predict campaign performance, we must first establish a rigorous testing ground. This involves two key components: the diversity of our data and the precision of our metrics.

    Datasets

    To ensure our findings aren’t just a “lucky” outlier, we don’t rely on a single source of information. Instead, we test every model against three different versions of our synthetic datasets. By proving that our models can perform consistently across various simulated environments, we can be confident that their predictive power is both reliable and adaptable to real-world shifts.

    While each of our datasets shares a consistent structure, we have intentionally varied their internal characteristics to put our models through a rigorous stress test. By using our Synthetic Data Generator, we can precisely control three key variables to create progressively more challenging environments:

    • Volume: Testing how the models perform with both limited information and vast amounts of data.
    • Balance: Adjusting the “label distribution”. For example, creating datasets where “average” results are far more common than clear successes or failures, to reflect the reality of a crowded market.
    • Signal strength: Tuning how obvious or subtle the patterns are, which forces the models to work harder to find the winning combinations.

    This approach ensures that our models aren’t just memorizing easy patterns, but are truly learning to find value in complex, “noisy” environments where the right answer isn’t always obvious.
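    As an illustration of these three dials, here is a toy generator (not our actual Synthetic Data Generator) where `n_rows` controls volume, `class_probs` controls label balance, and `noise_rate` weakens signal strength by randomly flipping labels:

```python
import numpy as np

def make_synthetic_campaigns(n_rows, class_probs, noise_rate, n_features=8, seed=0):
    """Toy generator: volume, balance, and signal strength as explicit dials."""
    rng = np.random.default_rng(seed)
    # Balance: labels drawn with a controllable distribution (0=Neg, 1=Avg, 2=Pos).
    labels = rng.choice(3, size=n_rows, p=class_probs)
    # Features cluster around class-specific means, so the signal is learnable.
    X = rng.normal(loc=labels[:, None].astype(float), scale=1.0,
                   size=(n_rows, n_features))
    # Signal strength: flip a fraction of labels to simulate a noisier market.
    flip = rng.random(n_rows) < noise_rate
    labels[flip] = rng.choice(3, size=flip.sum())
    return X, labels

# A "big and slightly noisy" setup: many rows, "Average" dominates, 10% noise.
X, y = make_synthetic_campaigns(50_000, class_probs=[0.2, 0.6, 0.2], noise_rate=0.1)
```

    Because the generator holds the answer key, any model trained on `X` can be scored against the known ground-truth labels `y`.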

    Model performance

    When it comes to measuring performance, we use a standard industry benchmark known as the F1 score, because simple accuracy can be a liar. Imagine you have a box of 100 fruits: 10 are apples and 90 are oranges. You build a robot to grab only the apples. If the robot sits still and does nothing, it is technically “90% accurate” because it correctly ignored the 90 oranges, but it’s a total failure at its job. The F1 score exposes this by balancing two hidden grades:

    • Precision (the “quality” grade): When the robot grabs a fruit and says “Apple,” is it right? High precision means it never accidentally grabs an orange.
    • Recall (the “completeness” grade): Did the robot find all 10 apples, or did it leave some behind? High recall means the robot is thorough and doesn’t miss any.

    The F1 score blends these two into a single number using a harmonic mean. Unlike a normal average, it “punishes” extreme failure: if your robot is perfectly accurate but misses every single apple, its F1 score will be 0. This gives us a much more honest picture of how well a model actually works in the real world. To circle back to our case, we use F1 scores in two ways:

    • The big picture: We report the macro-averaged F1 score (the unweighted mean of the per-class scores) across the entire dataset to show overall model health.
    • Performance by category: We break down results into three specific classes: Negative, Average, and Positive.

    This granular view is where the true business value lies. It allows us to ensure the model excels at the extremes, identifying the “Negative” combinations a marketer should avoid at all costs, and the “Positive” combinations that will truly drive results beyond the status quo.
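    The fruit-robot arithmetic is easy to verify directly. This short sketch computes precision, recall, and F1 for the robot, and shows how a macro-averaged score is formed from per-class F1 values (the per-class numbers here are placeholders, not our results):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall: collapses to 0 if either is 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Scenario A: the robot grabs nothing. Recall is 0, so F1 is 0,
# even though "accuracy" on the whole box of 100 fruits would be 90%.
assert f1(precision=1.0, recall=0.0) == 0.0

# Scenario B: it grabs 8 fruits, 6 of which are apples (10 apples exist).
precision = 6 / 8   # 75% of its grabs were right
recall = 6 / 10     # it found 60% of all apples
score = f1(precision, recall)  # 2/3, i.e. ~0.667

# Macro F1 (the "big picture" number): the unweighted mean of the
# per-class F1 scores for Negative / Average / Positive.
per_class = {"Negative": 0.91, "Average": 0.86, "Positive": 0.91}
macro_f1 = sum(per_class.values()) / len(per_class)
```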

    To bring structure to our innovation, we developed a centralized model Leaderboard. This platform serves as the definitive “source of truth” for our research team, ensuring that every breakthrough is measured against the same rigorous standards. The Leaderboard allows team members to download standardized training and testing splits for any dataset (whether real or synthetic) and submit their results for comparison. By centralizing our findings in one place, we achieve several key advantages:

    • True comparability: We can be certain that we are comparing equals across different algorithms and techniques.
    • Accelerated testing: It allows us to quickly and safely iterate on new ideas without reinventing the wheel.
    • Institutional knowledge: It creates a permanent record of our progress, ensuring that the best-performing models are always visible and ready to be deployed.

    This structured environment is what allows us to move from individual experiments to a scalable, high-efficiency engine for marketing AI.

    With our datasets defined and our Leaderboard in place, we put our models to the ultimate test. By measuring how each approach handled “Negative,” “Average,” and “Positive” campaign outcomes, we can clearly see which strategies offer the most reliable path to success.

    Here is a glimpse of how our top models performed across the board:

    | Dataset                  | Model              | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
    |--------------------------|--------------------|------------|--------|--------|--------|
    | Small and easy           | Tree-based         | 85.11%     | 86.4%  | 80.21% | 88.73% |
    | Small and easy           | Deep Learning (v1) | 86.26%     | 89.81% | 82.51% | 86.44% |
    | Small and easy           | Deep Learning (v2) | 89.47%     | 91.25% | 85.95% | 91.22% |
    | Small and slightly noisy | Tree-based         | 85.54%     | 87.48% | 80.92% | 88.21% |
    | Small and slightly noisy | Deep Learning (v1) | 84.80%     | 89.41% | 80.50% | 84.50% |
    | Small and slightly noisy | Deep Learning (v2) | 88.84%     | 91.13% | 85.26% | 90.14% |
    | Big and slightly noisy   | Tree-based         | 77.40%     | 76.40% | 81.47% | 74.34% |
    | Big and slightly noisy   | Deep Learning (v1) | 80.74%     | 80.88% | 86.10% | 75.25% |
    | Big and slightly noisy   | Deep Learning (v2) | 81.29%     | 82.25% | 84.58% | 77.03% |
    F1 score Performance of Top Models (Tree-based, Deep Learning v1, and Deep Learning v2) Across Varied Datasets

    Analysing the Leaderboard: Reliability at scale

    The results from our testing provide a clear picture of how these models handle real-world complexity.

    Our tree-based model remains a formidable workhorse, maintaining an F1 score above 85%. Most importantly, it demonstrates high accuracy in identifying “Positive” and “Negative” outcomes. This means the model is exceptionally reliable at flagging the two things marketers care about most: which campaigns are likely to be massive successes and which ones are headed for failure. While performance naturally dips as we introduce more noise and scale into the datasets, its baseline remains impressively high.

    While the classic models are strong, both of our Deep Learning approaches consistently take the lead. These models perform better because of their inherent capacity for “relational intelligence”: they can spot the subtle, complex connections that simpler, logic-based systems often miss.

    As the datasets grow larger and the patterns become more “noisy,” this deep understanding becomes a critical advantage. Seeing these models maintain performance above the 80% mark, even in the most challenging scenarios, gives us the confidence that our AI can handle high complexity scenarios.

    The privacy puzzle: Learning without sharing

    While our research shows how powerful these models can be, a significant question remains: How do we build an elite AI that learns from everyone, without exposing anyone’s private data?

    In the traditional world of marketing, building a “super-brain” meant pooling all client data into one giant, central database. In today’s world, that is a massive privacy red flag. We believe you shouldn’t have to choose between competitive intelligence and data security. To solve this, we utilize a cutting-edge approach called Federated Learning (FL).

    Think of Federated Learning like a team of specialized doctors working in different hospitals. To find a cure for a new disease, they don’t send their private patient files to a central office; that would be a breach of trust. Instead, each doctor studies their own patients locally, discovers what works and what doesn’t, and then shares only the “recipe for the cure” with colleagues, never the patients’ identities.

    In our ecosystem, each client trains the AI model locally on their own private data. The “lessons learned” are sent back to our main server, where they are combined to create a smarter, globally-informed model for everyone.
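    For the curious, the core idea can be sketched in a few lines of numpy: a minimal federated-averaging loop in which each “client” runs a local training step on its own data, and only model weights travel back to be averaged. This is a toy linear model for illustration, not our production setup:

```python
import numpy as np

def local_update(weights, X, y, lr=0.1):
    """One gradient step on a client's private data (linear model, squared
    loss). The raw data never leaves this function."""
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(global_weights, clients):
    """Each client trains locally; only the weights are sent back and averaged."""
    updates = [local_update(global_weights.copy(), X, y) for X, y in clients]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):  # three "hospitals", each with its own private dataset
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(2)
for _ in range(100):
    w = federated_round(w, clients)
# w now approximates true_w, even though no raw data was ever pooled.
```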

    The result? You benefit from a model that has “seen” millions of scenarios, yet your private data never leaves your hands.

    Discover more about Federated Learning by reading our post: Training together, sharing nothing: The promise of Federated Learning.

    Looking to the Future: The Next Frontier

    Our journey doesn’t end with a successful prediction. We are already exploring the next horizon of marketing intelligence, moving from understanding the past to actively designing the future. Here is what we are building next:

    1. Infusing data with “common sense”

    What if our models understood human psychology as well as they understand spreadsheets? We are exploring ways to inject the broad, contextual knowledge of LLMs directly into our training data. This gives our models a “common sense” layer, allowing them to understand the subtle cultural and social nuances that drive human behavior. Learn more about it in our post: The uncharted territory: Beyond the known data

    2. AI building AI

    We believe the best architect for a complex model might be the AI itself. Instead of manually designing every layer of a neural network, we are using advanced systems to automatically discover the ultimate model structure for marketing predictions. This process of “digital evolution”, which we delve into further in our AlphaEvolve article, ensures our tech is always one step ahead.

    3. From predictions to proactive recommendations

    We are currently building a tool that doesn’t just predict success, it suggests it. Imagine entering your brand and target audience and having the AI instantly recommend the perfect visual or message. We are perfecting this using two unique methods:

    • The Matchmaker space: Utilizing the “relational map” we built in our earlier phases to instantly pair your audience with the creative assets they are most likely to love.
    • The “hot or cold” optimisation: We treat our AI like a high-precision compass. If you have a brand and an audience but are missing the right “creative,” the system rapidly tests thousands of variations. It plays a high-speed game of “hot or cold” until it locks onto the highest possible performance score.

    By moving from educated guesswork to advanced, multimodal AI, we are finally bridging the gap between creative intuition and measurable results. The rocket is fueled, the coordinates are set, and the launch sequence has officially begun.

    Ready to explore the specifics? Read our full technical deep dive into Multimodal Fusion Models for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.

  • Data Enrichment Pod

    Data Enrichment Pod

    Machine Learning excels at identifying historical patterns, but its predictive power often hits a wall when faced with entirely new products or shifting audience behaviors where data is scarce. We investigated if Large Language Models (LLMs) could bridge this gap by augmenting our proprietary campaign data with LLM-generated insights, employing Hybrid Graph creation and Active Learning strategies. However, integrating generic LLM knowledge often introduced label noise, either significantly degrading performance or yielding only a negligible 1% improvement. This demonstrates that high-fidelity, proprietary data remains our strongest predictive asset, and future AI integration requires domain-specific fine-tuning rather than relying on the broad knowledge of off-the-shelf LLMs.

    If you don’t care about the technical details, read our blog post instead.

    Can LLMs predict campaign success? Using hybrid data techniques to bridge data gaps

    The core question driving our research is whether Large Language Models (LLMs) can meaningfully improve the prediction of real-world campaign performance.

    Can LLMs provide the latent domain knowledge necessary to bridge gaps in existing campaign data without causing signal dilution?

    To test this, we developed and compared two distinct methodologies:

    • Hybrid Graph construction: We extracted high-confidence relationships from our real campaign data and used an LLM to “fill in the blanks,” creating a unified knowledge graph. This graph was then used to generate augmented datasets for model training.
    • Active Learning with an LLM oracle: We implemented a targeted feedback loop where our model identifies its own “areas of confusion”, which are subspaces where it lacks confidence. We then used an LLM as an expert “oracle” to provide specialised labels for these high-value points, strategically refining the training set.
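    As a sketch of the second strategy, the model’s “areas of confusion” can be located by ranking unlabelled points by predictive entropy; the call to the LLM oracle itself is omitted here, and the probabilities below are made-up illustrations:

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of each row of class probabilities: higher = more confused."""
    p = np.clip(probs, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def select_for_oracle(probs, k):
    """Pick the k most uncertain points to send to the LLM 'oracle' for labels."""
    entropy = predictive_entropy(probs)
    return np.argsort(entropy)[-k:]

# Model outputs over 4 unlabelled campaigns (Neg / Avg / Pos probabilities).
probs = np.array([
    [0.90, 0.05, 0.05],  # confident
    [0.34, 0.33, 0.33],  # confused -> worth an oracle label
    [0.05, 0.90, 0.05],  # confident
    [0.40, 0.35, 0.25],  # somewhat confused
])
chosen = select_for_oracle(probs, k=2)  # selects rows 1 and 3
```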

    Datasets & preparation

    The dataset

    The foundation of our analysis is a dataset extracted from real-world campaign data provided by WPP. This data tracks the daily performance of various brand campaigns, capturing core metrics like clicks, conversions, and views.

    Beyond performance, the dataset includes granular data for each campaign, such as:

    • Targeting: Audience segments and geographic locations.
    • Environment: Platform (e.g., Social, Search) and device type.
    • Strategy: The specific objective of the campaign and delivery settings.

    To manage the complexity of the dataset and ensure our models capture the relationships between different types of information, we organised our features into modalities. Rather than treating every column as a flat list, we grouped related variables into distinct “pieces of the puzzle”. The primary modalities we defined include:

    • Audience: gender, age group, generation, interests, education (major and schools), industries, work (positions and employers), behaviours, relationship statuses and life events. Custom audiences and interests explicitly excluded from targeting are also available.
    • Brand
    • Creative: image or video used in the campaign
    • Geo: geographic markers ranging from countries to zip codes.
    • Platform: platform, placement and device
    • Objective: delivery setting of the campaign

    Initially, we intended to include creative-level data. However, we found a high volume of missing values in this field. To maintain a larger, more robust sample size for the models, we decided to omit creatives from this iteration of the project.

    Data labelling

    We assigned each campaign a performance label (Positive, Average, or Negative), to serve as our ground truth. While we generated separate label sets for clicks and conversions to maintain flexibility, we chose engagement as the primary driver for our current classification tasks.

    To ensure these labels reflect true performance rather than just the size of the budget, we moved away from raw totals. Instead, our labelling methodology relies on two key factors:

    • Efficiency weighting: All metrics are weighted by spend. This allows us to identify campaigns that are over-indexed on performance relative to their cost.
    • Quantile distribution: Labels are assigned based on where a campaign falls within the broader distribution of the dataset. This categorical approach helps the model distinguish between “high performers” and “underperformers” within a normalised context.
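    A minimal pandas sketch of this labelling idea follows; the efficiency metric, quantile cut-points, and numbers are illustrative stand-ins, not our production thresholds:

```python
import pandas as pd

df = pd.DataFrame({
    "campaign": ["a", "b", "c", "d", "e", "f"],
    "clicks":   [120, 300, 45, 900, 60, 150],
    "spend":    [100, 600, 30, 500, 120, 100],
})

# Efficiency weighting: performance relative to cost, not raw totals,
# so big budgets alone can't buy a "Positive" label.
df["clicks_per_spend"] = df["clicks"] / df["spend"]

# Quantile distribution: bottom 25% -> Negative, top 25% -> Positive,
# everything in between -> Average (cut-points are illustrative).
df["label"] = pd.qcut(
    df["clicks_per_spend"],
    q=[0.0, 0.25, 0.75, 1.0],
    labels=["Negative", "Average", "Positive"],
)
```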

    For a deeper dive into the specific weighting formulas and the statistical thresholds used for these quantiles, you can refer to the Campaign Intelligence Data Pod.

    Data preprocessing

    To ensure our research remained focused and the results highly relevant, we narrowed our scope specifically to video campaigns. Video data offers a rich set of engagement metrics and distinct delivery patterns that differ significantly from static display or search ads.

    Before training and testing our models, we perform a preprocessing workflow. Raw campaign data is rarely model-ready; it requires specific transformations to turn disparate data points into a structured format suitable for encoding. To maintain the modular structure of our data, we process each modality by concatenating its core attributes into unified strings or arrays. This approach helps the model understand the context of each data “piece” as a whole.

    Modality refinement

    • Audience: We narrowed our focus to Gender, Generation, and Interests/Attributes. Within “Interests and Attributes,” we bundled together specific fields like education, industry, and job titles. We intentionally excluded other audience categories that either lacked clear business value or suffered from high rates of missing values.
    • Platform: This modality combines the platform (e.g., Instagram), the placement (e.g., Stories), and the device (e.g., Android Mobile) into a single, descriptive feature.
    • Objective: We concatenated all delivery settings and optimisation goals associated with a campaign to capture the intended strategy.
    • Geo: We focused on the country level. In cases where country data was missing but region or city data was present, we used a custom mapping tool to retrieve and fill in the corresponding country.
    • Brand: This includes specific advertiser identifiers and category markers.

    Aggregation

    Once the columns are selected and formatted, we perform a final aggregation to eliminate any duplicate entries. In cases where multiple entries exist for the same campaign parameters, we take the mean label to ensure the target variable accurately reflects the consensus performance.
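    This aggregation step can be sketched in pandas, with labels encoded numerically (0 = Negative, 1 = Average, 2 = Positive) so that a mean is well-defined:

```python
import pandas as pd

df = pd.DataFrame({
    "audience": ["GenZ", "GenZ", "GenZ", "Millennial"],
    "platform": ["Instagram", "Instagram", "Instagram", "Search"],
    # Labels encoded numerically: 0 = Negative, 1 = Average, 2 = Positive.
    "label":    [2, 1, 2, 0],
})

# Duplicate campaign parameters collapse to a single row whose target is the
# mean label, reflecting the consensus performance for that combination.
agg = df.groupby(["audience", "platform"], as_index=False)["label"].mean()
```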

    Dataset analysis and exploration

    After completing the preprocessing and aggregation pipeline, our final dataset comprises 668,871 rows. This represents a diverse cross-section of global advertising, covering 339 distinct brands across 157 countries.

    With a dataset of this size, there is a risk that a model might “cheat” by memorising specific feature values that happen to correlate with a label, rather than learning the underlying relationships between variables. For example, if a specific platform consistently shows a “Positive” label simply due to how data was collected, the model might become biased toward that platform regardless of the actual campaign performance.

    To prevent this, we conducted a distribution analysis:

    • Label distribution per feature: We plotted the distribution of labels (Positive, Average, Negative) against individual feature values.
    • Bias detection: We specifically looked for “leaky” features or skewed distributions that would give the model an unfair hint.
    • Verification: By ensuring that labels are relatively balanced across our main categories, we confirm that the model must rely on the interaction of multiple modalities, rather than a single biased variable, to make an accurate prediction.

    This step ensures that the resulting insights are based on genuine performance drivers rather than statistical noise or data collection artefacts.

    Representative visualisations of the label distribution for some of the main features:

    Label distribution by Gender
    Label distribution by Country
    Label distribution by Age generation
    Label distribution by Platform

    Encoding inputs

    Since our processed dataset is largely composed of text, we face a fundamental challenge: machine learning models cannot process raw language. To bridge this gap, we must convert these text fields into a numerical format that the algorithm can interpret.

    For this project, we are using the text-embedding-005 model (part of the Google Gemini family) to handle all text-based fields.

    Rather than using simple “one-hot encoding” (which treats words as isolated categories), an embedding model transforms text into a high-dimensional vector. This process captures the semantic relationship between different values. For example, it allows the model to “understand” that “Instagram Stories” and “Facebook Reels” are more similar to each other than they are to “Desktop Search,” even if the raw text doesn’t match exactly.

    The steps taken to get the encoding are:

    • Input: The concatenated strings from our modalities (e.g., “Meta | Mobile | Stories”).
    • Transformation: The embedding model maps these strings into a fixed-length numerical array, tuned for the task at hand (semantic similarity).
    • Result: These vectors serve as the “features” that our model uses to identify patterns and predict campaign performance.

    By using a state-of-the-art embedding model, we ensure that the nuances of our campaign metadata are preserved, providing the predictive model with a rich, context-aware starting point.
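    The difference from one-hot encoding can be seen with a toy cosine-similarity comparison. The “embedding” vectors below are hand-crafted for illustration only; they are not real text-embedding-005 outputs:

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# One-hot encoding: every category is an isolated axis, so any two distinct
# values have similarity 0 -- there is no notion of "related" placements.
stories_onehot = np.array([1.0, 0.0, 0.0])   # "Instagram Stories"
reels_onehot   = np.array([0.0, 1.0, 0.0])   # "Facebook Reels"
search_onehot  = np.array([0.0, 0.0, 1.0])   # "Desktop Search"

# Toy embedding vectors: semantically related placements land close together
# in the vector space, unrelated ones far apart.
stories_emb = np.array([0.9, 0.8, 0.1])
reels_emb   = np.array([0.8, 0.9, 0.2])
search_emb  = np.array([0.1, 0.2, 0.9])

assert cosine(stories_onehot, reels_onehot) == 0.0
assert cosine(stories_emb, reels_emb) > cosine(stories_emb, search_emb)
```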

    Methodology

    To test our hypotheses, we explored two distinct approaches. Our goal was to determine if LLMs could effectively augment real-world performance data to fill gaps in the campaign landscape.

    Strategy I: Hybrid Graph construction

    To model the complex relationships in our data, we built a signed undirected graph. In this structure, nodes represent specific campaign attributes (like “Female” or “Instagram”), and edges represent the relationships between them. These edges are signed: a positive edge indicates a successful performance link, while a negative edge indicates an underperforming one.
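    A minimal way to represent such a graph in plain Python (node names are illustrative) is an edge map keyed by unordered node pairs, with the sign stored as the value:

```python
# Signed undirected graph: edges keyed by an unordered pair of attribute
# nodes; +1 marks a successful performance link, -1 an underperforming one.
edges = {}

def add_edge(u, v, sign):
    edges[frozenset((u, v))] = sign   # frozenset makes the edge undirected

add_edge("brand:BrandB", "platform:Instagram", +1)
add_edge("audience:GenZ,Female", "platform:Instagram", +1)
add_edge("brand:BrandB", "geo:Germany", -1)

def sign_of(u, v):
    return edges.get(frozenset((u, v)))  # None if no relationship is known

# Lookups are order-free, as expected for an undirected graph.
assert sign_of("platform:Instagram", "brand:BrandB") == +1
```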

    Node selection

    The challenge was defining these nodes so they could seamlessly accommodate both real-world campaign results and synthetic insights from an LLM. We achieved this through a three-step process:

    1. Strategic node definition

    We mapped distinct entities from our dataset into graph nodes using a mix of aggregation and expansion:

    • Direct mapping: Standard categories like Brand, Country, Device, and Platform were assigned individual nodes.
    • Informed profiles: We merged Gender and Generation into single nodes (e.g., “Gen Z, Female”) to create more robust audience profiles.
    • Granular objectives: While our earlier preprocessing concatenated delivery settings into one string, we exploded these back into individual components for the graph. This allows the model to capture fine-grained relationships between specific strategies (like bidding types) and performance.

    2. Modality trimming for efficiency

    While it’s tempting to include every data point, we had to balance detail with computational reality. Generating LLM-based edges for every possible combination would be incredibly time-consuming.

    Furthermore, we found that LLMs struggle with highly “niche” technical fields, such as specific targeting relaxation settings. To maintain accuracy and efficiency, we trimmed our focus to high-impact modalities with strong business relevance, including gender & generation, interests & other attributes, brand, geo, platform, device, campaign buying type, campaign objective, media buy billing event, media buy cost model and brand safety content filter levels.

    3. Managing high-dimensionality (clustering)

    Certain modalities, particularly Audience Interests, contained thousands of unique values (e.g., specific job titles or education majors). To prevent the graph from becoming unmanageable, we used K-Means clustering to group these into a controlled number of clusters.

    Because automated clustering can sometimes group items based on superficial traits like the language the text was written in, we used an LLM to name these clusters and performed a manual review. This ensured that a cluster actually represented a coherent segment (e.g., “Tech Professionals”) rather than just a collection of similar-looking strings.
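    Here is a sketch of this clustering step using scikit-learn’s KMeans on toy 2-D “embedding” vectors; the real pipeline clusters high-dimensional text embeddings of thousands of interest values, and the cluster names would then come from an LLM plus manual review:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy embedding vectors for audience-interest values: two well-separated
# groups standing in for e.g. "Tech Professionals" and "Sports & Fitness".
rng = np.random.default_rng(0)
tech = rng.normal(loc=[1.0, 0.0], scale=0.1, size=(20, 2))
sport = rng.normal(loc=[0.0, 1.0], scale=0.1, size=(20, 2))
X = np.vstack([tech, sport])

# Group thousands of unique values into a controlled number of clusters.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
```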

    Edge selection

    Once the nodes are defined, the next step is determining the edges between them. We focus on pre-selected pairs of modalities that carry the most business impact. For each pair, we follow a three-step sequential pipeline to decide which relationships are strong enough to enter our hybrid graph.

    Phase 1: High-confidence direct edge promotion

    First, we look at the real-world dataset to find undisputed patterns. We aggregate the data by modality pairs (e.g., Platform and Brand) and count the occurrences of each performance label. To ensure we only promote the most reliable relationships, we use two filtering metrics:

    • Cardinality: This represents the raw volume of evidence for the most frequent performance label. By requiring a large number of occurrences for the winning label, we ensure that the relationship is backed by a significant sample size rather than a few lucky instances.
    • Purity: This measures the strength of the consensus. It is the percentage of times the winning label appears out of all total observations for that specific combination. A high purity (e.g. >= 85%) ensures the data isn’t split or “noisy,” showing a clear, dominant trend.

    If a combination meets both thresholds, it is directly promoted to the graph. We promote edges with “Positive” or “Negative” outcomes only, discarding “Average” labels to keep the graph’s signals clear.
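The purity/cardinality filter above can be sketched with pandas. The data and the cardinality threshold here are illustrative (the experiments used purity ≥ 85% and cardinality ≥ 20):

```python
# Sketch of the Phase 1 edge-promotion filter over (platform, brand, label)
# observations. The frame below is toy data; thresholds are illustrative.
import pandas as pd

df = pd.DataFrame({
    "platform": ["Meta"] * 6 + ["TikTok"] * 4,
    "brand":    ["A"] * 6 + ["B"] * 4,
    "label":    ["Positive"] * 6 + ["Positive", "Negative", "Average", "Negative"],
})

PURITY_MIN, CARDINALITY_MIN = 0.85, 5

edges = []
for (platform, brand), group in df.groupby(["platform", "brand"]):
    counts = group["label"].value_counts()
    winner, cardinality = counts.index[0], int(counts.iloc[0])
    purity = cardinality / len(group)
    # Promote only clear Positive/Negative edges; "Average" is discarded.
    if purity >= PURITY_MIN and cardinality >= CARDINALITY_MIN and winner != "Average":
        edges.append((platform, brand, winner))

print(edges)  # only the unambiguous (Meta, A, Positive) edge survives
```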

    Phase 2: LLM-validated edges

Next, we address the “middle ground”: relationships where a pattern exists in the data but isn’t quite strong enough for direct promotion. We experimented with multiple thresholds, ultimately settling on purity ≥ 55% as our selection criterion.

    For these edges, we use an LLM as a tiebreaker. We present the candidate edge to the LLM and ask for its opinion on whether the combination is good or bad. If the LLM’s response aligns with our data’s majority label, the edge is promoted. This step acts as a sanity check, using the LLM’s broader context to confirm subtle trends found in our dataset.
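The tiebreaker logic can be sketched as below. The LLM call is mocked by a stub function (a real pipeline would query Gemini), and the function names are hypothetical:

```python
# Sketch of the Phase 2 tiebreaker: a "middle ground" edge is promoted
# only if the (mocked) LLM opinion agrees with the data's majority label.
def mock_llm_opinion(modality_a, modality_b):
    # Stand-in for an actual LLM judgement ("Positive" or "Negative").
    return "Positive"

def validate_edge(modality_a, modality_b, majority_label, purity):
    # Only the middle ground (55% <= purity < 85%) is sent to the LLM;
    # stronger edges were already promoted directly in Phase 1.
    if not 0.55 <= purity < 0.85:
        return False
    return mock_llm_opinion(modality_a, modality_b) == majority_label

print(validate_edge("Instagram", "BrandX", "Positive", purity=0.62))  # True
print(validate_edge("Instagram", "BrandX", "Negative", purity=0.62))  # False
```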

    Phase 3: Synthetic edges

    The final step involves edges that are either entirely missing from our dataset or are highly uncertain (Purity < 55%). Since the total number of possible combinations is massive, we use a strategic sampling approach:

    • Graph density: We sample a subset of possible combinations based on a predefined density parameter.
    • Proportional correction: If certain modalities had low acceptance rates in the previous two steps, we increase their representation in this sample to ensure the graph remains balanced.

    For every sampled edge, we ask the LLM to predict the sign (Positive/Negative). These purely synthetic insights are then added to the graph.
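The density-controlled sampling of unseen combinations can be sketched like this. The feature values are toys; the 60% density mirrors the configuration reported later:

```python
# Sketch of the Phase 3 sampling step: draw a density-controlled subset
# of the (platform, brand) combinations absent from the real data, to be
# labelled by the LLM. Values and density are illustrative.
import itertools
import random

platforms = ["Meta", "TikTok", "YouTube"]
brands = ["A", "B", "C", "D"]
observed = {("Meta", "A"), ("TikTok", "B")}

missing = [p for p in itertools.product(platforms, brands) if p not in observed]

DENSITY = 0.6
rng = random.Random(0)
sampled = rng.sample(missing, k=round(DENSITY * len(missing)))

# Each sampled pair would be sent to the LLM to predict Positive/Negative.
print(len(missing), len(sampled))
```

A proportional-correction step would then up-weight modalities that were under-represented after Phases 1 and 2.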

    Hybrid dataset generation

    By the end of this process, we have a Hybrid Graph that blends high-certainty real-world data, data-driven patterns validated by AI, and purely synthetic logical bridges.

    This allows us to produce a robust, augmented dataset that we can use to train our final models, evaluating whether this hybrid methodology improves predictive performance compared to using raw data alone.

    While the Hybrid Graph uses LLMs to label sparse, low-signal regions, it is fundamentally a passive data augmentation strategy. This prompted a shift in our objective: rather than arbitrarily imputing missing data, how can we optimise our query strategy to sample the most informative instances? Identifying the specific data points in the feature space that yield the highest expected information gain forms the basis of our second approach: Active Learning.

    Strategy II: Active Learning

    Beyond the static graph, we implemented an Active Learning loop to optimise a targeted knowledge expansion. Instead of labelling and training on a massive, randomly selected dataset, this paradigm identifies the specific “subspaces” where the model is weakest: areas of high uncertainty. The core philosophy is efficiency: by querying an “oracle” (in our case, an LLM) to label only these high-value points, we can significantly improve the model’s performance across the entire distribution using far fewer examples.

    To find these high-value candidates, we first had to define a “Domain”, a delimited space of valid feature combinations. To keep this search space manageable and ensure the data remained realistic, we simplified our feature set:

    • Geography: We restricted candidates to single-country rows.
    • Interests: We used the semantic clusters created for our graph rather than the raw, high-volume interest list.
    • Demographics: To mirror real-world targeting, we converted generations into booleans (e.g., Millennial: Yes/No), allowing multiple generations to coexist. Gender was standardised to three specific nodes: Female, Male, and All.
    • Platform & Device: These were separated into distinct, individual attributes.
    • Objective: We treated the campaign objective as a single, unified block. This prevented the system from generating illogical combinations, such as pairing a “Link Clicks” objective with an incompatible cost model like “Page Engagement”.

    By narrowing the domain this way, we ensured the LLM oracle was only labelling campaign scenarios that could actually exist in a real-world media plan. To determine the most effective way to refine our model, we tested three distinct strategies for selecting and labelling data points.
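The restricted domain can be sketched as a constrained Cartesian product. All feature names and values below are illustrative; the key idea is that objectives are kept as unified blocks so incompatible objective/cost-model pairs are never generated:

```python
# Sketch of the restricted candidate domain: generations as booleans,
# gender as three fixed nodes, and each objective bundled with a
# compatible billing event so illogical pairings cannot occur.
import itertools

countries = ["ES", "GR"]                      # single-country rows only
genders = ["Female", "Male", "All"]
millennial = [True, False]                    # boolean generation flags
gen_z = [True, False]
objective_blocks = [("Link Clicks", "CPC"), ("Reach", "CPM")]

domain = [
    {"country": c, "gender": g, "millennial": m, "gen_z": z,
     "objective": o, "billing": b}
    for c, g, m, z, (o, b) in itertools.product(
        countries, genders, millennial, gen_z, objective_blocks)
]
print(len(domain))  # 2 * 3 * 2 * 2 * 2 = 48 valid candidates
```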

    Method 1: Library-based optimisation (BoFire)

    Initially, we attempted to leverage BoFire, a library designed for Bayesian optimisation. While it offers robust strategies for active learning, we encountered significant implementation barriers that made it impractical for our specific use case:

    • Memory overhead: The library’s core strategies consumed excessive memory, which did not scale with our dataset size.
    • Categorical constraints: Applying constraints to categorical features proved difficult outside of purely random approaches.
    • Scalability issues: Even lighter strategies (like SOBO) paired with Random Forest backbones consistently triggered Out-Of-Memory (OOM) errors.

    These limitations led us to pivot toward custom implementations.

    Method 2: Random Pool strategy

    To move past these library constraints, we developed a custom Random Pool approach. This method focused on broad exploration of the campaign space:

    • Candidate generation: We created a pool of synthetic candidates by randomly combining feature values from our existing data domain.
    • Uncertainty sampling: After training a model on our baseline data, we used it to predict the performance of these synthetic candidates.
    • Selection & labelling: We identified the top n candidates where the model showed the highest uncertainty. These were then sent to our Google Gemini oracle to be labelled and integrated into the training set.
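The uncertainty-sampling core of the Random Pool strategy can be sketched as follows. The classifier and data are stand-ins for the production model and campaign features, and the pool size is scaled down for the example:

```python
# Sketch of Random Pool uncertainty sampling: rank synthetic candidates
# by the Shannon entropy of the model's predicted class distribution and
# keep the top-n for the oracle. Model and data are toy stand-ins.
import numpy as np
from scipy.stats import entropy
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 4))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

# Random pool of synthetic candidates drawn from the feature domain.
pool = rng.normal(size=(1000, 4))
probs = model.predict_proba(pool)

# Higher entropy = more uncertain prediction = more valuable to label.
uncertainty = entropy(probs.T)            # one value per candidate
top_n = np.argsort(uncertainty)[::-1][:75]  # indices to send to the oracle
```

In the actual experiments the pool held 16 million candidates and the top n = 75,000 were sent to the Gemini oracle.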

    Method 3: Boundary Point exploration

    While the Random Pool offered broad coverage, generating and predicting millions of combinations is computationally expensive. To maximise the value of every oracle query, we transitioned to the “Boundary Point Exploration” strategy. This method targets the “edges” between performance classes, the exact points where the model’s decision-making is most fragile.

    The workflow includes the following steps:

    • Initialisation: We begin by training a model on the current dataset and randomly selecting reference rows from each class (Positive, Negative, Average), as starting points.
    • Perturbation: We attempt to “flip” a data point from one class to another. For example, starting with a “Positive” campaign, we iteratively replace its feature values with those from a “Negative” campaign.
    • Boundary detection: We query the model at every step of this transformation. The moment the prediction flips (e.g., from Positive to Negative), we identify that specific configuration as a boundary point: a data point lying directly on the model’s decision margin.
    • Uncertainty sorting: We calculate the Shannon entropy for these boundary points to isolate the top K examples with the highest uncertainty.
    • Contextual labelling (Few-Shot oracle): To ensure high-quality labels for these tricky boundary cases, we use a Few-Shot Learning prompt. We provide the Gemini oracle with the m closest real-world rows from our dataset for each class as context. By seeing these real examples and their ground-truth outcomes, the LLM can calibrate its answers to our specific data distribution.
    • The loop: These high-value, context-aware candidates are integrated into the training set. This cycle repeats until we reach our total labelling budget n.
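The perturbation and boundary-detection steps can be sketched as below, using a tree-based model as the auxiliary classifier (consistent with the approach, though the data and feature space here are toys; the entropy ranking and few-shot oracle steps are omitted):

```python
# Sketch of Boundary Point Exploration: morph a "Positive" row toward a
# "Negative" row one feature at a time, and record the configuration at
# which the model's prediction flips. Toy binary setup for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # 1 = "Positive", 0 = "Negative"
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

def find_boundary_point(model, source, target):
    """Replace `source` features with `target`'s one by one; return the
    first configuration whose predicted class flips, else None."""
    start_class = model.predict(source.reshape(1, -1))[0]
    candidate = source.copy()
    for j in range(len(source)):
        candidate[j] = target[j]
        if model.predict(candidate.reshape(1, -1))[0] != start_class:
            return candidate
    return None

# Pick reference rows the model itself classifies correctly.
preds = model.predict(X)
positive = X[(y == 1) & (preds == 1)][0]
negative = X[(y == 0) & (preds == 0)][0]
boundary = find_boundary_point(model, positive, negative)
# `boundary` lies on the decision margin; in the full loop such points are
# entropy-ranked and sent to the few-shot oracle for labelling.
```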

    Experiments

    To validate our methodologies, we compared two high-performing architectures from our previous research: LightGBM (LGBM), a tree-based gradient boosting model, and a custom Deep Learning (v2) model.

    Our experimentation setup included:

    • Validation: We used a standard 80/20 train-test split, ensuring a 20% holdout for final evaluation on unseen data.
    • Hyper-parameter tuning: We used Optuna for automated hyper-parameter tuning. By keeping the same tuning parameters as our previous benchmarks (e.g., learning rate and tree depth), we maintained a controlled environment that had already proven effective.

    A detailed configuration per methodology is provided below.

    Hybrid Graph configuration

    • High-confidence edges: An edge is promoted directly when purity ≥ 85% and cardinality ≥ 20.
    • LLM-validated edges: An edge is sent to the LLM for final approval when purity ≥ 55% and cardinality ≥ 5.
    • Graph density: 60%
    • LLM batch size: 40

    For all relevant experiments, we use the graph with the aforementioned configuration and separate them by the number of rows requested from the generator: experiment HG1 uses the 13k-row dataset and experiment HG2 the 95k-row dataset.

    Active Learning configuration

    Experiment AL1: Random Pool experiments

    • Scale: Due to the need for representative density, we generated a massive pool of 16 million candidates. From this, we extracted the top n = 75,000 instances with the highest uncertainty.
    • Oracle Model: Gemini 2.5 Flash Lite (Standard Prompt).

    Experiment AL2: Boundary Points experiments

    • Budget: We capped the additional data points at n = 50,000.
    • Batch size: The model was retrained after every batch of k = 100 new points to incrementally adjust the decision boundary. The pool of tipping points was set to 2,000.
    • Oracle model: Gemini 2.5 Flash Lite (Few-Shot Prompt). This configuration leveraged the contextual examples (m = 10 per class) described in the methodology to handle the nuance of boundary cases.

    Results

    Baseline

    To measure the true impact of our hybrid methodologies, we first established a performance benchmark using two distinct approaches, evaluating against the 20% holdout test set:

    • Model-based baseline: We trained our candidate architectures (LightGBM and Deep Learning v2) exclusively on real-world campaign data.
    • LLM-only baseline: For comparison, we tested the LLM’s raw predictive capability. Instead of training a classifier, we passed campaign features directly to the model using both zero-shot and few-shot prompts to see how well its internal knowledge aligns with our specific advertising domain.
    | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
    | --- | --- | --- | --- | --- |
    | Tree-based | 60% | 54% | 72% | 55% |
    | Deep Learning | 67% | 60% | 79% | 63% |
    | LLM (regular prompt) | 24% | 20% | 47% | 29% |
    | LLM (few-shot prompt) | 42% | 30% | 58% | 37% |
    Performance Benchmark of Model-Based Baselines (Tree-based, Deep Learning) and LLM-Only Approaches (Zero-shot and Few-shot) on Real-World Campaign Data, measured by F1 Scores on a 20% Holdout Test Set.

    Given the highly imbalanced nature of our dataset, we used Macro F1 as our primary evaluation metric. Our baseline experiments yielded the following results:

    • Primary baseline: The Deep Learning (v2) architecture established our strongest benchmark at 67% Macro F1, outperforming the tree-based LightGBM approach.
    • LLM raw performance: Direct prompting of the LLM resulted in significantly lower scores, ranging from 24% to 42%. This performance gap suggests that while LLMs possess broad semantic understanding, they lack the specialised, latent domain knowledge required for precise campaign forecasting.
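Our choice of Macro F1 over accuracy-like metrics can be illustrated with a toy scikit-learn example (labels below are invented, not the campaign data): micro-averaging rewards a model that always predicts the majority class, while macro-averaging exposes the failure on the rare class.

```python
# Why Macro F1 for imbalanced data: it averages per-class F1 equally, so
# a rare class cannot be masked by a dominant one. Toy labels only.
from sklearn.metrics import f1_score

y_true = ["Pos"] * 8 + ["Neg"] * 2
y_pred = ["Pos"] * 10  # degenerate model: always predicts the majority

print(f1_score(y_true, y_pred, average="micro"))                   # 0.8
print(f1_score(y_true, y_pred, average="macro", zero_division=0))  # ~0.44
```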

    With the 67% Macro F1 mark established as the target to beat, we evaluated our two LLM-enhanced methodologies: the Hybrid Graph and the Active Learning.

    Using the Deep Learning model’s results, we performed a per-feature error analysis to confirm that performance is relatively balanced across our main categories.

    Figures: error analysis by gender, country, age generation, and platform.

    Hybrid Graph results

    Building a Hybrid Graph requires a careful calibration between our internal proprietary data and external LLM-generated features. The goal is to find an equilibrium where synthetic insights enhance, rather than overwhelm, the real-world signals.

    To test this, we experimented with varying graph densities, to see if a denser “knowledge web” would improve the model’s ability to separate performance classes. We observed a few key constraints during this phase:

    • Data integrity: At lower densities, it was computationally impossible to generate a synthetic dataset without introducing significant “noise” from the dataset generator.
    • The 60% threshold: We ultimately selected a 60% graph density. While this is relatively high, it struck the best balance: it provided enough connectivity for the model to learn complex relationships while still allowing for “missing edges,” mirroring the gaps and uncertainties found in real-world advertising environments.
    | Experiment | Variant | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
    | --- | --- | --- | --- | --- | --- | --- |
    | HG1.1 | 13k rows, 60% graph density | Tree-based | 37% | 51% | 10% | 50% |
    | HG1.2 | 13k rows, 60% graph density | Deep Learning | 17% | 11% | 3% | 37% |
    | HG2.1 | 95k rows, 60% graph density | Tree-based | 42% | 57% | 10% | 60% |
    | HG2.2 | 95k rows, 60% graph density | Deep Learning | 24% | 14% | 8% | 35% |
    Performance of Hybrid Graph experiment variants on real-world campaign data by model, measured by F1 scores.

    Despite the theoretical potential of the Hybrid Graph, the experimental results showed a severe performance degradation across all model architectures. Even when scaling the synthetic dataset to 95,000 rows, Macro F1 scores plummeted to between 17% and 42%, well below our 67% baseline.

    Our analysis identified two primary drivers for this decline:

    1. Data starvation (Deep Learning)

    The Deep Learning (v2) models performed the worst in this environment, with scores dropping to 17%-24%. Neural networks require significant data volume to converge on stable embeddings. The 13k-95k row range proved insufficient for the model to find meaningful patterns within the high-dimensional vector space, leading to poor representation and unstable predictions.

    2. Signal dilution (Tree-Based & overall)

    While the LightGBM models were more sample-efficient, managing slightly higher scores of 37% to 42%, they still failed to compete with the real-world baseline. We attribute this to how gradient boosting algorithms function:

    • The split problem: These models rely on identifying “hard” feature splits that represent true marketing correlations.
    • Generalised noise: By introducing LLM-generated graph edges, we essentially replaced nuanced, proprietary data with generalised, public-domain assumptions.
    • Generalisation failure: The tree models were forced to split on this “artificial noise,” which lacked the orthogonal predictive value needed to improve the model.

    Ultimately, attempting to blend this specific graph structure with real data simply reverted the model toward the baseline performance, confirming that the LLM-generated features did not provide the specialised insights required for a performance lift.

    Active Learning results

    Moving away from broad feature generation, we tested our Active Learning loops (Broad Point Search vs. Targeted Boundary Search). The goal was to find the optimal empirical blending ratio (the “Golden Ratio”) of real-world data to LLM-labelled data.

    | Experiment | Variant | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
    | --- | --- | --- | --- | --- | --- | --- |
    | AL1.1 | Broad Point Search, real data 60% / LLM data 40% | Deep Learning | 65% | 57% | 78% | 60% |
    | AL1.2 | Broad Point Search, real data 70% / LLM data 30% | Deep Learning | 66% | 59% | 77% | 62% |
    | AL1.3 | Broad Point Search, real data 80% / LLM data 20% | Deep Learning | 66% | 59% | 77% | 62% |
    | AL2.1 | Targeted Point Search (regular prompt), real data 78% / LLM data 21% | Deep Learning | 66% | 59% | 79% | 61% |
    | AL2.1 | Targeted Point Search (few-shot prompt), real data 78% / LLM data 21% | Deep Learning | 68% | 60% | 80% | 63% |
    Performance of Active Learning experiment variants on real-world campaign data by model, measured by F1 scores.

    Our empirical tuning revealed that an approximate 80/20 ratio (Real to LLM data) provided the most stability, anchoring the model in proven historical data while allowing for minor decision-boundary exploration.

    The Targeted Point Search (Boundary Exploration) utilising a Few-Shot oracle prompt (Experiment AL2.1) achieved the highest Overall F1 at 68%, successfully surpassing our pure real-world baseline (67%) by a narrow margin.

    This confirms what the baseline benchmarking suggested: the LLM lacks native understanding of our specific dataset and its ambiguous “grey areas”. Incorporating a few-shot prompt simply feeds the LLM the same signals already present in the real data. While this overlap acts as a stabilising force that prevents signal dilution, it also explains why the LLM’s additive value is so minimal, resulting in merely a 1% lift over our established 67% baseline.

    Limitations & research constraints

    While our methodologies provided a framework for augmenting campaign data, we identified several critical limitations that define the boundaries of our current findings.

    The effectiveness of the Hybrid Graph is inherently tied to the alignment between proprietary data and external AI knowledge.

    • Response quality & divergence: The graph’s integrity depends on the LLM’s ability to generate high-quality, contextually accurate edges. We observed significant “disagreement” where LLM-generated logic conflicted with our real-world performance logs.
    • Pairwise simplification: Our architecture assumes that performance can be deconstructed into pairwise relationships between modalities (e.g., Platform vs. Geo). In reality, campaign success is often the result of complex, high-order interactions across all features simultaneously, which a simple graph structure may struggle to capture.

    In our Active Learning loops, the model is only as good as its teacher.

    • Label reliability: We are limited by the oracle’s (Gemini) ability to accurately label synthetic campaign combinations. If the LLM provides a hallucinated label for a complex boundary case, it introduces noise directly into the training set.
    • Weighted parity: Currently, our training pipeline treats synthetic labels from the LLM with the same weight as ground-truth labels from real campaigns. This assumes the LLM’s “reasoning” is as valid as an actual historical conversion, which may not always hold true in a shifting media landscape.

    Finally, and perhaps most significantly, there is the absence of creative data. While we have robust metadata regarding audiences, platforms, and other modalities, we lack the visual and auditory features of the ads themselves. In modern digital advertising, the “creative” is often the single most influential driver of engagement; omitting it means our models are operating without a primary piece of the performance puzzle.

    Future work & next steps

    To build on these findings and address the current limitations, our future research will focus on three primary areas:

    • Integrating creative intelligence: We aim to incorporate creative-level data back into the model. Since creative execution is often the strongest driver of campaign success, using vision-language models to extract features from ad imagery and copy will likely provide the “missing link” in our predictive accuracy.
    • Specialised oracles: Rather than using a general-purpose LLM, we plan to test a fine-tuned model specifically trained on historical marketing performance data. This specialised oracle would be used both to provide higher-quality labels in the Active Learning loop and to generate more accurate, domain-specific edges for the Hybrid Graph.
    • Real-world active learning: We intend to bridge the gap between simulation and reality by moving beyond a synthetic oracle. By implementing a live feedback loop, we can identify high-uncertainty campaign configurations, deploy them in real-world media environments, and use the actual performance results to update and refine our model in real-time.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.

  • The uncharted territory: Beyond the known data

    Our prior research has confirmed a fundamental truth: Machine Learning is exceptionally good at finding patterns in existing data. By analysing thousands of past campaigns, these models identify the threads of success and can predict outcomes for similar strategies with high reliability. This is an incredibly powerful tool for optimising what we already know.

    However, the real-world often presents us with a different challenge: The Unknown.

    • The data gap: While digital marketing generates vast amounts of information, it is rarely “clean” or perfectly integrated. Furthermore, when launching a new product or entering an entirely new market, historical data is often scarce or non-existent.
    • The novelty gap: Traditional Machine Learning excels at spotting correlations, but it can struggle with the unprecedented. What happens when a novel creative concept emerges, or a sudden social trend shifts audience behaviour overnight? Because the model hasn’t “seen” these shifts in the past, it may lack the context to predict the future.

    Bridging the gap: Can AI enhance our data?

    At the heart of our latest research is a fundamental question:

    What happens when we merge our proprietary data with the vast, world-level knowledge of a Large Language Model (LLM)?

    While “adding AI” is a popular trend, real business value isn’t a given. We set out to discover if an LLM acts as a true force multiplier that fills in missing pieces, or if it simply repeats what we already know, or worse, introduces “noise” that clouds our judgment. To find the answer, we tested two distinct strategies to enhance our historical campaign data:

    • Hybrid Graph creation: We build a digital “web” that connects our internal campaign facts with the LLM’s external context. This allows us to map out relationships between brands and audiences that our internal data alone might have missed.
    • Active Learning: Think of this as a focused “tutoring” session. We use AI to identify the most confusing parts of our data. By putting an LLM in the loop to address these specific gaps, the model learns exactly where it can provide the most clarity.

    By testing these methodologies, we are aiming to answer a critical, industry-defining question: Is the secret to superior performance simply more LLM integration, or does the true value still reside in the expert knowledge of marketing professionals that WPP has established over the years?

    The path forward: Testing the synergy

    To determine if AI-driven insights translate into real-world business value, we put two distinct methodologies to the test: Hybrid Graph creation and Active Learning. Each approach ensures that the LLM isn’t just a passive observer, but an active contributor to our strategy.

    To test this synergy, we utilised a specialised export from WPP’s proprietary dataset, ensuring anonymity and privacy. This data captures the full lifecycle of a campaign, including information on audience characteristics, geographical location and platform-specific delivery settings. Crucially, each entry includes a definitive label indicating the campaign’s objective and its final outcome, allowing us to measure success with high precision. For a deeper dive into the architecture and specific variables of this dataset, please refer to the Campaign Intelligence Dataset Pod.

    Strategy I: Hybrid Graph creation – beyond the spreadsheet

    Instead of looking at data as a simple list, we treat it as a dynamic relationship map. Imagine this network as a constellation where every individual data point, whether it’s a demographic like ‘woman,’ a platform like ‘Facebook,’ or a region like ‘Spain’, becomes a node. These nodes are interconnected by lines that represent the strength and nature of their relationships.

    To visualise this concept, here is an example of how this relationship network is mapped in our Hybrid Graph:

    Hybrid Graph example: each node is an attribute from the real dataset. Nodes are connected positively (green) if they perform well together and negatively (red) otherwise; unconnected nodes indicate an average or non-existent relationship.

    By combining our internal campaign history with the LLM’s broader world knowledge, we create a “Hybrid Graph”. With this multi-layered map, we expect to be able to see connections that traditional spreadsheets ignore and the process of building it unfolds in three strategic phases:

    Phase 1: Establishing the ground truth

    Before we ask an AI for help, we perform a “Deep Dive” into our historical data to separate coincidence from repeatable success through two rigorous tests:

    • The consistency test (purity): We look for a clear “verdict”. For example, if a specific audience and brand pairing resulted in a positive outcome 85% of the time, we have a reliable pattern. The data is giving us a clear “Yes”.
    • The volume test (cardinality): Consistency only matters if it happens often. If a positive connection appears consistently across a large number of campaigns, we know it’s a statistically significant trend, not just a stroke of luck.

    By filtering through these lenses, we identify the bedrock of our dataset.

    Phase 2: The expert second opinion

    Next, we turn to the “Emerging Patterns”: combinations where the data shows a clear leaning (like a 60% success rate), but the evidence isn’t yet overwhelming. Historically, these “maybe” scenarios might have been ignored. Now, we invite the LLM to act as a Strategic Consultant.

    • The power of the upvote: When the LLM’s intuition aligns with our data’s hints, we gain a new level of confidence. For example, if a location and brand pairing resulted in a positive outcome 60% of the time, we suspect there is a pattern, but we need the LLM to confirm.
    • Validation through synergy: By getting a “Yes” from the AI to back up our data, we move these patterns from the “maybe” pile into our active knowledge base.

    Phase 3: Illuminating the “dark spots”

    Finally, we shift our focus to the areas our data couldn’t reach, the “dark spots”. These are combinations that were excluded because our data was too noisy or the scenarios were entirely new.

    We identify every combination where we currently lack confidence. However, this is a vast number of combinations, making it computationally infeasible to check all of them. That’s why we sample these gaps and ask the LLM for original insights based on its understanding of global markets. To give an example, the cases we’re targeting here look like: a specific audience and location pairing that is non-existent in the dataset, or a combination that results in a positive outcome half the time and a negative outcome the other half. Because of the lack of a clear pattern in the real data, we directly ask the LLM.

    In short, this allows us, through the LLM, to clarify noisy data and predict outcomes for entirely new scenarios.

    The challenge of illuminating our data “dark spots” led to a major breakthrough. Our first approach, the Hybrid Graph, essentially looks at low-signal campaign data and uses LLMs to make an educated guess to fill in the gaps. But this sparked a bigger, more strategic question: Instead of just guessing what’s in the dark, how can we actively hunt for the exact pieces of missing information that will make our predictions smarter? Out of millions of possible campaign combinations, how do we pinpoint the specific scenarios that will teach our model the most? This strategy of “smart hunting” forms the foundation of our second approach: Active Learning.

    Phase 4: Combining everything together

    The culmination of this research is a unified Hybrid Graph. By merging our proven history, our validated suspicions and our newly discovered insights, we create a living map of intelligence.

    The result is a specialised dataset that is expected to offer the best of both worlds:

    • The grounding of reality: Rooted in the hard facts of our actual campaign history.
    • The foresight of AI: Enhanced by the vast, contextual knowledge of the LLM.

    Strategy II: Active Learning – solving the puzzle of uncertainty

    Where the Hybrid Graph fills gaps with new insights, Active Learning focuses on a different truth: data isn’t always helpful if it’s redundant. To truly advance our models, we don’t need more of what we already know; we need clarity in the “grey areas” of our knowledge.

    For example, imagine our data clearly shows that “TikTok campaigns” aimed at “Gen Z” are consistently successful, while “LinkedIn campaigns” aimed at “Millennials” usually underperform. But what happens if we want to run a “TikTok campaign” for “Millennials”? The model might be completely unsure if there is no clear pattern for that specific combination. Instead of analysing thousands more Gen Z campaigns we already understand, Active Learning specifically targets this exact missing combination. By resolving this one grey area, the model learns whether the platform or the audience age is the true driver of performance.

    In the world of data, this uncertainty occurs when there isn’t a strong, consistent signal: when parts of a dataset tell conflicting stories, making it difficult to separate real patterns from mere noise.

    In modern marketing, the number of possible combinations between audiences, brands and locations is astronomical; blindly analysing every single one would be incredibly slow, if not impossible. Instead, we use Active Learning as a strategic guide to identify the specific “pockets” of a dataset where our current models are struggling the most. It sifts through the records and picks only the most confusing, yet valuable, points for evaluation.

    By focusing our efforts strictly on the areas where the model is most uncertain, we achieve two major goals:

    • Maximised intelligence: We gain the most knowledge from the fewest possible data points.
    • Operational speed: We bypass the “noise” of what we already know, allowing us to build high-performing models in a fraction of the time.

    Ultimately, this approach turns a daunting, “infinite” dataset into a manageable, high-impact asset.

    The LLM as our “oracle”

    Identifying the most uncertain points in our data is only half the battle; the real value lies in what we do with them. Once we have selected these high-priority “grey areas,” we bring in the LLM to act as our oracle.

    Using sophisticated prompting techniques, we present these uncertain points to the AI for a professional verdict. Our goal is to transform these pockets of doubt into certainty, backed by high-quality expert information.

    By doing this, we effectively bridge the “information gap”. We aren’t just adding more data for the sake of volume; we are harvesting targeted knowledge. This process turns a previously unknown variable into a strategic asset, ensuring that our final model isn’t just a reflection of what we’ve seen before, but a fusion of our experience and the AI’s broader market expertise.

    Two paths to higher intelligence

    To find the most efficient way to “teach” our models, we experimented with several strategies for choosing which questions to ask our LLM oracle. Below, we outline our core foundational technique and the more advanced method that has proven most effective to date.


    Approach 1: The broad search

    This is a high-level “scouting” mission. We create a large pool of random potential campaign scenarios and ask our current model to predict how they would perform. We then identify the scenarios where the model is the most confused, the “shaky” predictions, and send those directly to the LLM oracle for a definitive answer. It’s a fast, effective way to shore up general weaknesses in our knowledge.
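The broad search can be sketched in a few lines. This is a simplified illustration, not the production pipeline: `sample_scenario` and `model_predict_proba` are stand-ins for the real scenario generator and the current model, and the toy 2-D "model" below is confident everywhere except near a 0.5 decision line.

```python
import numpy as np

rng = np.random.default_rng(0)

def broad_search(model_predict_proba, sample_scenario, pool_size=1000, k=10):
    """Broad 'scouting' pass (sketch): sample a large random pool of
    candidate campaign scenarios, score them with the current model,
    and return the k scenarios with the least-confident predictions."""
    pool = [sample_scenario() for _ in range(pool_size)]
    probs = model_predict_proba(pool)         # (pool_size, n_classes)
    confidence = probs.max(axis=1)            # top-class probability
    shakiest = np.argsort(confidence)[:k]     # least confident first
    return [pool[i] for i in shakiest]        # -> send to the LLM oracle

# Toy stand-ins: scenarios are 2-D points; the 'model' is confident
# whenever the first feature is far from the 0.5 decision line.
def sample_scenario():
    return rng.random(2)

def model_predict_proba(pool):
    p = np.clip(np.array(pool)[:, 0], 0.01, 0.99)
    return np.column_stack([p, 1 - p])

queries = broad_search(model_predict_proba, sample_scenario, pool_size=200, k=5)
# every selected scenario sits close to the 0.5 decision line
```

The key property is that the random pool costs nothing to generate; only the handful of "shaky" survivors are sent to the (expensive) oracle.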

    Approach 2: The targeted stress test (our top performer)

    Our most successful approach is much more surgical. Instead of looking at random scenarios, we actively look for the “tipping points”, the exact moment a campaign shifts from being a success to a failure, or vice versa.

    • Finding the edge: We take a known successful campaign and a known failure, then subtly blend their features to create a new, “borderline” scenario.
    • Measuring confusion: We keep adjusting the features until a pre-trained auxiliary model (in this case, a tree-based one) flips its prediction. We then rank and select the scenarios where the outcome is most uncertain, ensuring we capture the most informative data points for our oracle to review.
    • The expert verdict: We present these precise “tipping points” to the LLM oracle. By giving the AI specific examples of similar successes and failures as context, we get an incredibly high-quality label.
    • Iterative learning: Once the LLM provides the answers for these “grey areas,” we integrate them into our official records. We then retrain our auxiliary model on this newly enriched dataset, making it instantly more precise. From there, the process begins again, creating a continuous loop that proactively hunts for and eliminates our model’s remaining blind spots.

    The Active Learning Loop with the four main phases: training on existing data, finding the tipping points, labelling them using an LLM, and finally adding them to the existing data to start the loop over.

    By repeating this process, we don’t just add data; we specifically “fix” the model’s most significant blind spots. This iterative loop ensures that our final engine isn’t just bigger, but significantly smarter.
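The "finding the edge" step above can be sketched as a simple interpolation search. This is a minimal illustration under stated assumptions: features are blended linearly (a simplification of the feature blending described above), and `predict` stands in for the pre-trained auxiliary tree model; the toy threshold model and example rows are invented for demonstration.

```python
import numpy as np

def find_tipping_point(predict, success, failure, steps=50):
    """Blend a known success with a known failure and return the first
    'borderline' scenario where the auxiliary model's prediction flips."""
    base_label = predict(success)
    for t in np.linspace(0.0, 1.0, steps):
        candidate = (1 - t) * success + t * failure
        if predict(candidate) != base_label:
            return candidate       # -> send to the LLM oracle for a label
    return None                    # no flip found along this path

# Toy auxiliary model: class 1 (success) if the feature sum exceeds 1
predict = lambda x: int(x.sum() > 1.0)

success = np.array([0.9, 0.8])    # predicted class 1
failure = np.array([0.1, 0.1])    # predicted class 0
tip = find_tipping_point(predict, success, failure)
# `tip` lands almost exactly on the model's decision boundary
```

Ranking many such candidates by how close they sit to the boundary (as in the "measuring confusion" bullet) then picks the most informative ones for oracle review.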

    Results

    The balancing act: Extracting the final datasets

    Building a Hybrid Graph is a delicate exercise in calibration. Our challenge was to find the perfect equilibrium: How much should we trust our internal data and how much “weight” should we give to the LLM’s external knowledge?

    To test this, we generated several different graph versions, eventually selecting the largest and most robust one. This ensured our Synthetic Data Generator had a dense enough “knowledge web” to create high-quality, non-random datasets. To keep our findings clear, we kept environmental “noise” to a minimum, ensuring we were testing the core intelligence of the graph itself.

    Similarly, when building the datasets to test our Active Learning strategies, we had to find the right blend of human experience and AI insight. After testing multiple configurations, we discovered our “Golden Ratio” was in the region of 80% Real-World Data and 20% LLM Knowledge. This 80/20 balance proved to be our most effective setting. It ensures the model remains firmly grounded in the proven reality of WPP’s historical success, while still allowing enough “AI intuition” to fill in the gaps and explore new strategic frontiers.
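Mechanically, hitting the 80/20 blend is just a matter of subsampling the LLM-labelled pool relative to the real-world rows. A minimal sketch, with illustrative array stand-ins for the two datasets:

```python
import numpy as np

def blend_datasets(real, llm, llm_share=0.20, seed=0):
    """Mix real-world rows with LLM-labelled rows so the LLM pool
    contributes roughly `llm_share` of the final dataset (the 80/20
    'Golden Ratio' described above). `real` and `llm` are row arrays."""
    rng = np.random.default_rng(seed)
    # rows needed so that llm rows / total rows == llm_share
    n_llm = int(len(real) * llm_share / (1 - llm_share))
    picked = rng.choice(len(llm), size=min(n_llm, len(llm)), replace=False)
    return np.vstack([real, llm[picked]])

real = np.zeros((80, 3))   # stand-in for real-world feature rows
llm = np.ones((50, 3))     # stand-in for LLM-labelled rows
mixed = blend_datasets(real, llm)   # 80 real rows + 20 LLM rows
```

Sampling without replacement keeps each LLM-labelled row unique, so the blend ratio, not duplication, controls how much "AI intuition" enters the training set.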

    The reality check: Lessons from the data

    To evaluate the results, we ran a “head-to-head” test. We trained one model using only real-world data and another one using our LLM-enhanced hybrid dataset. We then tested both against a “holdout” set of real campaign results.

    Here are the results of our models, trained on the real dataset and tested against the holdout:

    | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
    |---|---|---|---|---|
    | Tree-based | 60% | 54% | 72% | 55% |
    | Deep Learning | 67% | 60% | 79% | 63% |

    Models’ Baseline Performance (trained and tested on real-world data)
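For reference, the per-class and overall scores in the table are F1 scores, with the overall figure taken as the unweighted (macro) mean across classes. A stdlib-only sketch of the computation; the label arrays below are illustrative, not real campaign outcomes:

```python
def f1_per_class(y_true, y_pred, classes):
    """Per-class F1 = 2*TP / (2*TP + FP + FN), computed from scratch."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores[c] = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return scores

# 0 = Negative, 1 = Average, 2 = Positive (illustrative labels only)
y_true = [0, 0, 1, 1, 1, 2, 2, 2, 1, 0]
y_pred = [0, 1, 1, 1, 2, 2, 2, 1, 1, 0]

per_class = f1_per_class(y_true, y_pred, [0, 1, 2])
overall = sum(per_class.values()) / 3    # macro F1
```

The macro average weights the rare "Negative" and "Positive" classes equally with the dominant "Average" class, which is why the class imbalance discussed below drags the overall score down.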

    Building on the foundations of our previous research (From guesswork to foresight: How AI is predicting the future of marketing campaigns), we transitioned our models from a controlled synthetic environment to the complexities of 100% real-world campaign data.

    Our standard models, which previously proved their strength in synthetic testing, delivered a highly competitive baseline. This “Reality Benchmark” set a high bar, while simultaneously identifying clear opportunities for our LLM-based techniques to add value.

    The results revealed a clear trend: while the models excelled at identifying “Average” campaigns, they struggled to pinpoint the extreme “Positive” or “Negative” outliers. This is a common phenomenon in real-world marketing. Unlike our controlled synthetic environments, where we can perfectly balance the ratios, real-world data is heavily weighted toward “average” outcomes. Exceptional successes and disasters are rare, making them significantly harder for a model to learn and predict.

    Within this context, Deep Learning (v2) emerged as our strongest baseline, achieving a solid 67% overall F1 score. The Tree-Based Approach performed slightly below the Deep Learning architecture, reinforcing our decision to move toward more “relational” neural networks to navigate the noise and imbalance of complex marketing datasets.

    By establishing this 67% mark as our “Line in the Sand,” we can clearly measure the true impact of our Hybrid Graph and Active Learning interventions. Here is how our LLM-enhanced methodologies performed:

    | Method | Variant | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
    |---|---|---|---|---|---|---|
    | Hybrid Graph | 13k rows, 60% density | Tree-based | 37% | 51% | 10% | 50% |
    | Hybrid Graph | 13k rows, 60% density | Deep Learning | 17% | 11% | 2% | 37% |
    | Hybrid Graph | 90k rows, 60% density | Tree-based | 42% | 57% | 10% | 60% |
    | Hybrid Graph | 90k rows, 60% density | Deep Learning | 24% | 14% | 8% | 35% |
    | Active Learning | Broad Point Search (Real 80%, LLM 20%) | Deep Learning | 66% | 59% | 77% | 62% |
    | Active Learning | Targeted Point Search (Real 78%, LLM 21%) | Deep Learning | 68% | 60% | 80% | 63% |

    Hybrid Graph & Active Learning Performance: comparison of the different models and experiments by metric.

    The findings in the table above were unexpected, but deeply insightful: we saw a significant drop in performance when the LLM was added to the loop for the Hybrid Graph, and almost no improvement with Active Learning.

    The hybrid graph challenge: A significant divergence

    The most striking finding was the performance of the Hybrid Graph. Despite increasing the data volume to 90k rows, the scores dropped significantly, falling to between 17% and 42% overall F1.

    This drop reveals a fundamental truth: generic LLMs are trained on public domain knowledge. They lack the specialised, proprietary marketing intelligence that WPP possesses. By weaving general AI “intuition” into a specialised graph, we introduced noise that actively diluted the high-quality signals of our real-world data.

    Even with a denser graph, our models struggled to maintain a consistent F1 score. This shows that marketing success hinges on the niche, proprietary data unique to our field, information that simply isn’t available in the public sphere, rather than on a larger volume of generic data.

    Active Learning: Reaching the efficiency frontier

    In contrast, our Active Learning strategies, specifically the Targeted Point Search, successfully met the benchmark. Using our “Golden Ratio” (78% Real / 21% LLM), the Targeted Point Search achieved a 68% F1 score, slightly outperforming our best real-world baseline.

    While our Targeted Point Search allowed us to maintain performance levels comparable to our 100% real-world baseline, we have to be honest: we expected a more significant leap. To justify a process of this complexity, the “performance lift” needs to be undeniable. This brings us to two critical, strategic questions:

    • The quality risk: For such a marginal improvement in accuracy, is it worth introducing external AI “intuition” into our proprietary ecosystem when we cannot be 100% certain of its quality?
    • The computational cost: Does the slight increase in predictive power justify the high computational expense and the mathematical difficulty of hunting for these “tipping points”?

    In its current state, the answer is a cautious “No”. While the technology is fascinating, the results prove that our internal, high-fidelity data is already doing the heavy lifting. Introducing expensive, public-model “noise” for a 1% gain doesn’t just challenge our efficiency, it risks diluting the “Gold Standard” intelligence that WPP already possesses.

    The strategic conclusion: Expert-led AI

    Our research serves as a powerful reminder that AI is a force multiplier, not a replacement. The performance drop we saw with the Hybrid Graph Dataset underlines the immense competitive advantage of WPP’s proprietary data; generic models simply cannot replicate the “niche” intelligence we already possess.

    While Active Learning was able to match our 67% baseline, “matching” the status quo is not enough to justify the hype or the computational cost.

    The core insight: Data quality is the ultimate moat

    This research proves a fundamental truth: Data quality is everything. A generic AI cannot replace the deep, specialised expertise of a marketing professional. The failure of the “public” LLM to improve our results demonstrates that the real path to success lies in keeping our experts in the loop. By using high-fidelity, professional strategy rather than general internet trends, we ensure our models are learning from the best in the business.

    Moving forward: From baseline to breakthrough

    To bridge the gap between “adequate” and “exceptional,” we have identified two clear technical paths to evolve this research:

    1. The fine-tuned oracle: Our current experiments used “off-the-shelf” LLMs. To truly elevate the results, the next logical step is to use fine-tuned models: AIs that have been specifically trained on WPP’s historical successes and internal playbooks. This transforms the oracle from a generalist into a marketing specialist.
    2. Real-world Active Learning: The ultimate validation of Active Learning isn’t a digital oracle; it’s the market itself. A real-time loop can use Active Learning to identify high-potential “blind spots,” launch those as live test campaigns, and then feed that real-world performance back into our models. This moves us from theoretical testing to real-world evolution.

    Ready to explore the specifics? Read our full technical deep dive into Data Enrichment Pod for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.