Data Enrichment Pod

Machine Learning excels at identifying historical patterns, but its predictive power often hits a wall when faced with entirely new products or shifting audience behaviours where data is scarce. We investigated whether Large Language Models (LLMs) could bridge this gap by augmenting our proprietary campaign data with LLM-generated insights, employing Hybrid Graph creation and Active Learning strategies. However, integrating generic LLM knowledge often introduced label noise, either significantly degrading performance or yielding only a negligible 1% improvement. This demonstrates that high-fidelity, proprietary data remains our strongest predictive asset, and that future AI integration requires domain-specific fine-tuning rather than reliance on the broad knowledge of off-the-shelf LLMs.

If you don’t care about the technical details, read our blog post instead.

Can LLMs predict campaign success? Using hybrid data techniques to bridge data gaps

The core question driving our research is whether Large Language Models (LLMs) can meaningfully improve the prediction of real-world campaign performance.

Can LLMs provide the latent domain knowledge necessary to bridge gaps in existing campaign data without causing signal dilution?

To test this, we developed and compared two distinct methodologies:

  • Hybrid Graph construction: We extracted high-confidence relationships from our real campaign data and used an LLM to “fill in the blanks,” creating a unified knowledge graph. This graph was then used to generate augmented datasets for model training.
  • Active Learning with an LLM oracle: We implemented a targeted feedback loop where our model identifies its own “areas of confusion”, which are subspaces where it lacks confidence. We then used an LLM as an expert “oracle” to provide specialised labels for these high-value points, strategically refining the training set.

Datasets & preparation

The dataset

The foundation of our analysis is a dataset extracted from real-world campaign data provided by WPP. This data tracks the daily performance of various brand campaigns, capturing core metrics like clicks, conversions, and views.

Beyond performance, the dataset includes granular data for each campaign, such as:

  • Targeting: Audience segments and geographic locations.
  • Environment: Platform (e.g., Social, Search) and device type.
  • Strategy: The specific objective of the campaign and delivery settings.

To manage the complexity of the dataset and ensure our models capture the relationships between different types of information, we organised our features into modalities. Rather than treating every column as a flat list, we grouped related variables into distinct “pieces of the puzzle”. The primary modalities we defined include:

  • Audience: gender, age group, generation, interests, education (major and schools), industries, work (positions and employers), behaviours, relationship statuses and life events. Custom audiences and interests explicitly excluded from targeting are also available.
  • Brand: advertiser identifiers and category markers
  • Creative: image or video used in the campaign
  • Geo: geographic markers ranging from countries to zip codes.
  • Platform: platform, placement and device
  • Objective: delivery setting of the campaign

Initially, we intended to include creative-level data. However, we found a high volume of missing values in this field. To maintain a larger, more robust sample size for the models, we decided to omit creatives from this iteration of the project.

Data labelling

We assigned each campaign a performance label (Positive, Average, or Negative) to serve as our ground truth. While we generated separate label sets for clicks and conversions to maintain flexibility, we chose engagement as the primary driver for our current classification tasks.

To ensure these labels reflect true performance rather than just the size of the budget, we moved away from raw totals. Instead, our labelling methodology relies on two key factors:

  • Efficiency weighting: All metrics are weighted by spend. This allows us to identify campaigns that are over-indexed on performance relative to their cost.
  • Quantile distribution: Labels are assigned based on where a campaign falls within the broader distribution of the dataset. This categorical approach helps the model distinguish between “high performers” and “underperformers” within a normalised context.
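As a minimal sketch of this labelling scheme (the field names and the 30%/70% quantile cut-offs below are illustrative assumptions, not the production formula from the Campaign Intelligence Data Pod):

```python
def label_campaigns(campaigns, low_q=0.30, high_q=0.70):
    """Assign Positive/Average/Negative labels from spend-weighted engagement.

    `campaigns` is a list of dicts with 'clicks' and 'spend'.  The quantile
    cut-offs are illustrative; the real thresholds live in the Campaign
    Intelligence Data Pod.
    """
    # Efficiency weighting: engagement relative to cost, not raw totals.
    scores = [c["clicks"] / c["spend"] for c in campaigns]

    # Quantile distribution: a campaign's position within the dataset.
    ranked = sorted(scores)
    low = ranked[int(low_q * (len(ranked) - 1))]
    high = ranked[int(high_q * (len(ranked) - 1))]

    labels = []
    for s in scores:
        if s <= low:
            labels.append("Negative")
        elif s >= high:
            labels.append("Positive")
        else:
            labels.append("Average")
    return labels
```

A campaign with a small budget but strong click efficiency can therefore out-rank a big spender, which is exactly the behaviour the weighting is designed to reward.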

For a deeper dive into the specific weighting formulas and the statistical thresholds used for these quantiles, you can refer to the Campaign Intelligence Data Pod.

Data preprocessing

To ensure our research remained focused and the results highly relevant, we narrowed our scope specifically to video campaigns. Video data offers a rich set of engagement metrics and distinct delivery patterns that differ significantly from static display or search ads.

Before training and testing our models, we perform a preprocessing workflow. Raw campaign data is rarely model-ready; it requires specific transformations to turn disparate data points into a structured format suitable for encoding. To maintain the modular structure of our data, we process each modality by concatenating its core attributes into unified strings or arrays. This approach helps the model understand the context of each data “piece” as a whole.

Modality refinement

  • Audience: We narrowed our focus to Gender, Generation, and Interests/Attributes. Within “Interests and Attributes,” we bundled together specific fields like education, industry, and job titles. We intentionally excluded other audience categories that either lacked clear business value or suffered from high rates of missing values.
  • Platform: This modality combines the platform (e.g., Instagram), the placement (e.g., Stories), and the device (e.g., Android Mobile) into a single, descriptive feature.
  • Objective: We concatenated all delivery settings and optimisation goals associated with a campaign to capture the intended strategy.
  • Geo: We focused on the country level. In cases where country data was missing but region or city data was present, we used a custom mapping tool to retrieve and fill in the corresponding country.
  • Brand: This includes specific advertiser identifiers and category markers.

Aggregation

Once the columns are selected and formatted, we perform a final aggregation to eliminate any duplicate entries. In cases where multiple entries exist for the same campaign parameters, we take the mean label to ensure the target variable accurately reflects the consensus performance.
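The concatenation and aggregation steps above can be sketched as follows (the column names are illustrative stand-ins for the modality attributes described earlier, and labels are assumed to be encoded as 0/1/2):

```python
from collections import defaultdict

def build_rows(records):
    """Concatenate each modality's attributes into a single descriptive string,
    then aggregate duplicate campaign parameters by taking the mean label.

    Field names are illustrative; the real pipeline uses the modality
    columns described above.
    """
    grouped = defaultdict(list)
    for r in records:
        # One unified string per modality, e.g. "Meta | Stories | Mobile".
        platform = " | ".join([r["platform"], r["placement"], r["device"]])
        audience = " | ".join([r["gender"], r["generation"]])
        key = (platform, audience, r["country"], r["brand"])
        grouped[key].append(r["label"])

    # Mean label over duplicates, so the target reflects consensus performance.
    return [
        {"features": key, "label": sum(labels) / len(labels)}
        for key, labels in grouped.items()
    ]
```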

Dataset analysis and exploration

After completing the preprocessing and aggregation pipeline, our final dataset comprises 668,871 rows. This represents a diverse cross-section of global advertising, covering 339 distinct brands across 157 countries.

With a dataset of this size, there is a risk that a model might “cheat” by memorising specific feature values that happen to correlate with a label, rather than learning the underlying relationships between variables. For example, if a specific platform consistently shows a “Positive” label simply due to how data was collected, the model might become biased toward that platform regardless of the actual campaign performance.

To prevent this, we conducted a distribution analysis:

  • Label distribution per feature: We plotted the distribution of labels (Positive, Average, Negative) against individual feature values.
  • Bias detection: We specifically looked for “leaky” features or skewed distributions that would give the model an unfair hint.
  • Verification: By ensuring that labels are relatively balanced across our main categories, we confirm that the model must rely on the interaction of multiple modalities, rather than a single biased variable, to make an accurate prediction.

This step ensures that the resulting insights are based on genuine performance drivers rather than statistical noise or data collection artefacts.
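A minimal sketch of the per-feature distribution check (the 90% dominance threshold for flagging a "leaky" value is our own illustrative choice):

```python
from collections import Counter, defaultdict

def label_distribution(rows, feature):
    """Per-value label distribution for one feature, used to spot values
    whose labels are heavily skewed."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[feature]][r["label"]] += 1
    return {
        value: {label: n / sum(c.values()) for label, n in c.items()}
        for value, c in counts.items()
    }

def leaky_values(dist, threshold=0.9):
    """Flag feature values where a single label dominates (illustrative cut-off)."""
    return [v for v, probs in dist.items() if max(probs.values()) >= threshold]
```

Running this over each main feature gives the balance check described above: any flagged value would let the model "cheat" on that feature alone.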

Representative visualisations of the label distribution for some of the main features:

Label distribution by Gender
Label distribution by Country
Label distribution by Age generation
Label distribution by Platform

Encoding inputs

Since our processed dataset is largely composed of text, we face a fundamental challenge: machine learning models cannot process raw language. To bridge this gap, we must convert these text fields into a numerical format that the algorithm can interpret.

For this project, we are using the text-embedding-005 model (part of the Google Gemini family) to handle all text-based fields.

Rather than using simple “one-hot encoding” (which treats words as isolated categories), an embedding model transforms text into a high-dimensional vector. This process captures the semantic relationship between different values. For example, it allows the model to “understand” that “Instagram Stories” and “Facebook Reels” are more similar to each other than they are to “Desktop Search,” even if the raw text doesn’t match exactly.

The steps taken to get the encoding are:

  • Input: The concatenated strings from our modalities (e.g., “Meta | Mobile | Stories”).
  • Transformation: The embedding model maps each string into a fixed-length numerical array, configured for the semantic-similarity task.
  • Result: These vectors serve as the “features” that our model uses to identify patterns and predict campaign performance.

By using a state-of-the-art embedding model, we ensure that the nuances of our campaign metadata are preserved, providing the predictive model with a rich, context-aware starting point.
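To make the similarity intuition concrete, here is a minimal sketch using toy 3-dimensional vectors in place of the real model's high-dimensional embeddings (the numbers are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for the real embeddings of these three placements.
instagram_stories = [0.9, 0.1, 0.0]
facebook_reels = [0.8, 0.2, 0.1]
desktop_search = [0.0, 0.1, 0.95]
```

With real embeddings, the first two placements land close together in vector space while "Desktop Search" sits far away, which is the property one-hot encoding cannot provide.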

Methodology

To test our hypotheses, we explored two distinct approaches. Our goal was to determine if LLMs could effectively augment real-world performance data to fill gaps in the campaign landscape.

Strategy I: Hybrid Graph construction

To model the complex relationships in our data, we built a signed undirected graph. In this structure, nodes represent specific campaign attributes (like “Female” or “Instagram”), and edges represent the relationships between them. These edges are signed: a positive edge indicates a successful performance link, while a negative edge indicates an underperforming one.

Node selection

The challenge was defining these nodes so they could seamlessly accommodate both real-world campaign results and synthetic insights from an LLM. We achieved this through a three-step process:

1. Strategic node definition

We mapped distinct entities from our dataset into graph nodes using a mix of aggregation and expansion:

  • Direct mapping: Standard categories like Brand, Country, Device, and Platform were assigned individual nodes.
  • Informed profiles: We merged Gender and Generation into single nodes (e.g., “Gen Z, Female”) to create more robust audience profiles.
  • Granular objectives: While our earlier preprocessing concatenated delivery settings into one string, we exploded these back into individual components for the graph. This allows the model to capture fine-grained relationships between specific strategies (like bidding types) and performance.

2. Modality trimming for efficiency

While it’s tempting to include every data point, we had to balance detail with computational reality. Generating LLM-based edges for every possible combination would be incredibly time-consuming.

Furthermore, we found that LLMs struggle with highly “niche” technical fields, such as specific targeting relaxation settings. To maintain accuracy and efficiency, we trimmed our focus to high-impact modalities with strong business relevance, including gender & generation, interests & other attributes, brand, geo, platform, device, campaign buying type, campaign objective, media buy billing event, media buy cost model and brand safety content filter levels.

3. Managing high-dimensionality (clustering)

Certain modalities, particularly Audience Interests, contained thousands of unique values (e.g., specific job titles or education majors). To prevent the graph from becoming unmanageable, we used K-Means clustering to group these into a controlled number of clusters.

Because automated clustering can sometimes group items based on superficial traits like the language the text was written in, we used an LLM to name these clusters and performed a manual review. This ensured that a cluster actually represented a coherent segment (e.g., “Tech Professionals”) rather than just a collection of similar-looking strings.

Edge selection

Once the nodes are defined, the next step is determining the edges between them. We focus on pre-selected pairs of modalities that carry the most business impact. For each pair, we follow a three-step sequential pipeline to decide which relationships are strong enough to enter our hybrid graph.

Phase 1: High-confidence direct edge promotion

First, we look at the real-world dataset to find undisputed patterns. We aggregate the data by modality pairs (e.g., Platform and Brand) and count the occurrences of each performance label. To ensure we only promote the most reliable relationships, we use two filtering metrics:

  • Cardinality: This represents the raw volume of evidence for the most frequent performance label. By requiring a large number of occurrences for the winning label, we ensure that the relationship is backed by a significant sample size rather than a few lucky instances.
  • Purity: This measures the strength of the consensus. It is the percentage of times the winning label appears out of all total observations for that specific combination. A high purity (e.g. >= 85%) ensures the data isn’t split or “noisy,” showing a clear, dominant trend.

If a combination meets both thresholds, it is directly promoted to the graph. We promote edges with “Positive” or “Negative” outcomes only, discarding “Average” labels to keep the graph’s signals clear.
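The Phase 1 promotion rule can be sketched as follows, using the thresholds from the experiment configuration (purity ≥ 85%, cardinality ≥ 20) and encoding the signed edges as +1/-1:

```python
from collections import Counter, defaultdict

def promote_direct_edges(observations, min_cardinality=20, min_purity=0.85):
    """Phase 1 edge promotion: keep only node pairs whose dominant label
    is backed by enough volume (cardinality) and consensus (purity).

    `observations` is an iterable of (node_a, node_b, label) triples.
    """
    by_pair = defaultdict(Counter)
    for node_a, node_b, label in observations:
        by_pair[(node_a, node_b)][label] += 1

    edges = []
    for pair, counts in by_pair.items():
        label, top = counts.most_common(1)[0]
        purity = top / sum(counts.values())
        # Promote only clear Positive/Negative signals; "Average" is discarded.
        if top >= min_cardinality and purity >= min_purity and label != "Average":
            edges.append((pair, 1 if label == "Positive" else -1))
    return edges
```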

Phase 2: LLM-validated edges

Next, we address the "middle ground," which represents relationships where a pattern exists in the data but isn't quite strong enough for direct promotion. We experimented with multiple thresholds, settling on purity ≥ 55% as our selection.

For these edges, we use an LLM as a tiebreaker. We present the candidate edge to the LLM and ask for its opinion on whether the combination is good or bad. If the LLM’s response aligns with our data’s majority label, the edge is promoted. This step acts as a sanity check, using the LLM’s broader context to confirm subtle trends found in our dataset.

Phase 3: Synthetic edges

The final step involves edges that are either entirely missing from our dataset or are highly uncertain (Purity < 55%). Since the total number of possible combinations is massive, we use a strategic sampling approach:

  • Graph Density: we sample a subset of possible combinations based on a predefined density parameter.
  • Proportional Correction: If certain modalities had low acceptance rates in the previous two steps, we increase their representation in this sample to ensure the graph remains balanced.

For every sampled edge, we ask the LLM to predict the sign (Positive/Negative). These purely synthetic insights are then added to the graph.
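A sketch of the Phase 3 sampling step (the weighted-sampling scheme via Efraimidis-Spirakis keys is our own illustrative implementation of the proportional correction; parameter names are invented):

```python
import random

def sample_synthetic_edges(all_pairs, known_pairs, density=0.6, boost=None, seed=7):
    """Phase 3: sample missing modality pairs for LLM labelling.

    `density` controls what fraction of the missing pairs is sampled;
    `boost` optionally up-weights modalities that had low acceptance
    rates in Phases 1-2 (proportional correction).
    """
    rng = random.Random(seed)
    boost = boost or {}
    missing = [p for p in all_pairs if p not in known_pairs]
    k = int(density * len(missing))
    # Weighted sampling without replacement (Efraimidis-Spirakis keys):
    # higher-weight pairs get keys closer to 1 and sort to the front.
    keyed = sorted(
        missing,
        key=lambda p: rng.random() ** (1.0 / boost.get(p[0], 1.0)),
        reverse=True,
    )
    return keyed[:k]
```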

Hybrid dataset generation

By the end of this process, we have a Hybrid Graph that blends high-certainty real-world data, data-driven patterns validated by AI, and purely synthetic logical bridges.

This allows us to produce a robust, augmented dataset that we can use to train our final models, evaluating whether this hybrid methodology improves predictive performance compared to using raw data alone.

While the Hybrid Graph uses LLMs to label sparse, low-signal regions, it is fundamentally a passive data augmentation strategy. This prompted a shift in our objective: rather than arbitrarily imputing missing data, how can we optimise our query strategy to sample the most informative instances? Identifying the specific data points in the feature space that yield the highest expected information gain forms the basis of our second approach: Active Learning.

Strategy II: Active Learning

Beyond the static graph, we implemented an Active Learning loop to optimise a targeted knowledge expansion. Instead of labelling and training on a massive, randomly selected dataset, this paradigm identifies the specific “subspaces” where the model is weakest: areas of high uncertainty. The core philosophy is efficiency: by querying an “oracle” (in our case, an LLM) to label only these high-value points, we can significantly improve the model’s performance across the entire distribution using far fewer examples.

To find these high-value candidates, we first had to define a “Domain”, a delimited space of valid feature combinations. To keep this search space manageable and ensure the data remained realistic, we simplified our feature set:

  • Geography: We restricted candidates to single-country rows.
  • Interests: We used the semantic clusters created for our graph rather than the raw, high-volume interest list.
  • Demographics: To mirror real-world targeting, we converted generations into booleans (e.g., Millennial: Yes/No), allowing multiple generations to coexist. Gender was standardised to three specific nodes: Female, Male, and All.
  • Platform & Device: These were separated into distinct, individual attributes.
  • Objective: We treated the campaign objective as a single, unified block. This prevented the system from generating illogical combinations, such as pairing a “Link Clicks” objective with an incompatible cost model like “Page Engagement”.

By narrowing the domain this way, we ensured the LLM oracle was only labelling campaign scenarios that could actually exist in a real-world media plan. To determine the most effective way to refine our model, we tested three distinct strategies for selecting and labelling data points.

Method 1: Library-based optimisation (BoFire)

Initially, we attempted to leverage BoFire, a library designed for Bayesian optimisation. While it offers robust strategies for active learning, we encountered significant implementation barriers that made it impractical for our specific use case:

  • Memory overhead: The library’s core strategies consumed excessive memory, which did not scale with our dataset size.
  • Categorical constraints: Applying constraints to categorical features proved difficult outside of purely random approaches.
  • Scalability issues: Even lighter strategies (like SOBO) paired with Random Forest backbones consistently triggered Out-Of-Memory (OOM) errors.

These limitations led us to pivot toward custom implementations.

Method 2: Random Pool strategy

To move past these library constraints, we developed a custom Random Pool approach. This method focused on broad exploration of the campaign space:

  • Candidate generation: We created a pool of synthetic candidates by randomly combining feature values from our existing data domain.
  • Uncertainty sampling: After training a model on our baseline data, we used it to predict the performance of these synthetic candidates.
  • Selection & labelling: We identified the top n candidates where the model showed the highest uncertainty. These were then sent to our Google Gemini oracle to be labelled and integrated into the training set.
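The uncertainty-sampling step above can be sketched as follows (`predict_proba` below is a stand-in for the trained model's class-probability output):

```python
import math

def top_uncertain(candidates, predict_proba, n):
    """Uncertainty sampling: rank candidates by the Shannon entropy of the
    model's predicted class distribution and keep the top n for the oracle."""
    def entropy(probs):
        return -sum(p * math.log(p) for p in probs if p > 0)

    ranked = sorted(candidates, key=lambda c: entropy(predict_proba(c)), reverse=True)
    return ranked[:n]
```

In the real pipeline `candidates` is the 16-million-row synthetic pool and the selected rows are sent to the Gemini oracle for labelling.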

Method 3: Boundary Point exploration

While the Random Pool offered broad coverage, generating and predicting millions of combinations is computationally expensive. To maximise the value of every oracle query, we transitioned to the "Boundary Point Exploration" strategy. This method targets the "edges" between performance classes, the exact points where the model's decision-making is most fragile.

The workflow includes the following steps:

  • Initialisation: We begin by training a model on the current dataset and randomly selecting reference rows from each class (Positive, Negative, Average), as starting points.
  • Perturbation: We attempt to “flip” a data point from one class to another. For example, starting with a “Positive” campaign, we iteratively replace its feature values with those from a “Negative” campaign.
  • Boundary detection: We query the model at every step of this transformation. The moment the prediction flips (e.g., from Positive to Negative), we identify that specific configuration as a boundary point: a data point lying directly on the model’s decision margin.
  • Uncertainty sorting: We calculate the Shannon entropy for these boundary points to isolate the top K examples with the highest uncertainty.
  • Contextual labelling (Few-Shot oracle): To ensure high-quality labels for these tricky boundary cases, we use a Few-Shot Learning prompt. We provide the Gemini oracle with the m closest real-world rows from our dataset for each class as context. By seeing these real examples and their ground-truth outcomes, the LLM can calibrate its answers to our specific data distribution.
  • The loop: These high-value, context-aware candidates are integrated into the training set. This cycle repeats until we reach our total labelling budget n.
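A stripped-down sketch of the perturbation and boundary-detection steps (the `predict` function and feature names below are invented stand-ins for the retrained model and real modalities):

```python
def find_boundary_point(source, target, predict):
    """Replace features of `source` with those of `target` one at a time and
    return the first configuration whose predicted class flips: a data point
    lying directly on the model's decision margin."""
    start = predict(source)
    candidate = dict(source)
    for feature, value in target.items():
        candidate[feature] = value
        if predict(candidate) != start:
            return dict(candidate)
    return None  # the walk never crossed the decision boundary
```

In the full loop, these boundary points are then entropy-sorted and sent to the Few-Shot oracle, as described above.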

Experiments

To validate our methodologies, we compared two high-performing architectures from our previous research: LightGBM (LGBM), a tree-based gradient boosting model, and a custom Deep Learning (v2) model.

Our experimentation setup included:

  • Validation: We used a standard 80/20 train-test split, ensuring a 20% holdout for final evaluation on unseen data.
  • Hyper-parameter tuning: We utilised Optuna for automated hyper-parameter tuning. By maintaining the same tuning parameters as our previous benchmarks (e.g., learning rate and tree depth), we ensured a controlled environment, a setup that has proven reliable in our past work.

A detailed configuration per methodology is provided below.

Hybrid Graph configuration

  • High-confidence edges metrics: To consider an edge highly confident we require purity ≥ 85% and cardinality ≥ 20.
  • Confident edges to be LLM approved metrics: To consider an edge confident and get the LLM as the final approver, we require purity ≥ 55% and cardinality ≥ 5.
  • Graph density: 60%
  • LLM batch size: 40

For all relevant experiments, we use the graph with the aforementioned configuration and separate them by the number of rows requested from the generator: experiment HG1 uses the 13k-row dataset and experiment HG2 the 95k-row dataset.

Active Learning configuration

Experiment AL1: Random Pool experiments

  • Scale: Due to the need for representative density, we generated a massive pool of 16 million candidates. From this, we extracted the top n = 75,000 instances with the highest uncertainty.
  • Oracle Model: Gemini 2.5 Flash Lite (Standard Prompt).

Experiment AL2: Boundary Points experiments

  • Budget: We capped the additional data points at n = 50,000.
  • Batch size: The model was retrained after every batch of k = 100 new points to incrementally adjust the decision boundary. The pool of tipping points was set to 2,000.
  • Oracle model: Gemini 2.5 Flash Lite (Few-Shot Prompt). This configuration leveraged the contextual examples (m = 10 per class) described in the methodology to handle the nuance of boundary cases.

Results

Baseline

To measure the true impact of our hybrid methodologies, we first established a performance benchmark using two distinct approaches, evaluating against the 20% holdout test set:

  • Model-based baseline: We trained our candidate architectures (LightGBM and Deep Learning v2) exclusively on real-world campaign data.
  • LLM-only baseline: For comparison, we tested the LLM’s raw predictive capability. Instead of training a classifier, we passed campaign features directly to the model using both zero-shot and few-shot prompts to see how well its internal knowledge aligns with our specific advertising domain.
| Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
| --- | --- | --- | --- | --- |
| Tree-based | 60% | 54% | 72% | 55% |
| Deep Learning | 67% | 60% | 79% | 63% |
| LLM (regular prompt) | 24% | 20% | 47% | 29% |
| LLM (few-shot prompt) | 42% | 30% | 58% | 37% |
Performance Benchmark of Model-Based Baselines (Tree-based, Deep Learning) and LLM-Only Approaches (Zero-shot and Few-shot) on Real-World Campaign Data, measured by F1 Scores on a 20% Holdout Test Set.

Given the highly imbalanced nature of our dataset, we used Macro F1 as our primary evaluation metric. Our baseline experiments yielded the following results:

  • Primary baseline: The Deep Learning (v2) architecture established our strongest benchmark at 67% Macro F1, outperforming the tree-based LightGBM approach.
  • LLM raw performance: Direct prompting of the LLM resulted in significantly lower scores, ranging from 24% to 42%. This performance gap suggests that while LLMs possess broad semantic understanding, they lack the specialised, latent domain knowledge required for precise campaign forecasting.
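For reference, Macro F1 is the unweighted mean of the per-class F1 scores, so minority classes count as much as the dominant "Average" class. A minimal implementation:

```python
def macro_f1(y_true, y_pred, labels=("Negative", "Average", "Positive")):
    """Macro F1: unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```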

With the 67% Macro F1 mark established as the target to beat, we evaluated our two LLM-enhanced methodologies: the Hybrid Graph and Active Learning.

Using the results from the Deep Learning model, we performed an error analysis by feature to confirm that the performance distribution is relatively balanced across our main categories.

Error Analysis by Gender
Error Analysis by Country
Error Analysis by Age Generation
Error Analysis by Platform

Hybrid Graph results

Building a Hybrid Graph requires a careful calibration between our internal proprietary data and external LLM-generated features. The goal is to find an equilibrium where synthetic insights enhance, rather than overwhelm, the real-world signals.

To test this, we experimented with varying graph densities, to see if a denser “knowledge web” would improve the model’s ability to separate performance classes. We observed a few key constraints during this phase:

  • Data integrity: At lower densities, it was computationally impossible to generate a synthetic dataset without introducing significant “noise” from the dataset generator.
  • The 60% threshold: We ultimately selected a 60% graph density. While this is relatively high, it struck the best balance: it provided enough connectivity for the model to learn complex relationships while still allowing for “missing edges,” mirroring the gaps and uncertainties found in real-world advertising environments.
| Experiment | Variant | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
| --- | --- | --- | --- | --- | --- | --- |
| HG1.1 | 13k rows, 60% graph density | Tree-based | 37% | 51% | 10% | 50% |
| HG1.2 | 13k rows, 60% graph density | Deep Learning | 17% | 11% | 3% | 37% |
| HG2.1 | 95k rows, 60% graph density | Tree-based | 42% | 57% | 10% | 60% |
| HG2.2 | 95k rows, 60% graph density | Deep Learning | 24% | 14% | 8% | 35% |
Performance of Hybrid Graph experiment variants on real-world campaign data by model, measured by F1 scores.

Despite the theoretical potential of the Hybrid Graph, the experimental results showed a severe performance degradation across all model architectures. Even when scaling the synthetic dataset to 95,000 rows, Macro F1 scores plummeted to between 17% and 42%, well below our 67% baseline.

Our analysis identified two primary drivers for this decline:

1. Data starvation (Deep Learning)

The Deep Learning (v2) models performed the worst in this environment, with scores dropping to 17%-24%. Neural networks require significant data volume to converge on stable embeddings. The 13k-95k row range proved insufficient for the model to find meaningful patterns within the high-dimensional vector space, leading to poor representation and unstable predictions.

2. Signal dilution (Tree-Based & overall)

While the LightGBM models were more sample-efficient, managing slightly higher scores of 37% to 42%, they still failed to compete with the real-world baseline. We attribute this to how gradient boosting algorithms function:

  • The split problem: These models rely on identifying “hard” feature splits that represent true marketing correlations.
  • Generalised noise: By introducing LLM-generated graph edges, we essentially replaced nuanced, proprietary data with generalised, public-domain assumptions.
  • Generalisation failure: The tree models were forced to split on this “artificial noise,” which lacked the orthogonal predictive value needed to improve the model.

Ultimately, attempting to blend this specific graph structure with real data simply reverted the model toward the baseline performance, confirming that the LLM-generated features did not provide the specialised insights required for a performance lift.

Active Learning results

Moving away from broad feature generation, we tested our Active Learning loops (Broad Point Search vs. Targeted Boundary Search). The goal was to find the optimal empirical blending ratio, the "Golden Ratio" of real-world data to LLM-labelled data.

| Experiment | Variant | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
| --- | --- | --- | --- | --- | --- | --- |
| AL1.1 | Broad Point Search, Real data 60%, LLM data 40% | Deep Learning | 65% | 57% | 78% | 60% |
| AL1.2 | Broad Point Search, Real data 70%, LLM data 30% | Deep Learning | 66% | 59% | 77% | 62% |
| AL1.3 | Broad Point Search, Real data 80%, LLM data 20% | Deep Learning | 66% | 59% | 77% | 62% |
| AL2.1 | Targeted Point Search (regular prompt), Real data 78%, LLM data 21% | Deep Learning | 66% | 59% | 79% | 61% |
| AL2.2 | Targeted Point Search (few-shot prompt), Real data 78%, LLM data 21% | Deep Learning | 68% | 60% | 80% | 63% |
Performance of Active Learning experiment variants on real-world campaign data by model, measured by F1 scores.

Our empirical tuning revealed that an approximate 80/20 ratio (Real to LLM data) provided the most stability, anchoring the model in proven historical data while allowing for minor decision-boundary exploration.

The Targeted Point Search (Boundary Exploration) utilising a Few-Shot oracle prompt (Experiment AL2.2) achieved the highest Overall F1 at 68%, successfully surpassing our pure real-world baseline (67%) by a narrow margin.

This confirms what the baseline benchmarking suggested: the LLM lacks native understanding of our specific dataset and its ambiguous “grey areas”. Incorporating a few-shot prompt simply feeds the LLM the same signals already present in the real data. While this overlap acts as a stabilising force that prevents signal dilution, it also explains why the LLM’s additive value is so minimal, resulting in merely a 1% lift over our established 67% baseline.

Limitations & research constraints

While our methodologies provided a framework for augmenting campaign data, we identified several critical limitations that define the boundaries of our current findings.

The effectiveness of the Hybrid Graph is inherently tied to the alignment between proprietary data and external AI knowledge.

  • Response quality & divergence: The graph’s integrity depends on the LLM’s ability to generate high-quality, contextually accurate edges. We observed significant “disagreement” where LLM-generated logic conflicted with our real-world performance logs.
  • Pairwise simplification: Our architecture assumes that performance can be deconstructed into pairwise relationships between modalities (e.g., Platform vs. Geo). In reality, campaign success is often the result of complex, high-order interactions across all features simultaneously, which a simple graph structure may struggle to capture.

In our Active Learning loops, the model is only as good as its teacher.

  • Label reliability: We are limited by the oracle’s (Gemini) ability to accurately label synthetic campaign combinations. If the LLM provides a hallucinated label for a complex boundary case, it introduces noise directly into the training set.
  • Weighted parity: Currently, our training pipeline treats synthetic labels from the LLM with the same weight as ground-truth labels from real campaigns. This assumes the LLM’s “reasoning” is as valid as an actual historical conversion, which may not always hold true in a shifting media landscape.

The final, and perhaps most significant, constraint is the absence of creative data. While we have robust metadata regarding audiences, platforms, and other modalities, we lack the visual and auditory features of the ads themselves. In modern digital advertising, the "creative" is often the single most influential driver of engagement; omitting it means our models are operating without a primary piece of the performance puzzle.

Future work & next steps

To build on these findings and address the current limitations, our future research will focus on three primary areas:

  • Integrating creative intelligence: We aim to incorporate creative-level data back into the model. Since creative execution is often the strongest driver of campaign success, using vision-language models to extract features from ad imagery and copy will likely provide the “missing link” in our predictive accuracy.
  • Specialised oracles: Rather than using a general-purpose LLM, we plan to test a fine-tuned model specifically trained on historical marketing performance data. This specialised oracle would be used both to provide higher-quality labels in the Active Learning loop and to generate more accurate, domain-specific edges for the Hybrid Graph.
  • Real-world active learning: We intend to bridge the gap between simulation and reality by moving beyond a synthetic oracle. By implementing a live feedback loop, we can identify high-uncertainty campaign configurations, deploy them in real-world media environments, and use the actual performance results to update and refine our model in real-time.

Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.

Authors

  • Jael is a Data Scientist at Satalia, leveraging her physics background for a deep analytical foundation in complex systems analysis and modeling. Her experience, spanning foundational research in computational physics and a proven track record in data science consultancy, provides a unique perspective for architecting robust, scalable models in intricate environments. In Satalia’s Research Lab, she bridges scientific methodology with industrial innovation to address WPP’s most sophisticated data challenges. Her current research focuses on multimodal fusion models, aiming to improve campaign performance and pioneer state-of-the-art machine learning.

  • Eirini is a Data Scientist at Satalia with a multidisciplinary background in Management Science and Computer Science. She specialises in architecting end-to-end data science solutions, leveraging a deep technical toolkit to solve complex industrial challenges across diverse sectors. Known for bridging the gap between theoretical research and scalable application, she focuses on delivering high-impact models that translate abstract data patterns into actionable strategic intelligence.

    Her current research focuses on sophisticated campaign performance multimodal modelling and the development of data enrichment frameworks to maximise predictive accuracy.