Our prior research has confirmed a fundamental truth: Machine Learning is exceptionally good at finding patterns in existing data. By analysing thousands of past campaigns, these models identify the threads of success and can predict outcomes for similar strategies with high reliability. This is an incredibly powerful tool for optimising what we already know.
However, the real world often presents us with a different challenge: The Unknown.
- The data gap: While digital marketing generates vast amounts of information, it is rarely “clean” or perfectly integrated. Furthermore, when launching a new product or entering an entirely new market, historical data is often scarce or non-existent.
- The novelty gap: Traditional Machine Learning excels at spotting correlations, but it can struggle with the unprecedented. What happens when a novel creative concept emerges, or a sudden social trend shifts audience behaviour overnight? Because the model hasn’t “seen” these shifts in the past, it may lack the context to predict the future.
Bridging the gap: Can AI enhance our data?
At the heart of our latest research is a fundamental question:
What happens when we merge our proprietary data with the vast, world-level knowledge of a Large Language Model (LLM)?
While “adding AI” is a popular trend, real business value isn’t a given. We set out to discover if an LLM acts as a true force multiplier that fills in missing pieces, or if it simply repeats what we already know, or worse, introduces “noise” that clouds our judgment. To find the answer, we tested two distinct strategies to enhance our historical campaign data:
- Hybrid Graph creation: We build a digital “web” that connects our internal campaign facts with the LLM’s external context. This allows us to map out relationships between brands and audiences that our internal data alone might have missed.
- Active Learning: Think of this as a focused “tutoring” session. We use AI to identify the most confusing parts of our data. By putting an LLM in the loop to address these specific gaps, the model learns exactly where it can provide the most clarity.
By testing these methodologies, we are aiming to answer a critical, industry-defining question: Is the secret to superior performance simply more LLM integration, or does the true value still reside in the expert knowledge of marketing professionals that WPP has established over the years?
The path forward: Testing the synergy
To determine if AI-driven insights translate into real-world business value, we put two distinct methodologies to the test: Hybrid Graph creation and Active Learning. Each approach ensures that the LLM isn’t just a passive observer, but an active contributor to our strategy.
To test this synergy, we utilised a specialised export from WPP’s proprietary dataset, ensuring anonymity and privacy. This data captures the full lifecycle of a campaign, including information on audience characteristics, geographical location and platform-specific delivery settings. Crucially, each entry includes a definitive label indicating the campaign’s objective and its final outcome, allowing us to measure success with high precision. For a deeper dive into the architecture and specific variables of this dataset, please refer to the Campaign Intelligence Dataset Pod.
Strategy 1: Hybrid Graph creation – beyond the spreadsheet
Instead of looking at data as a simple list, we treat it as a dynamic relationship map. Imagine this network as a constellation where every individual data point, whether it’s a demographic like ‘woman’, a platform like ‘Facebook’, or a region like ‘Spain’, becomes a node. These nodes are interconnected by lines that represent the strength and nature of their relationships.
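To make the structure concrete, here is a minimal sketch of such a relationship map in Python. The node names, weights and plain-dictionary representation are illustrative stand-ins, not the production graph:

```python
# Minimal sketch of the relationship map: nodes are attribute values,
# weighted edges carry an observed (or LLM-suggested) success rate plus
# a provenance tag. All names and weights here are illustrative.
graph = {}  # (node_a, node_b) -> {"weight": float, "source": str}

def add_edge(a, b, weight, source):
    """Store an undirected edge under a canonical (sorted) key."""
    graph[tuple(sorted((a, b)))] = {"weight": weight, "source": source}

# Internal, data-derived relationships.
add_edge("audience:women", "platform:Facebook", 0.85, "internal")
add_edge("region:Spain", "platform:Facebook", 0.62, "internal")

# External context an LLM might contribute where internal data is thin.
add_edge("audience:women", "region:Spain", 0.55, "llm")

def neighbours(node):
    """All nodes connected to `node`, with the edge metadata."""
    return {
        (b if a == node else a): meta
        for (a, b), meta in graph.items()
        if node in (a, b)
    }

print(neighbours("platform:Facebook"))
```

Keeping a `source` tag on every edge is what makes the graph “hybrid”: downstream code can weight internal evidence and LLM suggestions differently.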
To visualise this concept, here is an example of how this relationship network is mapped in our Hybrid Graph:

By combining our internal campaign history with the LLM’s broader world knowledge, we create a “Hybrid Graph”. With this multi-layered map, we expect to see connections that traditional spreadsheets miss. The process of building it unfolds in four strategic phases:
Phase 1: Establishing the ground truth
Before we ask an AI for help, we perform a “Deep Dive” into our historical data to separate coincidence from repeatable success through two rigorous tests:
- The consistency test (purity): We look for a clear “verdict”. For example, if a specific audience and brand pairing resulted in a positive outcome 85% of the time, we have a reliable pattern. The data is giving us a clear “Yes”.
- The volume test (cardinality): Consistency only matters if it happens often. A pairing that succeeds 85% of the time across hundreds of campaigns is a statistically significant trend; the same rate across a handful of campaigns could be a stroke of luck.
By filtering through these lenses, we identify the bedrock of our dataset.
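The two filters above can be sketched in a few lines. The records, thresholds and helper below are hypothetical, chosen only to illustrate the purity-plus-cardinality idea:

```python
from collections import defaultdict

# Hypothetical campaign records: (audience, brand, outcome) triples.
records = [
    ("women 25-34", "BrandA", "positive"),
    ("women 25-34", "BrandA", "positive"),
    ("women 25-34", "BrandA", "positive"),
    ("women 25-34", "BrandA", "negative"),
    ("men 18-24", "BrandB", "positive"),
]

PURITY_THRESHOLD = 0.7       # consistency test: share of the dominant outcome
CARDINALITY_THRESHOLD = 3    # volume test: minimum number of observations

def ground_truth_pairs(records, purity=PURITY_THRESHOLD,
                       min_count=CARDINALITY_THRESHOLD):
    """Return (audience, brand) pairs that pass both tests, with verdicts."""
    counts = defaultdict(lambda: {"positive": 0, "negative": 0})
    for audience, brand, outcome in records:
        counts[(audience, brand)][outcome] += 1
    confirmed = {}
    for pair, c in counts.items():
        total = c["positive"] + c["negative"]
        if total >= min_count and max(c.values()) / total >= purity:
            confirmed[pair] = ("positive" if c["positive"] >= c["negative"]
                               else "negative")
    return confirmed

print(ground_truth_pairs(records))
```

In this toy data, the (women 25-34, BrandA) pair passes with a 75% positive rate over four campaigns, while the BrandB pair fails the volume test with only one observation.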
Phase 2: The expert second opinion
Next, we turn to the “Emerging Patterns”: combinations where the data shows a clear leaning (like a 60% success rate), but the evidence isn’t yet overwhelming. Historically, these “maybe” scenarios might have been ignored. Now, we invite the LLM to act as a Strategic Consultant.
- The power of the upvote: When the LLM’s intuition aligns with our data’s hints, we gain a new level of confidence. For example, if a location and brand pairing resulted in a positive outcome 60% of the time, we suspect there is a pattern, but we need the LLM to confirm.
- Validation through synergy: By getting a “Yes” from the AI to back up our data, we move these patterns from the “maybe” pile into our active knowledge base.
Phase 3: Illuminating the “dark spots”
Finally, we shift our focus to the areas our data couldn’t reach, the “dark spots”. These are combinations that were excluded because our data was too noisy or the scenarios were entirely new.
We identify every combination where we currently lack confidence. The number of such combinations is vast, making it computationally infeasible to check them all, so we sample these gaps and ask the LLM for original insights based on its understanding of global markets. Typical targets look like this: an audience and location pairing that never appears in the dataset, or a combination that produces a positive outcome half the time and a negative outcome the other half. Because the real data offers no clear pattern, we ask the LLM directly.
In short, this allows us, through the LLM, to clarify noisy data and predict outcomes for entirely new scenarios.
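A minimal sketch of this sampling step, assuming a budgeted LLM call; the attribute values and the `ask_llm` stub are placeholders, not the real pipeline:

```python
import itertools
import random

# Illustrative attribute values.
audiences = ["women 25-34", "men 18-24", "gen z"]
locations = ["Spain", "Germany", "Japan"]

# Pairs with a confident verdict (from phases 1 and 2); everything else
# is a "dark spot": unseen, or observed with a ~50/50 outcome split.
known = {("women 25-34", "Spain"), ("men 18-24", "Germany")}

dark_spots = [
    pair for pair in itertools.product(audiences, locations)
    if pair not in known
]

# Checking every gap is infeasible at real scale, so we sample a
# budgeted subset and send only those to the LLM.
random.seed(0)
budget = 3
sampled = random.sample(dark_spots, budget)

def ask_llm(audience, location):
    # Placeholder: a real system would prompt an LLM for a verdict here.
    return "unknown"

verdicts = {pair: ask_llm(*pair) for pair in sampled}
```

The budget caps the number of oracle calls, trading coverage of the dark spots for a fixed, predictable cost.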
The challenge of illuminating our data “dark spots” led to a major breakthrough. Our first approach, the Hybrid Graph, essentially looks at low-signal campaign data and uses LLMs to make an educated guess to fill in the gaps. But this sparked a bigger, more strategic question: Instead of just guessing what’s in the dark, how can we actively hunt for the exact pieces of missing information that will make our predictions smarter? Out of millions of possible campaign combinations, how do we pinpoint the specific scenarios that will teach our model the most? This strategy of “smart hunting” forms the foundation of our second approach: Active Learning.
Phase 4: Combining everything together
The culmination of this research is a unified Hybrid Graph. By merging our proven history, our validated suspicions and our newly discovered insights, we create a living map of intelligence.
The result is a specialised dataset that is expected to offer the best of both worlds:
- The grounding of reality: Rooted in the hard facts of our actual campaign history.
- The foresight of AI: Enhanced by the vast, contextual knowledge of the LLM.
Strategy 2: Active Learning – solving the puzzle of uncertainty
Where the Hybrid Graph fills gaps with new insights, Active Learning focuses on a different truth: data isn’t always helpful if it’s redundant. To truly advance our models, we don’t need more of what we already know; we need clarity in the “grey areas” of our knowledge.
For example, imagine our data clearly shows that “TikTok campaigns” aimed at “Gen Z” are consistently successful, while “LinkedIn campaigns” aimed at “Millennials” usually underperform. But what happens if we want to run a “TikTok campaign” for “Millennials”? The model might be completely unsure if there is no clear pattern for that specific combination. Instead of analysing thousands more Gen Z campaigns we already understand, Active Learning specifically targets this exact missing combination. By resolving this one grey area, the model learns whether the platform or the audience age is the true driver of performance.
In the world of data, this uncertainty occurs when there isn’t a strong, consistent signal: when parts of a dataset tell conflicting stories, making it difficult to separate real patterns from mere noise.
In modern marketing, the number of possible combinations between audiences, brands and locations is astronomical; blindly analysing every single one would be incredibly slow, if not impossible. Instead, we use Active Learning as a strategic guide to identify the specific “pockets” of a dataset where our current models are struggling the most. It sifts through the records and picks only the most confusing, yet valuable, points for evaluation.
By focusing our efforts strictly on the areas where the model is most uncertain, we achieve two major goals:
- Maximised intelligence: We gain the most knowledge from the fewest possible data points.
- Operational speed: We bypass the “noise” of what we already know, allowing us to build high-performing models in a fraction of the time.
Ultimately, this approach turns a daunting, “infinite” dataset into a manageable, high-impact asset.
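In code, the simplest form of this selection is margin-based uncertainty sampling: keep the scenarios whose predicted success probability sits closest to 0.5. The scenarios and probabilities below are illustrative stand-ins for real model output:

```python
# Hypothetical (platform, audience) scenarios with the current model's
# predicted probability of success.
candidates = {
    ("TikTok", "Gen Z"): 0.93,          # clearly successful: little to learn
    ("LinkedIn", "Millennials"): 0.12,  # clearly weak: little to learn
    ("TikTok", "Millennials"): 0.51,    # the grey area we want resolved
    ("Facebook", "Gen Z"): 0.64,
}

def most_uncertain(predictions, k):
    """Return the k scenarios whose probability is closest to 0.5."""
    return sorted(predictions, key=lambda s: abs(predictions[s] - 0.5))[:k]

queries = most_uncertain(candidates, k=2)
print(queries)  # the scenarios to send to the oracle
```

The model’s confident predictions are deliberately skipped: spending oracle calls on them would only confirm what the data already shows.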
The LLM as our “oracle”
Identifying the most uncertain points in our data is only half the battle; the real value lies in what we do with them. Once we have selected these high-priority “grey areas,” we bring in the LLM to act as our oracle.
Using sophisticated prompting techniques, we present these uncertain points to the AI for a professional verdict. Our goal is to transform these pockets of doubt into certainty, backed by high-quality expert information.
By doing this, we effectively bridge the “information gap”. We aren’t just adding more data for the sake of volume; we are harvesting targeted knowledge. This process turns a previously unknown variable into a strategic asset, ensuring that our final model isn’t just a reflection of what we’ve seen before, but a fusion of our experience and the AI’s broader market expertise.
Two paths to higher intelligence
To find the most efficient way to “teach” our models, we experimented with several strategies for choosing which questions to ask our LLM oracle. Below, we outline our core foundational technique and the more advanced method that has proven to be our most effective to date.
Approach 1: The broad search
This is a high-level “scouting” mission. We create a large pool of random potential campaign scenarios and ask our current model to predict how they would perform. We then identify the scenarios where the model is the most confused, the “shaky” predictions, and send those directly to the LLM oracle for a definitive answer. It’s a fast, effective way to shore up general weaknesses in our knowledge.
Approach 2: The targeted stress test (our top performer)
Our most successful approach is much more surgical. Instead of looking at random scenarios, we actively look for the “tipping points”, the exact moment a campaign shifts from being a success to a failure, or vice versa.
- Finding the edge: We take a known successful campaign and a known failure, then subtly blend their features to create a new, “borderline” scenario.
- Measuring confusion: We keep adjusting the features until a pre-trained auxiliary model (in this case, a tree-based one) flips its prediction. We then rank and select the scenarios where the outcome is most uncertain, ensuring we capture the most informative data points for our oracle to review.
- The expert verdict: We present these precise “tipping points” to the LLM oracle. By giving the AI specific examples of similar successes and failures as context, we get an incredibly high-quality label.
- Iterative learning: Once the LLM provides the answers for these “grey areas,” we integrate them into our official records. We then retrain our auxiliary model on this newly enriched dataset, making it instantly more precise. From there, the process begins again, creating a continuous loop that proactively hunts for and eliminates our model’s remaining blind spots.
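The “finding the edge” and “measuring confusion” steps can be sketched as a binary search along the line between a known success and a known failure. A toy linear scorer stands in for the tree-based auxiliary model, and the feature vectors and weights are illustrative:

```python
# Toy stand-in for the pre-trained auxiliary model:
# predict 1 (success) when a weighted feature score is positive.
WEIGHTS = [0.8, -0.5, 0.3]

def aux_model(features):
    score = sum(w * f for w, f in zip(WEIGHTS, features))
    return 1 if score > 0 else 0

def blend(a, b, t):
    """Feature-wise interpolation between scenarios a and b (t in [0, 1])."""
    return [x + t * (y - x) for x, y in zip(a, b)]

def find_tipping_point(success, failure, steps=20):
    """Binary-search the blend ratio where the prediction flips."""
    lo, hi = 0.0, 1.0
    for _ in range(steps):
        mid = (lo + hi) / 2
        if aux_model(blend(success, failure, mid)) == 1:
            lo = mid  # still predicted success: move toward the failure
        else:
            hi = mid  # already flipped: move back toward the success
    return blend(success, failure, (lo + hi) / 2)

success = [1.0, 0.2, 0.9]  # known positive campaign
failure = [0.1, 1.0, 0.1]  # known negative campaign
borderline = find_tipping_point(success, failure)
# `borderline` sits on the model's decision boundary: a maximally
# informative scenario to hand to the LLM oracle for labelling.
```

A real implementation would repeat this over many success/failure pairs and rank the resulting borderline scenarios by residual uncertainty before sending them to the oracle.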

By repeating this process, we don’t just add data; we specifically “fix” the model’s most significant blind spots. This iterative loop ensures that our final engine isn’t just bigger, but significantly smarter.
Results
The balancing act: Extracting the final datasets
Building a Hybrid Graph is a delicate exercise in calibration. Our challenge was to find the perfect equilibrium: How much should we trust our internal data and how much “weight” should we give to the LLM’s external knowledge?
To test this, we generated several different graph versions, eventually selecting the largest and most robust one. This ensured our Synthetic Data Generator had a dense enough “knowledge web” to create high-quality, non-random datasets. To keep our findings clear, we kept environmental “noise” to a minimum, ensuring we were testing the core intelligence of the graph itself.
Similarly, when building the datasets to test our Active Learning strategies, we had to find the right blend of human experience and AI insight. After testing multiple configurations, we discovered our “Golden Ratio” was in the region of 80% Real-World Data and 20% LLM Knowledge. This 80/20 balance proved to be our most effective setting. It ensures the model remains firmly grounded in the proven reality of WPP’s historical success, while still allowing enough “AI intuition” to fill in the gaps and explore new strategic frontiers.
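A sketch of how such an 80/20 training set might be assembled; the row contents and sizes are placeholders:

```python
import random

# Placeholder datasets: real campaign rows and LLM-labelled rows.
random.seed(42)
real_rows = [{"source": "real", "id": i} for i in range(1000)]
llm_rows = [{"source": "llm", "id": i} for i in range(400)]

def blend_datasets(real, llm, real_share=0.8, size=500):
    """Sample a training set with the requested real/LLM proportions."""
    n_real = int(size * real_share)
    n_llm = size - n_real
    mixed = random.sample(real, n_real) + random.sample(llm, n_llm)
    random.shuffle(mixed)  # avoid ordering effects during training
    return mixed

train = blend_datasets(real_rows, llm_rows)
```

Keeping `real_share` as an explicit parameter makes it easy to sweep different blends, which is how a ratio like 80/20 would be discovered empirically.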
The reality check: Lessons from the data
To evaluate the results, we ran a “head-to-head” test. We trained one model using only real-world data and another one using our LLM-enhanced hybrid dataset. We then tested both against a “holdout” set of real campaign results.
Here are the results of our models, trained on the real dataset and tested against the holdout:
| Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
|---|---|---|---|---|
| Tree-based | 60% | 54% | 72% | 55% |
| Deep Learning | 67% | 60% | 79% | 63% |
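For reference, the per-class F1 columns in these tables can be computed as follows; the labels and predictions here are illustrative, not the actual holdout:

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy holdout: true outcomes vs model predictions.
y_true = ["avg", "avg", "pos", "neg", "avg", "pos"]
y_pred = ["avg", "pos", "pos", "neg", "avg", "avg"]

scores = {c: f1_per_class(y_true, y_pred, c) for c in ("neg", "avg", "pos")}
# One common "overall" figure is the unweighted (macro) mean of the
# per-class scores; the research may aggregate differently.
overall = sum(scores.values()) / len(scores)
```

Per-class scores make the imbalance problem visible: a model can post a respectable overall figure while scoring poorly on the rare “Positive” and “Negative” classes.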
Building on the foundations of our previous research (From guesswork to foresight: How AI is predicting the future of marketing campaigns), we transitioned our models from a controlled synthetic environment to the complexities of 100% real-world campaign data.
Our standard models, which previously proved their strength in synthetic testing, delivered a highly competitive baseline. This “Reality Benchmark” set a high bar, while simultaneously identifying clear opportunities for our LLM-based techniques to add value.
The results revealed a clear trend: while the models excelled at identifying “Average” campaigns, they struggled to pinpoint the extreme “Positive” or “Negative” outliers. This is a common phenomenon in real-world marketing. Unlike our controlled synthetic environments, where we can perfectly balance the ratios, real-world data is heavily weighted toward “average” outcomes. Exceptional successes and disasters are rare, making them significantly harder for a model to learn and predict.
Within this context, Deep Learning (v2) emerged as our strongest baseline, achieving a solid 67% overall F1 score. The Tree-Based Approach performed slightly below the Deep Learning architecture, reinforcing our decision to move toward more “relational” neural networks to navigate the noise and imbalance of complex marketing datasets.
By establishing this 67% mark as our “Line in the Sand,” we can clearly measure the true impact of our Hybrid Graph and Active Learning interventions. Here is how our LLM-enhanced methodologies performed:
| Method | Variant | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
|---|---|---|---|---|---|---|
| Hybrid Graph | 13k rows, 60% density | Tree-based | 37% | 51% | 10% | 50% |
| Hybrid Graph | 13k rows, 60% density | Deep Learning | 17% | 11% | 2% | 37% |
| Hybrid Graph | 90k rows, 60% density | Tree-based | 42% | 57% | 10% | 60% |
| Hybrid Graph | 90k rows, 60% density | Deep Learning | 24% | 14% | 8% | 35% |
| Active Learning | Broad Point Search (Real 80%, LLM 20%) | Deep Learning | 66% | 59% | 77% | 62% |
| Active Learning | Targeted Point Search (Real 78%, LLM 21%) | Deep Learning | 68% | 60% | 80% | 63% |

The findings in the table above were unexpected, but deeply insightful: we noticed a significant drop in performance when the LLM was added to the loop for the Hybrid Graph, and almost no increase with Active Learning.
The hybrid graph challenge: A significant divergence
The most striking finding was the performance of the Hybrid Graph. Despite increasing the data volume to 90k rows, the scores dropped significantly, with overall F1 falling to between 17% and 42%.
This drop reveals a fundamental truth: generic LLMs are trained on public domain knowledge. They lack the specialised, proprietary marketing intelligence that WPP possesses. By weaving general AI “intuition” into a specialised graph, we introduced noise that actively diluted the high-quality signals of our real-world data.
Even with a denser graph, our models struggled to maintain a consistent F1 score. This proves that marketing success hinges on the niche, proprietary data unique to our field, information that simply isn’t available in the public sphere, rather than just a larger volume of generic data.
Active Learning: Reaching the efficiency frontier
In contrast, our Active Learning strategies, specifically the Targeted Point Search, successfully met the benchmark. Using our “Golden Ratio” (78% Real / 21% LLM), the Targeted Point Search achieved a 68% F1 score, slightly outperforming our best real-world baseline.
While our Targeted Point Search allowed us to maintain performance levels comparable to our 100% real-world baseline, we have to be honest: we expected a more significant leap. To justify a process of this complexity, the “performance lift” needs to be undeniable. This brings us to two critical, strategic questions:
- The quality risk: For such a marginal improvement in accuracy, is it worth introducing external AI “intuition” into our proprietary ecosystem when we cannot be 100% certain of its quality?
- The computational cost: Does the slight increase in predictive power justify the high computational expense and the mathematical difficulty of hunting for these “tipping points”?
In its current state, the answer is a cautious “No”. While the technology is fascinating, the results prove that our internal, high-fidelity data is already doing the heavy lifting. Introducing expensive, public-model “noise” for a 1% gain doesn’t just challenge our efficiency; it risks diluting the “Gold Standard” intelligence that WPP already possesses.
The strategic conclusion: Expert-led AI
Our research serves as a powerful reminder that AI is a force multiplier, not a replacement. The performance drop we saw with the Hybrid Graph Dataset underlines the immense competitive advantage of WPP’s proprietary data; generic models simply cannot replicate the “niche” intelligence we already possess.
While Active Learning was able to match our 67% baseline, “matching” the status quo is not enough to justify the hype or the computational cost.
The core insight: Data quality is the ultimate moat
This research proves a fundamental truth: Data quality is everything. A generic AI cannot replace the deep, specialised expertise of a marketing professional. The failure of the “public” LLM to improve our results demonstrates that the real path to success lies in keeping our experts in the loop. By using high-fidelity, professional strategy rather than general internet trends, we ensure our models are learning from the best in the business.
Moving forward: From baseline to breakthrough
To bridge the gap between “adequate” and “exceptional,” we have identified two clear technical paths to evolve this research:
- The fine-tuned oracle: Our current experiments used “off-the-shelf” LLMs. To truly elevate the results, the next logical step is to use fine-tuned models: AIs that have been specifically trained on WPP’s historical successes and internal playbooks. This transforms the oracle from a generalist into a marketing specialist.
- Real-world Active Learning: The ultimate validation of Active Learning isn’t a digital oracle; it’s the market itself. A real-time loop can be built with Active Learning: identify high-potential “blind spots”, launch those as live test campaigns, and feed the real-world performance back into our models. This moves us from theoretical testing to real-world evolution.
Ready to explore the specifics? Read our full technical deep dive in the Data Enrichment Pod for a closer look at our methodology.
Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.