Predicting digital marketing campaign outcomes effectively is challenging because of the complex, multimodal nature of campaign data. To advance this area, we developed multimodal fusion models for multiclass campaign outcome classification and rigorously evaluated them using a Synthetic Data Generator that produces controlled datasets. Benchmarking a LightGBM model against two deep learning architectures, we consistently found that the deep learning models significantly outperform the baseline, especially in noisy and complex environments. The Robust Modality Network, our top performer, maintained a Macro F1 score above 80% under the most challenging conditions, confirming the robustness of embedding-based, multimodal deep learning for reliable campaign prediction.
If you don’t care about the technical details, read our blog post instead.
From data to growth: Benchmarking modality-aware models for campaign prediction
Predicting the performance of digital marketing campaigns prior to deployment is an intricate, high-dimensional optimisation problem. Historically, navigating this complex and scattered landscape of decisions has required significant human expertise and iterative adjustments to optimise outcomes. A primary technical hurdle in solving this programmatically is the extreme heterogeneity of the feature space. Campaigns are inherently multimodal, composed of fundamentally different data types: structured demographic data (audiences), unstructured semantic text (brand identity), visual assets (images), and spatial coordinates (geographic locations).
To address this, we frame campaign performance prediction as a multiclass classification problem, where the objective is to categorise expected outcomes into discrete performance tiers (e.g., Negative, Average, Positive). To process the disparate inputs, we employ multimodal fusion models designed to independently encode these varied data types and project them into a unified, continuous representation space.
The core question driving our research is whether these advanced machine learning architectures can reliably decode the intricate, cross-channel variables that dictate campaign success.
Can multimodal fusion models, when rigorously evaluated against controlled synthetic environments, accurately capture and predict the complex, non-linear dependencies that drive marketing performance?
This documentation details our machine learning strategies designed to transition campaign creation from historical guesswork to data-driven predictive modelling.
Datasets & encoding
Dataset
To evaluate our models effectively, we needed a data source that offered more precision than standard real-world datasets. In many domains, “ground truth” labels can be subjective or noisy, if known at all. To bypass this, we used the Synthetic Dataset Generator, a service that produces labelled datasets with deterministic characteristics.
By using synthetic data, we can control the difficulty of the task and ensure that the labels (Positive, Average, or Negative) are derived from a known mathematical structure rather than human intuition.
At the core of our generator is a deterministic graph structure. Each data point is not just a collection of random attributes, but a set of interconnected nodes. The class label for any given entry is determined by the specific edges (relationships) between these nodes.
The attributes used in our datasets are organized into five distinct modalities:
- Audience: A composite profile including gender, age groups, and specific interest categories.
- Brand: A descriptive paragraph outlining the brand’s identity and core values.
- Creative: A text-based caption describing the visual asset used in the campaign.
- Geo: A list of specific zip codes representing the target geographic locations.
- Platform: The delivery channel or platform used for the distribution.
By adjusting the configuration of the generator, we can manipulate the complexity of the relationships within the graph. The signal strength is defined by the fraction of positive versus negative edges within a data point. By manipulating these thresholds, we can directly tune the separability of the classes, making the generator a flexible tool for benchmarking and stress-testing models under different conditions.
We generated three distinct dataset versions, of increasing complexity:
- v25 (strong signal): This is our “cleanest” dataset and serves as the baseline. In each data point, at least 45% of the edges belong to the target class, while noise from the opposite class is capped at 25%. With high separation and minimal interference, this version tests the model’s ability to identify clear patterns.
- v26 (medium signal): This variant introduces significant “label pollution.” While target class edges can reach up to 90% density, the opposite class noise can rise to 35%. This creates a much noisier environment, particularly for the “Average” rows, which consist of a messy, conflicting mix of both positive and negative edges.
- v28 (low signal / high volume): This is our most challenging stress test. The signal is weaker than in the previous datasets, with only 34% to 44% target-class edges, while noise remains substantial relative to the signal (12% to 24%). In this set, the “Average” class is characterized by sparsity. Rather than being a mix of signals, these rows contain fewer edges overall, most of which connect to unrelated nodes. This forces the model to learn how to identify the absence of a strong signal rather than just filtering out noise.
Detailed parameter configurations for the positive, negative, and average datasets are provided in the table below.
| Label | Parameter | V25 | V26 | V28 |
|---|---|---|---|---|
| pos | pos_frac_range | [0.45, 1] | [0.45, 0.9] | [0.3387, 0.4387] |
| pos | neg_frac_range | [0, 0.25] | [0.1, 0.35] | [0.1192, 0.2444] |
| pos | num_samples | 18,000 | 18,000 | 24,000 |
| neg | pos_frac_range | [0, 0.25] | [0.1, 0.35] | [0.1035, 0.2259] |
| neg | neg_frac_range | [0.45, 1] | [0.45, 0.9] | [0.3833, 0.4833] |
| neg | num_samples | 18,000 | 18,000 | 24,000 |
| avg | pos_frac_range | [0.25, 0.5] | [0.25, 0.65] | [0.1035, 0.2259] |
| avg | neg_frac_range | [0.25, 0.5] | [0.25, 0.65] | [0.1192, 0.2444] |
| avg | num_samples | 24,000 | 24,000 | 72,000 |
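To make the edge-fraction mechanics concrete, here is a minimal sketch of how a generator of this kind could draw the fractions for a data point of a known class. The function and dictionary names are our own illustration, not the actual generator API; the ranges mirror the v25 column of the table above.

```python
import random

# Hypothetical sketch of the generator's labelling logic.
# Ranges correspond to the v25 configuration in the table above.
V25_RANGES = {
    "pos": {"pos_frac_range": (0.45, 1.0), "neg_frac_range": (0.0, 0.25)},
    "neg": {"pos_frac_range": (0.0, 0.25), "neg_frac_range": (0.45, 1.0)},
    "avg": {"pos_frac_range": (0.25, 0.5), "neg_frac_range": (0.25, 0.5)},
}

def sample_edge_fractions(label, ranges=V25_RANGES, rng=random):
    """Sample the positive/negative edge fractions for one data point.

    The class label is fixed first; the edge structure is then drawn from
    the label's configured ranges, so the "ground truth" is deterministic
    by construction rather than inferred from the data.
    """
    cfg = ranges[label]
    pos_frac = rng.uniform(*cfg["pos_frac_range"])
    neg_frac = rng.uniform(*cfg["neg_frac_range"])
    return pos_frac, neg_frac
```

Tightening or widening these ranges is exactly the knob that separates v25 from v26 and v28.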
Encoding
Since our dataset features range from long-form text descriptions to specific geospatial markers, we employ distinct encoding strategies to transform these modalities into a format our machine learning models can process.
Text embeddings
For all text-based fields, we use the text-embedding-005 model (part of the Google Gemini family). We chose high-dimensional embeddings over traditional methods like “one-hot encoding” because they capture semantic relationships. For instance, an embedding model allows the system to recognise that “Instagram Stories” and “Facebook Reels” are semantically similar, even if the raw text doesn’t match. This ensures that the nuances of our campaign metadata are preserved, providing the predictive model with a rich, context-aware starting point.
Depending on the specific model architecture, we configured the embedding parameters to match the downstream task:
- Deep Learning (Arch 1): We use a dimensionality of 768 with the task type set to CLASSIFICATION.
- LGBM & Deep Learning (Arch 2): We use a dimensionality of 512 with the task type set to SEMANTIC_SIMILARITY.
Geospatial encoding
The Geo modality consists of US zip codes. Rather than treating these as simple text strings or categorical IDs, we use Google’s Population Dynamics Foundation Embeddings (PDFM).
These are dense vector representations designed to encapsulate the complex, multidimensional interactions between human behavior, environmental factors, and local contexts at specific locations. By using PDFM, the model can capture spatial relationships and demographic overlaps that would be impossible to detect using zip codes alone.
To aggregate our geospatial data, we use a straightforward but effective pooling strategy. Since each campaign can target a list of multiple US zip codes, we retrieve the individual PDFM embedding for each location in the set. We then calculate the centroid (mean) of these vectors to produce a single, final representation for the entire geographic list.
By averaging the embeddings, we create a unified “geographical profile” that represents the collective demographic and behavioral characteristics of the target area, ensuring the model receives a fixed-size input regardless of how many zip codes are in the list.
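As a sketch (the function name is ours), the pooling step reduces to a mean over the stacked per-location vectors:

```python
import numpy as np

def pool_geo_embeddings(zip_embeddings):
    """Mean-pool a variable-length list of per-zip-code PDFM vectors
    into a single fixed-size "geographical profile" embedding."""
    stacked = np.stack(zip_embeddings)   # (num_zip_codes, embedding_dim)
    return stacked.mean(axis=0)          # centroid: (embedding_dim,)
```

Whatever the length of the zip-code list, the model always receives one vector of the PDFM embedding width.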
Methodology
After extensive experimentation with various machine learning approaches, we converged on three primary models to evaluate predictive performance across our synthetic datasets. We use a high-performing tree-based model as our baseline benchmark, alongside two distinct deep learning architectures designed to capture multi-modal relationships.
Hierarchical LightGBM
Our baseline approach uses a custom LightGBM implementation that treats the multi-class problem (Negative, Average, Positive) as a two-stage hierarchical classification. By decomposing the problem into two simpler binary tasks, we can optimise the model to better distinguish the “Positive” class from the others.
The model consists of two separate LightGBM classifiers that run sequentially during inference.
Stage 1: Identifying the positive class
The first classifier focuses on separating positive samples (class 2) from the combined pool of negative and average samples (classes 0 and 1 respectively). It is trained on the entire dataset with balanced class weights. During inference, if the model’s predicted probability for Class 2 exceeds a dynamically tuned threshold (pos_threshold), the sample is definitively assigned to the Positive class.
Stage 2: Distinguishing negative from average
Any sample not classified as Positive in Stage 1 is passed to the second classifier. This stage is trained exclusively on a subsample of the data (where the true label is not Class 2) to focus on the nuances between Class 0 (Negative) and Class 1 (Average). If the predicted probability for Class 0 is greater than 0.5, it is assigned to the Negative class; otherwise, it is labelled Average.
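The two-stage decision rule can be sketched as follows; `hierarchical_predict` is a hypothetical helper operating on the probability outputs of the two trained classifiers rather than the actual implementation:

```python
import numpy as np

def hierarchical_predict(stage1_proba, stage2_proba, pos_threshold=0.6):
    """Two-stage inference sketch.

    stage1_proba[i]: P(class 2 | x_i) from the positive-vs-rest classifier.
    stage2_proba[i]: P(class 0 | x_i) from the negative-vs-average classifier.
    pos_threshold is the dynamically tuned cut-off from Stage 1.
    """
    preds = np.empty(len(stage1_proba), dtype=int)
    for i, p_pos in enumerate(stage1_proba):
        if p_pos > pos_threshold:
            preds[i] = 2          # Positive: decided in Stage 1
        elif stage2_proba[i] > 0.5:
            preds[i] = 0          # Negative: decided in Stage 2
        else:
            preds[i] = 1          # Average: the fallback label
    return preds
```

Decomposing the problem this way lets each binary classifier specialise on a simpler boundary than the full three-way split.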
Data handling and class drift
To ensure the model remains robust against distribution shifts, we implemented two key data-processing steps:
- Categorical encoding: While most features use high-dimensional embeddings, the Platform modality is treated as a categorical value. It passes through a standard LabelEncoder before being fed into the LightGBM models.
- Addressing class drift: To prevent the model from overfitting to a majority class, we monitor the target distribution during training. If the majority class cardinality exceeds the minority by more than 1.4x, we apply a selective downsampling strategy. This caps the majority class size and ensures the gradient boosting process isn’t biased toward the most frequent labels.
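A minimal sketch of the selective downsampling step, assuming NumPy arrays and a hypothetical helper name of our own:

```python
import numpy as np

def cap_majority(X, y, max_ratio=1.4, seed=0):
    """Downsample the majority class if it exceeds the smallest class
    by more than max_ratio (1.4x in our setup); otherwise pass through."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[counts.argmax()]
    minority_count = counts.min()
    if counts.max() <= max_ratio * minority_count:
        return X, y                              # no drift detected
    keep = np.flatnonzero(y != majority)         # keep all non-majority rows
    maj_idx = np.flatnonzero(y == majority)
    cap = int(max_ratio * minority_count)        # cap the majority class size
    keep = np.concatenate([keep, rng.choice(maj_idx, size=cap, replace=False)])
    keep.sort()
    return X[keep], y[keep]
```

This keeps the gradient boosting stages from drifting toward whichever label happens to dominate a given dataset version.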
Shared latent space (Deep Learning Arch1)
The first deep learning architecture is designed to project all five modalities (regardless of their original format) into a unified, shared latent space. The goal is to ensure that related features (like a specific “Brand” and a specific “Geo”) that perform well together are placed close to each other before the model attempts to classify them.
The projection architecture
Each modality begins by passing through its own dedicated Projection Block. This stage standardizes the varied input dimensions into a consistent representation using a sequence of:
- Linear Layer: Mapping the input to a common projection dimension.
- ReLU Activation: Introducing non-linearity.
- LayerNorm: Ensuring stable gradients and consistent scaling across different feature types.
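In PyTorch-style code, a projection block of this shape might look like the sketch below. The projection dimension and input widths are illustrative assumptions, not the production configuration (PDFM embeddings are 330-dimensional, which we use for the Geo block here).

```python
import torch
import torch.nn as nn

class ProjectionBlock(nn.Module):
    """Per-modality projection: Linear -> ReLU -> LayerNorm."""
    def __init__(self, in_dim, proj_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, proj_dim),  # map to the common projection dim
            nn.ReLU(),                    # non-linearity
            nn.LayerNorm(proj_dim),       # stable scaling across modalities
        )

    def forward(self, x):
        return self.net(x)

# One dedicated block per modality, each mapping its native embedding
# width into the same shared 256-d space (widths here are examples).
blocks = nn.ModuleDict({
    "audience": ProjectionBlock(768),
    "geo": ProjectionBlock(330),
})
```

Because every block ends in the same `proj_dim`, the downstream alignment loss and classification head can treat all modalities uniformly.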
Cross-modal alignment (similarity loss)
To force the model to learn how these modalities interact, we implement a Cosine Embedding Loss. We calculate this loss across every possible pair of modality projections (e.g., Brand vs. Creative, Audience vs. Platform).
By setting a positive target for these pairs, we mathematically “pull” the vectors closer together in the latent space. This ensures that features belonging to the same campaign are mapped to a similar neighborhood in the vector space, creating a coherent, multi-modal representation.
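A sketch of the pairwise alignment term, assuming PyTorch and a helper name of our own:

```python
import itertools
import torch
import torch.nn as nn

def alignment_loss(projections):
    """Cosine embedding loss summed over every pair of modality projections.

    `projections` maps modality name -> (batch, proj_dim) tensor. A target
    of +1 for every pair "pulls" same-campaign vectors together.
    """
    criterion = nn.CosineEmbeddingLoss()
    total = 0.0
    for (_, a), (_, b) in itertools.combinations(projections.items(), 2):
        target = torch.ones(a.size(0))   # +1: treat each pair as similar
        total = total + criterion(a, b, target)
    return total
```

With five modalities this sums over ten pairs; in practice the term is weighted against the classification loss (the weighting constant is one of the tuned hyperparameters).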
The classification head
Once the projections are aligned and concatenated, they are fed into a Multi-Layer Perceptron (MLP) for the final prediction:
- Linear Layer: Aggregates the combined, multi-modal features.
- ReLU & Dropout (0.3): Provides non-linearity while preventing overfitting by randomly deactivating neurons during training.
- Linear Layer (output): Maps the features to the three target classes (Negative, Average, Positive).
Optimisation
Alongside the cosine embedding loss, we use CrossEntropyLoss to optimise the classification task, teaching the model which feature combinations lead to specific outcomes. The entire network is trained using the AdamW optimiser, which provides effective weight decay to maintain the integrity of the learned embeddings.

Robust modality network (Deep Learning Arch2)
While our first architecture focused on aligning different data types into a shared space, our second deep learning model, the Robust Modality Network, is designed for resilience. Its primary goal is to handle data sparsity and prevent the model from becoming overly dependent on a single “strong” feature, such as a specific platform or brand description.
Deep projection and feature weighting
Unlike the single-layer projections in Arch 1, this model uses a deeper projection block for each modality. By passing each input through a sequence of Linear -> Dropout -> Linear -> LayerNorm, the model gains enough capacity to “filter” the high-dimensional embeddings. This allows the network to internally re-weight the features, emphasising the specific parts of an embedding (like a particular interest group within the Audience modality) that are most predictive of the final outcome.
Modality dropout: Forced independence
One of the unique features of this architecture is the implementation of a Modality Dropout layer. Before the data reaches the final classification head, the model randomly “zeroes out” an entire modality (e.g., completely removing the “Creative” or “Geo” data for a specific training pass).
This serves two critical purposes:
- Robustness: It mimics scenarios where certain data might be missing or “noisy” in production.
- Feature importance: It prevents the model from “cheating” by over-relying on one dominant modality. By forcing the model to make accurate predictions even when a key feature is missing, we ensure it learns the subtle interactions between the remaining inputs.
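A minimal PyTorch sketch of the idea (the class name and dropout probability are illustrative):

```python
import torch
import torch.nn as nn

class ModalityDropout(nn.Module):
    """Randomly zeroes out an entire modality's projection during training,
    forcing the head to cope with a missing input."""
    def __init__(self, p=0.15):
        super().__init__()
        self.p = p  # probability of dropping each modality independently

    def forward(self, modality_tensors):
        if not self.training:
            return modality_tensors      # no-op at inference time
        return [t * 0.0 if torch.rand(1).item() < self.p else t
                for t in modality_tensors]
```

Unlike standard dropout, which zeroes individual neurons, this operates at the granularity of whole modalities, which is what prevents over-reliance on any single input stream.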
Streamlined MLP head
After the modalities are projected and filtered through the dropout layer, they are concatenated and fed into a robust, two-layer Multi-Layer Perceptron (MLP). This head uses ReLU activations and standard dropout to process the combined features into the final three-class prediction.
During development, we experimented with complex attention mechanisms to weigh these modalities. However, we found that this simpler, MLP-based approach, combined with Modality Dropout, yielded equivalent performance with significantly lower computational overhead and better generalization.
Optimisation
As with our first architecture, we use CrossEntropyLoss to optimise the parameters and AdamW as the optimiser. This combination effectively teaches the projection layers which feature combinations lead to Positive, Average, or Negative performance while maintaining a stable training process.

Experimentation
Leaderboard
To maintain rigorous benchmarking standards, we implemented a centralized Leaderboard platform integrated directly with the Synthetic Dataset Generator. This system acts as our definitive evaluation environment, decoupling model architecture from data preparation and ensuring every experiment is measured against the same constraints.
We expose pre-defined train/test splits through a dedicated API endpoint. By forcing every model to use the exact same data distribution, we effectively eliminate “local validation bias”, the risk of a researcher inadvertently tuning a model to a specific random split of the data.
The evaluation process follows a strict protocol:
- Researchers query the endpoint for specific versioned splits (e.g., v25 or v28)
- Models are trained on the provided local data
- Predictions are submitted back to the platform for server-side computation
- The evaluation metrics are returned to the researcher for review
This creates a “blind” evaluation process and a persistent system of record, allowing us to trace how performance evolves across different architectures and signal strengths.
While accuracy is a common metric, it can be misleading in datasets where classes aren’t perfectly balanced. A model could achieve high accuracy simply by over-predicting the majority class while failing on the others.
To prevent this, we use the Macro Average F1 score as our primary ranking metric in the Leaderboard. The Macro F1 treats all three classes (Negative, Average, and Positive) as equally significant. This ensures that a model must perform well across the entire spectrum to rank highly, rather than just overfitting to the most frequent label.
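The distinction is easy to demonstrate with scikit-learn: a degenerate predictor that always emits the majority class scores well on accuracy but collapses under Macro F1 (the toy labels below are our own illustration).

```python
from sklearn.metrics import accuracy_score, f1_score

# Imbalanced toy labels: "Average" (1) dominates, with one "Negative" (0)
# and one "Positive" (2). The predictor always outputs the majority class.
y_true = [1] * 8 + [0, 2]
y_pred = [1] * 10

acc = accuracy_score(y_true, y_pred)      # looks strong: 0.8
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
# Macro F1 averages per-class F1, so the two ignored classes drag it down.
```

Ranking on Macro F1 therefore rewards models that perform across all three tiers, not just the frequent one.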
To ensure each architecture performed at its peak, we utilized structured optimisation techniques to move beyond default configurations.
Hierarchical LightGBM configuration
- Optimisation: Conducted 10–20 Optuna search trials to maximize Leaderboard performance.
- Efficiency strategy: Hyperparameter searches were performed on a representative subset of the data, with the final model training on the full dataset.
- Custom evaluation metric: Optimised trials using a weighted F1 score to prioritise the classes most critical to campaign success:
- Positives: 50%
- Negatives: 40%
- Averages: 10%
- Tuned parameters:
- `pos_threshold` (for Stage 1 classification)
- Stage 1 & 2 individual parameters: `num_leaves`, `n_estimators`, `learning_rate`, `min_data_in_leaf` (separate values for each stage)
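The weighted objective above can be sketched with scikit-learn's per-class F1 scores (the helper name is ours; class indices follow the 0/1/2 = Negative/Average/Positive convention used earlier):

```python
from sklearn.metrics import f1_score

def weighted_campaign_f1(y_true, y_pred):
    """Optuna trial objective: per-class F1 weighted 50/40/10 for
    Positive (2) / Negative (0) / Average (1)."""
    per_class = f1_score(y_true, y_pred, average=None,
                         labels=[0, 1, 2], zero_division=0)
    neg_f1, avg_f1, pos_f1 = per_class
    return 0.5 * pos_f1 + 0.4 * neg_f1 + 0.1 * avg_f1
```

Because the weights sum to 1, a perfect classifier scores 1.0, while errors on Positives cost five times as much as errors on Averages.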
Deep Learning Arch1 configuration
- Search method: Utilized Grid Search across various architectural and training configurations.
- Tuned parameters:
- Network layer dimensions (categorised as Small, Medium, or High).
- Similarity loss weighting constant (balancing the alignment of modalities vs. classification accuracy).
- Batch size variations.
- Training safeguards: Implemented early stopping to prevent overfitting as the latent space aligned.
Deep Learning Arch2 configuration
- Optimisation: Utilized Optuna, similar to the LGBM approach, for a more granular search of the neural network’s hyperparameter space.
- Tuned parameters:
- Learning Rate and Batch Size.
- Dimensions for both the deep projection blocks and the MLP head.
- Dropout rates for both individual neurons and entire modalities.
- Dynamic training callbacks:
- Early stopping: Ends training when validation performance plateaus.
- Learning Rate Scheduler: Automatically reduces the learning rate by 50% if validation accuracy fails to improve, allowing for finer optimisation in the final epochs.
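Using PyTorch's built-in scheduler, the halve-on-plateau behaviour can be sketched as follows (the model, learning rate, and patience here are illustrative, not our production values):

```python
import torch

model = torch.nn.Linear(8, 3)
optimiser = torch.optim.AdamW(model.parameters(), lr=1e-3)
# mode="max": we track validation accuracy; factor=0.5 halves the LR
# once the metric fails to improve for `patience` consecutive epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimiser, mode="max", factor=0.5, patience=2)

for epoch in range(5):
    val_accuracy = 0.70            # stand-in for a real validation loop
    scheduler.step(val_accuracy)   # flat metric, so the LR gets halved
```

Pairing this with early stopping lets the final epochs take smaller, finer optimisation steps instead of terminating outright at the first plateau.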
Results
We evaluated our three candidate models across the synthetic datasets to determine how architectural choices impact robustness as we move from high-signal environments to noisy, high-volume scenarios. The following results quantify each architecture’s resilience against increasing data entropy and feature sparsity, using the Macro F1 score as our primary benchmark.
| Dataset | Model | Overall F1 | Neg F1 | Avg F1 | Pos F1 |
|---|---|---|---|---|---|
| V25 | Hierarchical LGBM | 85.11% | 86.40% | 80.21% | 88.73% |
| V25 | Deep Learning Arch1 | 86.26% | 89.81% | 82.51% | 86.44% |
| V25 | Deep Learning Arch2 | 89.47% | 91.25% | 85.95% | 91.22% |
| V26 | Hierarchical LGBM | 85.54% | 87.48% | 80.92% | 88.21% |
| V26 | Deep Learning Arch1 | 84.80% | 89.41% | 80.50% | 84.50% |
| V26 | Deep Learning Arch2 | 88.84% | 91.13% | 85.26% | 90.14% |
| V28 | Hierarchical LGBM | 77.40% | 76.40% | 81.47% | 74.34% |
| V28 | Deep Learning Arch1 | 80.74% | 80.88% | 86.10% | 75.25% |
| V28 | Deep Learning Arch2 | 81.29% | 82.25% | 84.58% | 77.03% |
The evaluation reveals a clear distinction between traditional gradient boosting and deep learning models when handling complex data structures.
- The baseline limits: Our LGBM implementation demonstrated strong efficacy in high-signal environments (v25 and v26), consistently maintaining a Macro F1 score above 85%. It proved particularly effective at binary discrimination (Positive vs. Negative), leveraging its ability to find clean decision boundaries. However, we observed a quantifiable performance degradation in the high-noise, high-volume regime (v28), where the score regressed to 77.40%. This indicates a sensitivity to lower signal-to-noise ratios and sparse data.
- The Deep Learning advantage: Both Deep Learning architectures consistently outperformed the baseline across all dataset variations. This performance delta is largely due to the networks’ ability to learn dense representations within a latent space. Unlike tree-based logic, which relies on hierarchical feature splitting, these models (specifically DL Arch 2) effectively capture non-linear interactions between modalities like Brand, Geo, and Audience.
- Generalization under pressure: The architectural advantage became most pronounced in the v28 dataset. While the tree-based model struggled to generalize under high sparsity, the Deep Learning models maintained a Macro F1 score above 80%.
These results confirm that an embedding-based approach, combined with modality-specific projection and dropout, offers superior generalization in high-dimensional spaces. It allows the model to distinguish a weak signal from statistical noise without overfitting to the majority class.
Limitations
While the results on our synthetic benchmarks are promising, several technical and methodological limitations remain that will guide the next phase of our research.
The embedding quality gap
A primary constraint lies in the nature of “off-the-shelf” embeddings. Even with high-performing models like text-embedding-005, we must consider how well a generic model captures domain-specific nuances.
For example, for the brand modality, a descriptive paragraph is compressed into a fixed-length vector. It remains an open question whether this representation truly captures the “brand universe” or the subtle differentiators between competing identities. If the embedding space is too crowded or lacks the granularity to distinguish between two similar but distinct brand strategies, the downstream model’s predictive power will inevitably hit a ceiling.
Synthetic vs. real-world fidelity
Working with synthetic data allows for a deterministic “ground truth,” but it introduces a gap in realism. While our generator mimics complex relationships, we face a few core uncertainties:
- Noise consistency: Is the mathematical noise we introduce in datasets like V28 truly representative of the “messy” data found in real campaigns?
- Modality alignment: Real campaign data often contains unstructured elements that don’t fit perfectly into our five-modality graph.
- The “switch” risk: There is always the risk that a model optimised for a synthetic environment will struggle with the unpredictable variance of live campaign performance. Transitioning from the laboratory to the wild is the ultimate test of our current architectures.
The infinite search space
In machine learning, the architecture search and experimentation lifecycle is never truly “finished.” The sheer volume of potential configurations, from different attention mechanisms to alternative fusion strategies, requires significant time and compute resources to validate.
Prioritising which ideas to pursue is a constant challenge. Every “definitive answer” we find often opens up several new branches of inquiry, meaning our current “best” model is likely just a stepping stone toward a more refined solution.
Future work & next steps
With the synthetic benchmarking phase yielding concrete results, our focus shifts toward validating these findings in production environments and refining our feature engineering pipelines. The transition from a controlled “laboratory” setting to the complexity of live data will be the ultimate test for our current architectures.
- Real-world validation: The immediate priority is to migrate our top-performing architectures from the controlled synthetic environment to the real world. We will test whether the architectural advantages observed in the Campaign Performance Modelling Pod translate effectively to real-world campaign data (Data Enrichment Pod).
- Continual learning & data privacy: Develop strategies for continuously updating models with new data without compromising previously acquired knowledge (e.g., mitigating catastrophic forgetting). This is crucial for integrating new clients, predicting their campaign performance, and exploring paradigms like federated learning to ensure data privacy and secure knowledge sharing across different entities (Multimodal Federated Learning Pod).
- Feature representation enhancement: We are launching dedicated workstreams to improve the fidelity of our embeddings:
- Creatives: Optimising the captioning pipeline to capture visual semantics and context more accurately (Self Improvement Performance Agent Pod),
- Brand: Developing richer, more descriptive representations of brand identity beyond simple identifiers (Brand Perception Atlas Pod).
- Geo: Investigating optimal encoding strategies for real-world geospatial data, moving beyond the synthetic zip code generation used in this benchmark.
- LLM-centric experiments: We aim to explore Large Language Models beyond just feature encoding. Future iterations will test LLM fine-tuning for direct classification tasks, as well as leveraging generative models for advanced oversampling techniques to address class imbalance in sparse datasets (LLM Finetuning Pod and Data Enrichment Pod).
- Architectural evolution: We will continue to refine our current Deep Learning and Tree-based baselines while actively scouting for novel architectures that may offer better inductive biases for our specific data modalities (experimentation with AlphaEvolve).
- Recommendation systems: The ultimate objective is to deploy these architectures as the backbone of a recommendation engine. A dedicated exploration phase is required to bridge the gap between prediction and prescription. It is a known paradox that the most accurate predictor is not always the most effective recommender; depending on the optimisation strategy used, a model that captures the dense, underlying relationships between features may be more valuable than one optimised purely for label discrimination.
Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP Research team.