Using Synthetic Data to Train and Stress-Test Marketing Machine Learning Models

Unlocking machine learning experiments across multiple teams with a synthetic data pipeline grounded in marketing knowledge

Training Machine Learning (ML) models for marketing usually starts with a hard requirement: labelled data that links campaign settings and attributes to actual performance outcomes. You collect campaigns, look at what combinations of brand, audience, platform, and geography performed well, and train a model to learn from those patterns.

In theory, that sounds straightforward. In practice, real data is hard to clean and structure, accumulates slowly, and only reflects combinations you’ve already run. If you’ve never targeted a certain audience on a certain platform for a certain brand, that example simply doesn’t exist in the dataset. And if multiple teams are waiting on that data before they can even begin experimenting, progress stalls fast.

We ran into exactly that problem.

We needed a way to start training and benchmarking marketing ML systems before AI-ready real campaign data was available at a useful scale. So instead of waiting for the data, we built a synthetic data pipeline that could generate realistic, labelled training data grounded in how marketing actually works.

That pipeline ended up unblocking model experiments across multiple teams.

The Problem With Random Synthetic Data

Real campaign data is essentially rows of campaign attributes (brand, audience, location, platform, placement, creative, and many more) each labelled with how that combination performed. That’s what a model learns from.

This kind of data is easy to fake badly. You can always create random combinations of attributes and assign them labels. But for marketing, random is worse than useless if it ignores real-world compatibility. A luxury brand paired with bargain-hunting audiences, or a B2B enterprise software brand matched with a fashion lifestyle platform, doesn’t help an ML model learn. It teaches the wrong lessons.

So the challenge wasn’t just “generate fake data”. It was:

  1. Capture that marketing knowledge in a structured, machine-readable form
  2. Use that structure to generate realistic campaign configurations at scale

What we needed was a structured way to encode that compatibility: given any combination of campaign settings, does it make sense or not?

Encoding Marketing Knowledge as a Graph

We chose a versatile structure: a graph.

In a marketing knowledge graph:

  • Nodes represent attribute values for different modalities, such as brand, audience, platform, country, and any other factor that can influence the outcome of a campaign.
  • Edges represent compatibility between two attributes:
    • A positive edge (+) means the pair is expected to work well together within the same campaign.
    • A negative edge (-) means the pair is a bad fit, likely to damage the cohesiveness and the performance of the campaign.
    • No edge means there’s no meaningful signal. A neutral fit.

That gives us a machine-readable map of marketing relationships.

Some simple examples:

  • LinkedIn ↔ C-Suite Executives → positive
  • Luxury brand ↔ Budget shoppers → negative
  • Salesforce ↔ TikTok → negative
  • Adidas ↔ K-pop fans → positive
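As a minimal sketch, the signed graph can be represented with plain Python structures (the helper name and dict layout here are illustrative, not the pipeline's actual implementation):

```python
# Nodes carry their modality; edges carry a compatibility sign.
nodes = {
    "LinkedIn": "platform",
    "C-Suite Executives": "audience",
    "Luxury brand": "brand",
    "Budget shoppers": "audience",
    "Salesforce": "brand",
    "TikTok": "platform",
    "Adidas": "brand",
    "K-pop fans": "audience",
}

# Undirected signed edges, keyed by an order-independent pair.
edges = {
    frozenset({"LinkedIn", "C-Suite Executives"}): "+",
    frozenset({"Luxury brand", "Budget shoppers"}): "-",
    frozenset({"Salesforce", "TikTok"}): "-",
    frozenset({"Adidas", "K-pop fans"}): "+",
}

def compatibility(a: str, b: str) -> str:
    """Return '+', '-', or 'neutral' when no edge exists."""
    return edges.get(frozenset({a, b}), "neutral")

print(compatibility("LinkedIn", "C-Suite Executives"))  # +
print(compatibility("LinkedIn", "Adidas"))              # neutral
```

The absence of an edge falling through to "neutral" mirrors the third case above: no meaningful signal either way.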

This structure worked well for three reasons:

  • It naturally captures many-to-many relationships
  • It’s easy to extend with new brands, audiences, and platforms
  • It’s interpretable enough for humans to inspect and validate

Once you have that graph, you can start generating synthetic campaign examples that are constrained by actual compatibility signals instead of randomness.

The Bottleneck: Building the Graph Was Expensive

The obvious way to build this graph was to leverage the capabilities of Large Language Models to classify every possible pair of attributes from a catalogue of brands, audiences, geographies, and other marketing settings of interest.

That approach can work for small catalogues, say 20 brands, 50 audiences, 10 countries, and 5 platforms. But catalogues that small are not especially useful in practice, since ML models need training data that is both diverse and high-volume.

As the catalogue grows, pairwise combinations quickly become a bottleneck: the number of possible pairs grows quadratically with the number of attributes, so even a moderately sized catalogue creates thousands of cross-modality pairs. That made a brute-force approach too slow and too expensive for routine iteration. Even with batched calls, where one primary attribute is compared against a list of target attributes, the volume was still prohibitive.
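Since pairs only connect attributes from different modalities, the count is the sum of products over modality pairs. A quick check with the catalogue sizes used in the experiment later in the article:

```python
from itertools import combinations

def cross_modality_pairs(sizes: dict) -> int:
    """Count attribute pairs that cross modality boundaries."""
    return sum(a * b for a, b in combinations(sizes.values(), 2))

catalogue = {"brands": 60, "audiences": 60, "platforms": 10, "countries": 30}
print(cross_modality_pairs(catalogue))  # 8700

# Doubling every modality roughly quadruples the pair count.
doubled = {k: 2 * v for k, v in catalogue.items()}
print(cross_modality_pairs(doubled))    # 34800
```

This is the quadratic growth in action: 2× the attributes yields roughly 4× the pairs.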

So we needed a way to build the graph without evaluating the entire space of possible combinations.

But that creates an obvious dilemma: how do you find the important pairs without first checking them all?

Two Ways We Approached Graph Generation

To answer that question, we implemented and compared two graph generation strategies.

1. Batched brute-force pair classification

A truly naive strategy would have been to ask the LLM about every single attribute pair one by one, but we did not test that because it is clearly too inefficient to be practical.

Instead, for each valid cross-modality combination, we selected one primary attribute and asked the LLM to classify its relationship to a batch of up to 25 target attributes as positive, negative, or neutral.

The batch size of 25 was chosen deliberately:

Prior work shows that batch size affects LLM classification quality: larger batches are more efficient, but can reduce consistency across judgments. We therefore set the batch size as a practical trade-off between efficiency and quality.

This gave us a strong reference point: broad coverage with a simple implementation, useful for evaluating whether a more efficient method could preserve similar graph quality without the same cost.
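The batching logic can be sketched as follows; the prompt text and the `classify_batch` stub are hypothetical stand-ins for the real LLM call, which the article does not spell out:

```python
from typing import Callable, Dict, List

BATCH_SIZE = 25  # trade-off between call efficiency and judgment consistency

def build_prompt(primary: str, targets: List[str]) -> str:
    """One primary attribute vs. a numbered list of target attributes."""
    lines = "\n".join(f"{i + 1}. {t}" for i, t in enumerate(targets))
    return (
        f"For the marketing attribute '{primary}', classify its compatibility "
        f"with each attribute below as positive, negative, or neutral:\n{lines}"
    )

def classify_pairs(primary: str, targets: List[str],
                   classify_batch: Callable[[str], List[str]]) -> Dict[str, str]:
    """Split targets into batches of up to BATCH_SIZE and collect labels."""
    results: Dict[str, str] = {}
    for start in range(0, len(targets), BATCH_SIZE):
        batch = targets[start:start + BATCH_SIZE]
        labels = classify_batch(build_prompt(primary, batch))
        results.update(zip(batch, labels))
    return results
```

With 60 targets, this issues three calls (25 + 25 + 10) instead of sixty pairwise ones.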

2. Cluster-first graph generation

The second approach was designed to reduce the search space before asking the LLM to score anything.

Instead of classifying every attribute pair directly, we first:

  • embedded the attributes and applied UMAP for dimensionality reduction,
  • clustered them by modality using HDBSCAN,
  • asked the LLM to batch score compatibility between clusters,
  • discarded neutral cluster pairs and their attribute combinations,
  • automatically assigned scores to attribute combinations derived from high-confidence cluster pairs,
  • and asked the LLM to batch classify only the remaining attribute pairs.

This turned a very large search space into a much smaller one, so the LLM spent time only where useful signals were more likely to exist.

For small catalogues, the efficiency gains are smaller because many attributes end up as singleton clusters, but the same architecture still applies.
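The pruning step after clustering can be sketched like this. The cluster assignments and cluster-pair scores would come from UMAP + HDBSCAN and an LLM batch call; here they are hard-coded with invented example clusters purely to illustrate the expansion logic:

```python
from itertools import product

def expand_cluster_pairs(members, cluster_scores, confidence_threshold=0.8):
    """Split attribute pairs into auto-labelled ones and ones that
    still need individual LLM classification."""
    auto_labelled, needs_llm = {}, []
    for (ca, cb), (sign, conf) in cluster_scores.items():
        if sign == "neutral":
            continue  # neutral cluster pairs are discarded outright
        for a, b in product(members[ca], members[cb]):
            if conf >= confidence_threshold:
                auto_labelled[(a, b)] = sign  # inherit the cluster label
            else:
                needs_llm.append((a, b))      # classify at attribute level
    return auto_labelled, needs_llm

members = {
    "luxury_brands": ["Gucci", "Prada"],
    "budget_audiences": ["Bargain hunters", "Coupon clippers"],
    "b2b_platforms": ["LinkedIn"],
}
cluster_scores = {
    ("luxury_brands", "budget_audiences"): ("-", 0.95),  # high confidence
    ("luxury_brands", "b2b_platforms"): ("+", 0.55),     # borderline
}
auto, pending = expand_cluster_pairs(members, cluster_scores)
```

Here the high-confidence cluster pair labels all four of its attribute pairs for free, while only the two borderline pairs would go back to the LLM.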

What Happened When We Compared Them

On a larger catalogue of 160 attributes — 60 brands, 60 audiences, 10 platforms, and 30 countries — the cluster-based approach performed much better operationally.

Compared with brute force, it delivered:

  • 53% fewer LLM calls
  • 50.5% less execution time
  • 90.6% of the total edge volume retained

More importantly, where both methods produced an edge for the same pair, they agreed on the sign 98% of the time, which shows the cluster-based approach did not systematically change the meaning of the relationships it recovered.

The main trade-off was coverage: some pairs found by brute force were filtered out before attribute-level scoring, most likely lower-signal or borderline cases.

In practice, this gave us a much cheaper way to generate the graph while preserving the compatibility signal that mattered most.

The scaling advantage becomes even clearer when projected to larger catalogues:

Catalogue size           Total pairs*   Brute-force LLM calls*   Cluster-based LLM calls*
160 attributes           8,700          570                      265
320 attributes (2×)      ~34,800        ~2,280                   ~750
800 attributes (5×)      ~217,500       ~14,250                  ~2,960
1,600 attributes (10×)   ~870,000       ~57,000                  ~8,400

  * Directional estimates extrapolated from the 160-attribute experiment. Actual call volumes will vary with catalogue structure, clustering behaviour, and graph density.

From Graph to Actual Training Data

Once we had a signed graph, the next problem was turning it into an actual labelled campaign performance dataset.

Each row in this dataset represents one synthetic campaign configuration (a combination of attributes drawn from the graph) along with a performance label: pos, neg, or avg. That label is the training target. It describes whether the overall campaign combination is expected to perform well, underperform, or land somewhere in between.

Important note: The label is not the same as a graph edge. Edges score pairs of attributes; the label scores the whole configuration, aggregated across the signs of all its edges.

Figure 1 – Example of a row from the campaign performance dataset

This dataset is the output of the second service in the pipeline: the Synthetic Dataset Generator. Its job is to create synthetic campaign records from the graph while respecting configurable constraints such as:

  • how many attributes of each type should appear in each sample,
  • how many positive, negative, and average examples to produce,
  • and what proportion of positive vs. negative edges each label class should contain.

For example, a positive sample might require a relatively high fraction of positive edges and a low fraction of negative ones. A negative sample would do the opposite, while an average sample would contain more balanced fractions of both.
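A labelling rule along those lines can be sketched as below. The threshold values and function names are assumptions for illustration, not the pipeline's actual configuration:

```python
from itertools import combinations

def label_configuration(attributes, edge_sign, pos_min=0.6, neg_min=0.6):
    """Label a whole campaign configuration ('pos', 'neg', or 'avg')
    from the fraction of positive vs. negative edges among its pairs."""
    signs = [edge_sign(a, b) for a, b in combinations(attributes, 2)]
    scored = [s for s in signs if s != "neutral"]
    if not scored:
        return "avg"  # no signal at all lands in the middle
    pos_frac = scored.count("+") / len(scored)
    neg_frac = scored.count("-") / len(scored)
    if pos_frac >= pos_min:
        return "pos"
    if neg_frac >= neg_min:
        return "neg"
    return "avg"

# Toy signed-edge lookup for three attributes.
toy_edges = {
    frozenset({"Adidas", "K-pop fans"}): "+",
    frozenset({"Adidas", "TikTok"}): "+",
    frozenset({"K-pop fans", "TikTok"}): "-",
}
sign = lambda a, b: toy_edges.get(frozenset({a, b}), "neutral")
print(label_configuration(["Adidas", "K-pop fans", "TikTok"], sign))  # pos
```

Two of three scored edges are positive (a two-thirds fraction), clearing the assumed 0.6 threshold, so the configuration is labelled "pos".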

That gave us complete control of the dataset. The same graph could generate multiple datasets (with different class balances, difficulty levels, noise profiles, and schemas), just by changing configuration, not rebuilding the pipeline.

Simulated annealing: searching the graph efficiently

To find valid combinations for each dataset row efficiently, we used a parallelized simulated annealing sampler. The name comes from annealing in metallurgy, where a material is heated and then cooled in a controlled way to reduce defects and settle into a more stable structure.

Our algorithm follows the same idea. It starts in a “hot” state, exploring many possible campaign configurations and even accepting imperfect ones early on. As it cools, it becomes more selective, swapping attributes in and out until each sample settles into a configuration that satisfies the requested constraints.
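A toy, single-threaded version of that loop looks like this. The cost function, cooling schedule, and parameter values are illustrative assumptions, not the production sampler:

```python
import math
import random

def anneal(pool, size, cost, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Search for a size-`size` attribute combination with low `cost`."""
    rng = random.Random(seed)
    state = rng.sample(pool, size)
    best, best_cost, temp = list(state), cost(state), t0
    for _ in range(steps):
        # Propose a move: swap one attribute for one not currently in use.
        candidate = list(state)
        candidate[rng.randrange(size)] = rng.choice(
            [a for a in pool if a not in candidate])
        delta = cost(candidate) - cost(state)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature cools.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            state = candidate
        if cost(state) < best_cost:
            best, best_cost = list(state), cost(state)
        temp *= cooling
    return best
```

In the real sampler, `cost` would measure how far a configuration is from the requested constraints (edge-sign fractions, attribute counts per modality); early "hot" iterations accept imperfect configurations, and the cooling factor makes later iterations increasingly selective.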

Downstream Impact and ML Experiments Unlocked

This service was not just a technical exercise. Its purpose was to unblock machine learning workstreams while real campaign data was still limited, not ready, or missing key combinations. Without it, multiple experiments would have been blocked.

The Synthetic Dataset Generator produced 49 synthetic datasets, built from multiple graph versions and configurations. Those datasets were used to both train and stress-test models across different teams and modelling approaches. Each dataset varied in class balance, difficulty, and noise to probe how models behaved under pressure. Experiments included:

  • Campaign performance prediction
  • Federated learning experiments
  • Architecture search and model benchmarking
  • Comparisons between fine-tuned LLMs and custom classifiers

We also built a shared model leaderboard so teams could compare results across dataset versions and training approaches without manual coordination.

That created a common experimental foundation before real data was fully ready.

What Synthetic Data Did (and Didn’t) Solve

Synthetic data was an accelerator, not a replacement for real data.

It let us:

  • start ML experiments earlier,
  • benchmark model architectures,
  • explore dataset schemas,
  • test class balance and difficulty settings,
  • and support teams that otherwise would have had to wait

But it also has limitations.

The biggest one is that graph edges are still inferred, not directly validated against large-scale real campaign outcomes. We verified obvious cases, but many of the more ambiguous relationships remain assumptions generated from LLM reasoning rather than empirical evidence.

References

Van Can, A. T., Aydemir, F. B., & Dalpiaz, F. (2025). One size does not fit all: On the role of batch size in classifying requirements with LLMs. In Proceedings of the 2025 IEEE 33rd International Requirements Engineering Conference Workshops (REW 2025) (pp. 30–39). IEEE.

Tam, Z. R., Wu, C.-K., Tsai, Y.-L., Lin, C.-Y., Lee, H.-Y., & Chen, Y.-N. (2024). Let me speak freely? A study on the impact of format restrictions on performance of large language models. arXiv:2408.02442. https://doi.org/10.48550/arXiv.2408.02442

Delahaye, D., Chaimatanan, S., & Mongeau, M. (2019). Simulated annealing: From basics to applications. In M. Gendreau & J.-Y. Potvin (Eds.), Handbook of Metaheuristics (Vol. 272, pp. 1–35). Springer. https://doi.org/10.1007/978-3-319-91086-4_1

Authors

  • Rafaela is a Senior Data Engineer and Architect at Satalia and part of the WPP Research team, where she builds the data foundations that enable AI solutions to perform at scale. With 8 years in data systems and a background in Control Systems Engineering, she has worked with multiple clients across retail, mining, metallurgy, marketing, and telecom, covering data lakehouse architectures, data governance, and end-to-end system design. She also co-hosts Entre Chaves, a Brazilian software development podcast.

  • Ted co-leads WPP Research and serves as Head of Data Science at Satalia, co-founder of Conscium, and Assistant Professor in the Department of Marketing and Communication at the Athens University of Economics and Business. His research spans scalable algorithms for multimodal data, synthetic data generation, simulation-based verification for AI agents, and information diffusion and collective intelligence in expert networks. He publishes regularly in top-tier computer science and business venues.
