Author: Rafaela Milagres Moreira

  • Synthetic Dataset Generation Pod


    We build synthetic data infrastructure for marketing machine learning. Our work focuses on encoding marketing compatibility knowledge into signed graphs. We use large language models (LLMs) to capture which combinations of brands, audiences, platforms, and geographies are expected to perform well together or conflict. This knowledge is then turned into controlled, labelled campaign performance datasets at scale. The result is a system designed to unblock model training and experimentation before real campaign data is available, giving teams precise control over the data.

    For an easier-to-understand overview of this report, please see our non-technical version here.

    1. Motivation

    Marketing Machine Learning (ML) models require large, labelled datasets that connect campaign attributes — such as brand, audience, platform, and geography — to measurable performance outcomes. Real data, while the ultimate source of ground truth, is often sparse, constrained by the distributions of what has already been observed, and requires significant effort to bring into an AI-ready format.

    Synthetic data serves as a powerful accelerator in the meantime, offering precise control over distributions of performance labels, dataset sizes, noise levels, and coverage across multiple scenarios and marketing domains. This flexibility makes it possible to train more robust models and iterate faster, without being dependent on the availability of real campaign data.

    2. Problem Definition

    Random attribute combinations are not useful unless they reflect the real world, so the challenge is to generate synthetic data that is grounded in marketing knowledge. For example, a dataset that pairs a luxury brand with a budget-sensitive audience and labels it as high-performing will actively mislead any model trained on it. So any dataset generator must be built on a structured and reliable source of marketing knowledge.

    Building synthetic data therefore requires two things to exist first:

    1. Marketing knowledge encoding: there is no readily available structured representation of how marketing attributes interact with one another. We need a way to capture which combinations of brands, platforms, audiences, and locations are expected to perform well together and which are likely to conflict across the full combinatorial space a model needs to learn from.
    2. Synthetic dataset generation: given a reliable source of marketing performance knowledge, we need a mechanism to generate diverse, valid campaign configurations, each combining multiple attribute values, and assign them realistic performance labels at scale.

    This project addresses both by implementing a service capable of leveraging large language models to encode marketing compatibility knowledge into a structured format, and a second service that consumes that source to generate realistic, labelled campaign performance datasets at scale.

    2.1 Proposed Solution

    Signed graph as marketing knowledge source

    The structured format chosen to represent marketing compatibility knowledge is a signed graph (Figure 1). Each node represents a specific marketing attribute value (a brand, a platform, an audience interest, or a location), and each edge carries a sign encoding the nature of their relationship. A positive edge means the two marketing attributes are expected to work well together in a campaign; a negative edge means they are likely to conflict. The result is a web of marketing compatibility knowledge that grows richer as more attributes are added.
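The structure described above can be sketched as plain records. This is a minimal illustration, not the service's actual data model; field names mirror the input and output schemas described later in this report, and the attribute values are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    id: str
    value: str     # e.g. "Nike"
    modality: str  # "brand", "audience", "platform", or "geo"

@dataclass(frozen=True)
class SignedEdge:
    source: str  # node id
    target: str  # node id
    rel: str     # e.g. "audience_to_brand"
    sign: str    # "+" = expected to perform well together, "-" = conflict

# A signed graph is simply nodes plus signed edges between them.
nodes = [
    Node("b1", "Nike", "brand"),
    Node("b2", "Rolex", "brand"),
    Node("a1", "Sports Enthusiasts", "audience"),
    Node("a2", "Budget Shoppers", "audience"),
]
edges = [
    SignedEdge("a1", "b1", "audience_to_brand", "+"),
    SignedEdge("a2", "b2", "audience_to_brand", "-"),
]
positive = [e for e in edges if e.sign == "+"]
```

Extending the graph is then just appending new nodes and edges, which is what makes the representation easy to grow incrementally.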

    Figure 1 – Graph example

    This structure has been chosen because it can naturally capture the many-to-many nature of marketing relationships. It is also easy to extend: adding a new brand or platform simply means adding new nodes and connections. Finally, the graph is interpretable, meaning it can be visualised, inspected, and validated by domain experts, making it easier to audit and refine.

    Synthetic dataset generation

    The second challenge is converting this marketing knowledge graph into synthetic data that a model can actually train on. Generating data from a graph is non-trivial: the combinatorial space of possible attribute combinations is too large to enumerate exhaustively, and random sampling would not yield the controlled, structured scenarios or label distributions required for effective model development.

    The proposed dataset generator addresses this challenge by providing a configurable service that, given a graph as input, samples attribute combinations subject to defined structural constraints, assigns performance labels grounded in the graph’s compatibility signals, and produces versioned, labelled datasets at scale for model training.


    3. The Computational Bottleneck in Pairwise Graph Generation

    The most straightforward approach to building a signed graph is exhaustive pairwise classification: prompting an LLM for every possible pair of attribute values and recording the result as an edge. The challenge lies in how quickly the number of pairs grows.

    For a graph covering 50 brands, 100 audience interests, 10 platforms, and 20 geographic locations, exhaustive pairwise classification already requires approximately 14,500 LLM calls. This number grows quadratically with the size of the attribute space: at 200 brands, 200 interests, 20 platforms, and 50 locations, the call count exceeds 100,000.

    This makes exhaustive classification too expensive for routine use and too slow to support iterative development. It also raises a key question: if exhaustive evaluation is infeasible, how should the pairs to classify be selected? Addressing this limitation became a central Q1 technical goal: identifying a graph generation strategy that preserves the quality of LLM-based edge classification while remaining scalable.


    4. Graph Generation Strategies Evaluated

    4.1 Implementation A – Batched Brute-force Pair Classification

    In the brute-force approach, all possible combinations of attributes are generated, and the LLM is asked to classify the expected performance relationship for each pair. This strategy provides broad coverage because every pair is evaluated. However, it introduces two important issues.

    First, the prompt implicitly frames each pair as something worth relating, which can introduce framing bias. As a result, pairs that likely have no meaningful relationship may still receive a weak or speculative classification, making the generated graph overly dense.

    Second, research has shown that constrained output formats, such as forcing the model to answer with +, -, or 0, systematically suppress LLMs’ natural reasoning processes and force premature commitment to answers (Tam et al., 2024; Yu et al., 2025).

    To reduce the number of LLM calls, batch prompting was introduced so that multiple attribute pairs could be classified in a single request. While this improves efficiency, it may also amplify the second issue.

    Potential issues and challenges:

    • Overly dense graphs with many spurious or low-confidence edges.
    • Computational cost explosion due to quadratic scaling of LLM calls.
    • Increased hallucinations from batch classifications and structured output.

    4.2 Implementation B – Cluster-Based Pre-filtering

    This implementation uses a hierarchical strategy to reduce LLM scoring calls without sacrificing edge quality. Attributes are embedded and clustered per modality using HDBSCAN (with optional UMAP reduction for larger datasets), then an LLM scores only cross-modality cluster pairs rather than all attribute pairs. Cluster pairs falling within a neutral band are discarded, and attribute-level edges are considered only from surviving cluster combinations, reducing the evaluation space from O(n²) attribute pairs to O(k²) cluster pairs, where k is smaller than n.
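The neutral-band filter on cluster pairs can be sketched as follows. This is an illustration, not the service's code; cluster names are hypothetical, and the thresholds match the 0–10 scale and cutoffs described in the configuration later in this report:

```python
def filter_cluster_pairs(scores, pos_thresh=7.0, neg_thresh=4.0):
    """Keep cluster pairs outside the neutral band (neg_thresh, pos_thresh).

    scores maps (cluster_a, cluster_b) -> 0-10 compatibility score;
    returns surviving pair -> polarity sign.
    """
    kept = {}
    for pair, score in scores.items():
        if score >= pos_thresh:
            kept[pair] = "+"
        elif score <= neg_thresh:
            kept[pair] = "-"
        # scores inside the neutral band are discarded entirely
    return kept

scores = {
    ("luxury_fashion", "high_income_shoppers"): 8.5,
    ("luxury_fashion", "budget_shoppers"): 1.5,
    ("sportswear", "business_software"): 5.2,  # neutral -> dropped
}
survivors = filter_cluster_pairs(scores)
```

Only attribute pairs inside surviving cluster pairs reach the attribute-level LLM classification step, which is where the O(n²)-to-O(k²) reduction comes from.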

    Potential issues and challenges:

    • Sensitivity to embedding and clustering quality.
    • The threshold defining the neutral band is a manual parameter. Set too aggressively, it discards semantically meaningful cluster pairs; set too loosely, it recovers the density problem of Implementation A.
    • Increased hallucinations from batch classifications and structured output.

    5. System Overview

    The pipeline consists of two services operating in sequence, which will eventually be merged into a single service. The Graph Generator Service takes a versioned list of marketing attributes and produces a signed graph encoding compatibility relationships between them. The Synthetic Dataset Generator then consumes that graph to produce labelled campaign performance datasets ready for model training.

    5.1 Graph Generator Service

    The edges generator receives a versioned list of attributes as input and produces a CSV file representing signed relationships between attributes.

    The input schema includes fields such as:

    • id (unique identifier)
    • value (e.g. Nike, Fitness Enthusiasts, Instagram)
    • modality (e.g. brand, audience, platform, geo)
    • attribute_type (e.g. interest, age, country, city)
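A hypothetical catalogue fragment matching this schema, parsed with the standard csv module; the rows, ids, and attribute types here are illustrative, not taken from the production catalogue:

```python
import csv
import io

# Illustrative attribute catalogue rows matching the input schema.
catalogue_csv = """id,value,modality,attribute_type
b1,Nike,brand,company
a1,Fitness Enthusiasts,audience,interest
p1,Instagram,platform,social_network
g1,Brazil,geo,country
"""
rows = list(csv.DictReader(io.StringIO(catalogue_csv)))
```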

    The output is a signed edge file in CSV format, where each row represents a relationship between two attributes and an inferred performance signal.

    The output schema includes the following fields:

    • source: the origin attribute in the relationship
    • target: the related attribute connected to the source
    • rel: the relationship type, typically indicating the modalities involved in the connection (for example, audience_to_brand)
    • sign: the inferred polarity of the relationship, where + indicates a positive association and – indicates a negative association.
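A hypothetical fragment of the signed edge file with this schema; the rows are illustrative:

```python
import csv
import io

# Illustrative rows of the signed edge output file.
edges_csv = """source,target,rel,sign
Fitness Enthusiasts,Nike,audience_to_brand,+
Budget Shoppers,Rolex,audience_to_brand,-
"""
edges = list(csv.DictReader(io.StringIO(edges_csv)))
```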

    General Configuration

    Both implementations share a fixed set of parameters to ensure that observed differences in graph quality are attributable to the generation strategy rather than configuration.

    Parameter                        Value
    LLM model                        gemini-3.1-flash-lite-preview
    Embedding model (when applied)   gemini-embedding-2-preview
    Temperature                      0.15
    Max attributes per batch call    25
    Concurrent LLM workers           7

    Temperature is set to 0.15 (near-deterministic but not fully greedy) to reduce output variance across runs while preserving enough flexibility for the model to handle pairings without defaulting mechanically to the same answer.

    The batch size of 25 attributes per call balances context window efficiency against classification degradation: larger batches for quality classification tasks reduce LLM calls but increase the risk of the model losing focus or producing inconsistent signs across a long list of targets (Van Can et al., 2025).
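The batching itself amounts to simple chunking of the classification targets; this helper is a sketch, not the service's actual API:

```python
def chunk(items, size=25):
    """Split a list of classification targets into batches of `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]

targets = [f"attr_{i}" for i in range(60)]
batches = chunk(targets)  # 60 targets -> 3 LLM calls instead of 60
```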

    5.1.1 Brute force approach pipeline

    The brute-force pipeline enumerated all valid cross-modality attribute pairs from the catalogue and submitted each to gemini-3.1-flash-lite-preview for sign classification. To reduce the total number of API calls, pairs were grouped into batches rather than submitted individually. Each pair was assigned one of three labels: positive (+), negative (-), or neutral (0). Neutral pairs were discarded from the output to simulate real data sparsity.

    Prompting strategy

    The attribute-level classification prompt defines labels in terms of real campaign behaviour. A positive label (+) requires an obvious and natural strategic fit, strong audience overlap, clear brand alignment, and a combination that demonstrably drives higher ROI. A negative label (-) applies when the combination creates friction, wastes ad spend, targets conflicting demographics, or causes brand dissonance. Neutral (0) covers pairs with no meaningful performance signal, standard baseline performance, or combinations that are simply unrelated.

    To reduce speculative classifications, the prompt includes an explicit evaluation mindset guardrail: the model is instructed not to invent creative edge cases to justify a pairing, to treat any combination requiring unconventional reasoning as a mismatch, and to default to neutral when uncertain. This guardrail is critical in a batch classification setting, where the model might otherwise be biased toward generating signal on every pair it evaluates. Full prompt templates are provided in Appendix A.

    Figure 2 – Brute-force flow diagram

    5.1.2 Cluster-based approach pipeline

    The clustering pipeline operated on the same attribute catalogue and followed a four-stage process.

    1. Embedding: attributes were embedded using the latest Gemini embedding model (gemini-embedding-2-preview)
    2. Per-modality clustering: attributes were clustered independently per modality using HDBSCAN, preceded by UMAP dimensionality reduction for modalities with 20 or more attributes, and PCA as a fallback for smaller groups above the dimensionality threshold. Noise points were reassigned to the nearest cluster centroid and flagged to be handled later in the attribute-level expansion.
    3. Cluster pairs batch scoring: each cluster was labelled by the LLM and cross-modality cluster pairs were scored on a 0–10 compatibility scale. Pairs scoring above 7.0 were classified as positive, below 4.0 as negative, and pairs in the neutral band were discarded, reducing the attribute-pair evaluation space from O(n²) to the subset contained within surviving cluster combinations.
    4. Attribute-level edge expansion: for each surviving cluster pair, attribute-level edges were inferred through batched LLM classification. A shortcut was applied to reduce LLM calls: cluster pairs with extreme scores (above 9.0 or below 2.0) had their attribute-level edges auto-assigned directly from the cluster polarity, bypassing LLM classification entirely. However, noise-flagged attributes were excluded from this shortcut and routed back to LLM calls. This avoids propagating unreliable assignments.
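The auto-inherit shortcut in step 4 can be sketched as follows. The thresholds mirror the configuration reported for this implementation; the function and variable names are illustrative, not the pipeline's actual code:

```python
def expand_cluster_pair(score, attr_pairs, noise_flags, hi=8.0, lo=3.0):
    """Return (auto_assigned_edges, pairs_still_needing_llm).

    score: the 0-10 compatibility score of the cluster pair.
    attr_pairs: attribute-level pairs contained in the cluster pair.
    noise_flags: attribute -> True if flagged as clustering noise.
    """
    auto, needs_llm = [], []
    if score >= hi or score <= lo:
        sign = "+" if score >= hi else "-"
        for a, b in attr_pairs:
            if noise_flags.get(a) or noise_flags.get(b):
                needs_llm.append((a, b))   # unreliable -> route back to LLM
            else:
                auto.append((a, b, sign))  # inherit the cluster polarity
    else:
        needs_llm = list(attr_pairs)       # non-extreme -> batched LLM call
    return auto, needs_llm

pairs = [("Dior", "High Income"), ("Chanel", "High Income")]
auto, needs_llm = expand_cluster_pair(9.2, pairs, {"Chanel": True})
```

The noise-flag branch is what prevents unreliable cluster assignments from silently propagating down to attribute-level edges.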

    Prompting strategy

    Implementation B uses four distinct prompt types. A cluster labelling prompt assigns each cluster a short label and summary, used as input to the next step. A cluster compatibility prompt scores cross-modality cluster pairs on a 0-10 numeric scale, enabling the neutral band filter and extreme score shortcuts that reduce downstream LLM calls.

    The attribute-level classification prompt is shared with Implementation A and uses the same guard-rails. Sharing prompts across both implementations keeps the final edge classification consistent, making the two approaches as comparable as possible given the non-deterministic nature of LLM outputs. Full prompt templates are provided in Appendix B.

    Configuration

    Parameter                          Value
    Clustering algorithm               HDBSCAN (Euclidean)
    Dimensionality reduction           UMAP (≥ 20 attributes), PCA (< 20)
    UMAP output dimensions             15
    HDBSCAN min cluster size           2
    UMAP distance metric               Cosine
    Min attributes to cluster          15
    Embedding normalisation            L2
    Positive cluster threshold         ≥ 7.0
    Negative cluster threshold         ≤ 4.0
    Neutral band (discarded)           4.0 – 7.0
    Auto-inherit positive threshold*   ≥ 8.0
    Auto-inherit negative threshold*   ≤ 3.0

    *The auto-inherit threshold must be tuned according to how well the attribute catalogue clusters. When clustering quality is high and cluster boundaries are semantically meaningful, a looser threshold allows more attributes to inherit their cluster-level scores with confidence. When clustering is poor or produces noisy assignments, a stricter threshold is necessary to avoid propagating unreliable scores down to the attribute level.

    Figure 3 – Cluster-based implementation flow diagram

    5.2 Synthetic Dataset Generator

    The synthetic dataset generator is an API-driven service that creates synthetic marketing campaign records from a signed graph and a set of configurable sampling constraints. It is designed to support multiple data science teams and use cases by allowing users to generate tailored datasets from a shared graph input while keeping the process reproducible through versioned inputs and outputs.

    Operational Architecture

    The generator runs as a Cloud Run Job triggered asynchronously by a dedicated API service, also deployed on Cloud Run. The API handles request validation, stores versioned inputs and configurations in Cloud Storage Buckets, and exposes endpoints to monitor job status and retrieve results.

    Separating request handling from sampling execution means long-running jobs do not block the API layer, and both services scale independently. All inputs, configurations, and outputs are versioned and persisted in the cloud, making every run fully reproducible.

    Input

    The service takes as input a versioned edges.csv file and a generation configuration defining attribute-type ranges, sample counts per label, and target positive and negative edge-fraction bounds.

    Sampling Algorithm

    The core of the generator is a parallelised simulated annealing sampler. Given a signed graph and a set of constraints as input, the sampler starts from a random subgraph and iteratively proposes node swaps, accepting or rejecting each swap based on how well the resulting subgraph satisfies the hard constraints. Worse solutions can be accepted with a probability that decreases over time, controlled by a temperature schedule. This allows the search to escape local optima early in the run and converge toward valid solutions as temperature drops.
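A minimal sketch of the acceptance rule and a geometric temperature schedule. A toy one-dimensional cost stands in for the real constraint score on subgraphs, and the parameter names follow the sampler configuration described below; this is an illustration under those assumptions, not the service's implementation:

```python
import math
import random

def anneal(cost, initial, propose, max_iters=1000,
           start_temp=1.0, end_temp=0.01, seed=0):
    """Minimise cost() by iteratively proposing swaps, annealing-style."""
    rng = random.Random(seed)
    state = best = initial
    for i in range(max_iters):
        # geometric temperature schedule from start_temp down to end_temp
        temp = start_temp * (end_temp / start_temp) ** (i / max_iters)
        candidate = propose(state, rng)
        delta = cost(candidate) - cost(state)
        # accept improvements always; worse moves with probability e^(-delta/T)
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            state = candidate
        if cost(state) < cost(best):
            best = state
    return best

# Toy usage: the "subgraph" is a single number and the cost is its
# distance to a target, standing in for constraint violation.
result = anneal(cost=lambda x: abs(x - 5),
                initial=50,
                propose=lambda x, rng: x + rng.choice([-1, 1]))
```

Early in the run the high temperature lets the search accept worse states and escape local optima; as the temperature decays, acceptance of worse moves becomes vanishingly rare and the search converges.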

    Each sample is generated independently with its own random seed, and all samples within a label class are produced in parallel across CPU workers, making generation time roughly constant regardless of the number of samples requested.

    Configuration Parameters

    Sampling parameters are divided into hard and soft constraints. Hard constraints are strictly enforced: if they are too strict or mutually incompatible, generation fails. Soft constraints are optimisation objectives: the sampler maximises or minimises them during the search but does not block generation if they are not fully met.

    The service supports the following parameters:

    1. Type ranges (hard constraint): define how many nodes of each attribute type must appear in each sample.

    For example, setting geo location to [1, 2] and audience to [2, 4] means every generated record will contain one or two geos and between two and four audience attributes.

    2. Label fraction ranges (hard constraint): define the distribution of the samples across different performance labels.

    In a real marketing dataset, campaign performance metrics are classified into discrete labels or classes such as positive, average, and negative. The synthetic generator mirrors this structure: each label is assigned its own target range for positive and negative edge fractions, which the sampler must satisfy when constructing each record.

    Two parameters define the edge composition of each class:

    • pos_frac_range: the acceptable interval for the fraction of positive edges in the sampled subgraph.
    • neg_frac_range: the acceptable interval for the fraction of negative edges in the sampled subgraph.

    A third parameter, num_samples, sets how many records to generate for that class, which directly controls class balance in the final dataset.

    For example, a positive class configured with pos_frac_range: [0.2, 1.0] and neg_frac_range: [0.0, 0.3] will only accept subgraphs where at least 20% of edges are positive and no more than 30% are negative. A negative class does the opposite, requiring low positive and high negative fractions to represent structurally conflicting combinations. An average class targets overlapping mid-range fractions for both, capturing ambiguous or mixed-signal configurations.

    Example of fraction ranges:

    "label_params": {
      "pos": {
        "pos_frac_range": [0.2, 1.0],
        "neg_frac_range": [0.0, 0.3],
        "num_samples": 430
      },
      "neg": {
        "pos_frac_range": [0.0, 0.2],
        "neg_frac_range": [0.2, 1.0],
        "num_samples": 430
      },
      "avg": {
        "pos_frac_range": [0.10, 0.50],
        "neg_frac_range": [0.10, 0.50],
        "num_samples": 440
      }
    }
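The hard-constraint acceptance check implied by this configuration can be sketched as follows; the `satisfies` helper is illustrative, not the service's API, and the bounds mirror the JSON example above:

```python
# Illustrative per-label hard bounds, mirroring the JSON example above.
label_params = {
    "pos": {"pos_frac_range": (0.2, 1.0), "neg_frac_range": (0.0, 0.3)},
    "neg": {"pos_frac_range": (0.0, 0.2), "neg_frac_range": (0.2, 1.0)},
    "avg": {"pos_frac_range": (0.10, 0.50), "neg_frac_range": (0.10, 0.50)},
}

def satisfies(label, pos_frac, neg_frac, params=label_params):
    """True if a sampled subgraph's edge fractions fit the label's bounds."""
    p_lo, p_hi = params[label]["pos_frac_range"]
    n_lo, n_hi = params[label]["neg_frac_range"]
    return p_lo <= pos_frac <= p_hi and n_lo <= neg_frac <= n_hi

ok = satisfies("pos", pos_frac=0.45, neg_frac=0.10)
bad = satisfies("pos", pos_frac=0.45, neg_frac=0.40)  # too many negatives
```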

    3. High and low-priority type pairs (soft constraint): rules to favour or discourage specific samples.

    High-priority type pairs instruct the sampler to favour subgraphs where specific modality combinations, such as brand-to-audience or platform-to-geo, are well connected. This is useful when a downstream model needs to learn from dense signals between particular attribute types.

    Low-priority type pairs do the opposite by discouraging connections between selected modalities, allowing the generator to simulate scenarios where certain attribute combinations are structurally sparse or irrelevant.

    4. Simulated annealing sampler parameters: control the behaviour of the sampler algorithm itself.

    • max_iters sets the total iteration budget per sample, trading off generation speed against solution quality
    • start_temp and end_temp define the annealing temperature schedule, controlling how aggressively the algorithm explores versus exploits candidate subgraphs early and late in the search
    • patience sets how many iterations without improvement trigger an early stop, avoiding wasted compute when a good solution has already been found
    • proposals_per_step controls how many node swap candidates are evaluated at each iteration, allowing the search to cover more of the graph space per step

    Taken together, these controls mean the same generator and the same signed graph can produce datasets with very different statistical profiles: varying class balance, edge density, modality coverage, and noise levels. Different datasets can be built simply by changing the configuration, without rebuilding any part of the pipeline.

    Output and Evaluation

    After sampling, the service evaluates the generated samples against the requested edge-fraction targets. The current evaluation includes:

    • average positive and negative edge distribution
    • average deviation from target ranges
    • in-range rate, i.e. the proportion of generated samples that satisfy the requested bounds

    The final output is written as a versioned JSONL file, where each line represents one synthetic row of a marketing campaign performance dataset. Each record contains the sampled attributes grouped by type and a label field indicating the intended performance class.


    6. Methodology

    6.1 Signed Graph Generation

    The evaluation of the graph generation pipeline is organized around four complementary angles: computational efficiency, sign consistency, controlled correctness, and dataset generation feasibility.

    Computational efficiency

    The primary comparison between Implementation A and Implementation B is conducted on the same versioned attribute catalogue. The following metrics are recorded for each run:

    • total LLM calls made
    • total edges generated
    • LLM calls per final edge
    • total execution time
    • sign distribution of the output graph (positive, negative, neutral)

    This measures how much of the quadratic cost is recovered by the cluster pre-filter strategy without sacrificing edge coverage.

    Sign consistency

    For all attribute pairs that appear in both implementations’ outputs, the sign assigned by each approach is compared directly.

    The agreement rate across overlapping pairs serves as a proxy for signal reliability: if two independent generation strategies (one exhaustive, one cluster-filtered) converge on the same sign for a pair, that convergence is evidence the edge reflects a real compatibility signal rather than LLM noise. Disagreements are analysed by modality pairs to identify which attribute combinations produce the most inconsistent classifications.

    Controlled correctness check

    A small auxiliary attribute catalogue is constructed manually, containing three categories of pairs:

    • Obviously positive: pairings with clear and unambiguous marketing alignment (e.g. Nike + Sports Enthusiasts, Sephora + Beauty & Personal Care)
    • Obviously negative: pairings with clear brand or audience conflict (e.g. Luxury brand + Budget shoppers)
    • Ambiguous: pairings with no strong prior, where reasonable disagreement is expected

    Graph density and sampling feasibility

    A graph that is semantically correct but structurally too sparse cannot support constrained dataset generation. This angle evaluates, by running a diagnostic script, whether the generated graph is dense enough to serve as a valid input to the synthetic dataset generator under realistic label configurations.

    6.2 Synthetic Dataset Generation

    Unlike the graph generator, the synthetic dataset generator is not itself the subject of a controlled experiment. Its correctness is structural: given a valid signed graph and a feasible set of constraints, it either produces samples that satisfy the requested edge-fraction bounds or it does not. The meaningful question is therefore not whether the generator works, but what it made possible — and whether the datasets it produced were usable enough to drive real modelling work.

    Diagnosing the generated synthetic dataset

    Per-label sample quality was assessed using three aggregate metrics:

    • Average edge distribution: the mean positive and negative edge fractions across generated samples;
    • Average deviation: the mean distance between the observed edge fractions and the nearest boundary of the requested target ranges;
    • In-range rate: the proportion of generated samples whose positive and negative edge fractions simultaneously fall within the configured bounds.

    Together, these measurements characterise whether the sampler is producing structurally valid datasets for a given graph and parameterisation, and serve as the primary signal for deciding whether a generation run is fit for downstream use.
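These three diagnostics can be computed directly from the observed edge fractions. The sketch below handles one side (positive fractions); the negative side is symmetric, and the sample numbers are illustrative:

```python
def diagnostics(fracs, lo, hi):
    """Mean fraction, mean deviation from the target range, in-range rate."""
    n = len(fracs)
    mean = sum(fracs) / n
    # distance to the nearest range boundary; zero when inside the range
    avg_dev = sum(max(lo - f, f - hi, 0.0) for f in fracs) / n
    in_range = sum(lo <= f <= hi for f in fracs) / n
    return mean, avg_dev, in_range

# Illustrative positive-edge fractions from four samples, target [0.2, 1.0].
mean, avg_dev, in_range_rate = diagnostics([0.25, 0.40, 0.15, 0.30],
                                           lo=0.2, hi=1.0)
```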


    7. Results — Graph Generator

    7.1 Small Catalogue

    Brute force approach

    The brute force run against the 27-node small catalogue (10 brands, 10 audiences, 4 countries and 3 platforms) produced 191 valid edges from 252 possible cross-modality pairs in 54.1 seconds across 54 LLM calls.

    LLM signal           Distribution
    Positive edges       147 (77% of kept edges)
    Negative edges       44 (23% of kept edges)
    Neutral / filtered   61 (24% of all pairs)

    Graph quality

    Inspecting the output confirms the graph encodes meaningful marketing compatibility signals. A few representative examples across edge types:

    Obvious positives — clear strategic fit correctly assigned +:

    • Athletes → Nike (audience → brand)
    • Winter Sports Enthusiasts → The North Face (audience → brand)
    • Entrepreneurs → Salesforce (audience → brand)
    • K-Pop Fans → TikTok (audience → platform)
    • C-Suite Executives → LinkedIn (audience → platform)
    • Itau → Brazil (brand → geo)

    Obvious negatives — clear conflict or mismatch correctly assigned -:

    • Budget Shoppers → Rolex (audience → brand)
    • Salesforce → K-Pop Fans (audience → brand)
    • Winter Sports Enthusiasts → Brazil (audience → geo)
    • Rolex → Wish (brand → platform)
    • Salesforce → TikTok (brand → platform)

    Neutral / filtered — average or no meaningful signal correctly discarded:

    • Women 18-34 → all geos
    • Fitness Enthusiasts → Dior (audience → brand)
    • Beauty Enthusiasts → Red Bull (audience → brand)
    • Dior → TikTok (brand → platform)

    Cluster-based approach

    The cluster-based run completed in under 20 seconds with approximately 40 LLM calls — a 26% reduction versus brute force — while producing a graph of comparable negative-edge quality.

    Metric           Brute Force   Cluster-Based
    LLM calls        54            40
    Positive edges   147           125
    Negative edges   44            41
    Total edges      191           166

    Clusters found in this experiment:

    • Luxury Fashion Houses → Dior, Chanel
    • Global Lifestyle Brands → The North Face, Havaianas, Vans, Nike, Salesforce (outlier not correctly assigned by HDBSCAN)

    Small catalogue constraints

    The 26% reduction in LLM calls is modest and reflects a fundamental limitation of applying the cluster-based approach to small catalogues. With few attributes per modality, the embedding space is too sparse for HDBSCAN to form meaningful dense regions, producing singleton or near-singleton clusters that compress the attribute space very little. As a result, most savings come from the neutral band filter rather than cluster compression.

    With only 10 attributes per modality, the catalogue sits at the lower boundary of what density-based clustering can reliably resolve. At this scale the geometric structure of the embedding space is fragile, so cluster assignments may be driven more by embedding noise than by genuine semantic similarity.

    In the limiting case where every cluster in a 10-attribute modality is a singleton, the cluster compatibility scoring step evaluates exactly as many pairs as brute-force classification would.

    7.2 Evaluating LLM-Efficiency for a Larger Catalogue

    Experiment setup

    The large catalogue experiment ran both implementations against a versioned catalogue of 160 attributes: 60 brands, 60 audiences, 10 platforms, and 30 countries. This scale produces a cross-modality pair space of 8700 possible pairs, making brute-force classification substantially more expensive than in the small catalogue case and providing a more meaningful surface for evaluating cluster-based compression.

    Both runs used identical configuration: gemini-3.1-flash-lite-preview at temperature 0.15, batches of 25 attributes per LLM call, and 7 concurrent workers.

    Computational Efficiency

    Metric                    Brute-force   Cluster-Based   Δ
    Batch size per LLM call   25            25
    Total LLM calls           570           265             −53%
    Total execution time      553 s         274 s           −50.5%
    Total edges generated     5,131         4,649           −9.4%
    Edges per LLM call        ~7.67         ~17.5           +128%
    Positive edges            4,372         4,092
    Negative edges            759           557

    The cluster-based approach cut both LLM call volume and wall-clock time by approximately half while retaining 90.6% of total output edge volume. The improvement in edges per LLM call — from 7.67 to 17.5 — directly reflects the mechanism at work: by routing calls only toward attribute pairs that survive the cluster-pair pre-filter, each batch operates in a denser region of the relevance space rather than sweeping uniformly across the full pair surface.

    Agreement metrics between the cluster-based vs brute-force graph

    Metric                                    Value
    Common edges (present in both outputs)    3,476 of 5,131 (67.7% of brute-force)
    Sign agreement on common edges            3,406 of 3,476 (98.0%)
    Sign mismatches                           70
    Edges missing from cluster-based output   1,655
    Edges unique to cluster-based output      1,173
    Edge recall vs. brute-force               67.7%
    Sign precision on recovered edges         98.0%

    Where both approaches generate an edge for the same attribute pair, they agree on its polarity in 98.0% of cases. This is the most important quality signal in the comparison: it confirms that the cluster-based approach does not systematically distort relationship sign. The gap between the two outputs is almost entirely a coverage gap and not a correctness gap. The 32.3% of brute-force edges absent from the cluster-based output (1,655 pairs) represent pairs that the pre-filter chose not to route to the LLM, not pairs that were scored incorrectly.

    The 1,173 edges unique to the cluster-based graph represent pairs that the cluster approach evaluated and retained, but which brute-force scored as neutral and discarded. This is a mild over-generation effect: the cluster-pair pre-scoring occasionally surfaces regions of the pair space that brute-force ultimately considers uninformative.

    Scaling projection for larger catalogues

    Brute-force LLM call volume scales as O(N²). For each pair type we have:

    $$\sum_{(A,B)} |A| \times \left\lceil \frac{|B|}{25} \right\rceil$$

    So doubling the catalogue quadruples the cost. The cluster-based approach breaks this curve: by replacing exhaustive pair evaluation with a cluster-pair pre-scoring step, call growth tracks closer to O(N · √N), since cluster count grows sub-linearly relative to attribute count.
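As a sanity check, the call-count formula above can be evaluated directly from modality sizes. A minimal sketch (hypothetical function name; assumes each unordered cross-modality pair type is scored once, with the first modality as primary):

```python
import math
from itertools import combinations

def brute_force_calls(modality_sizes, batch_size=25):
    """Estimate batched brute-force LLM calls: for each cross-modality
    pair type (A, B), one call per primary attribute in A per batch of
    up to `batch_size` target attributes in B."""
    total = 0
    for a, b in combinations(modality_sizes, 2):
        total += modality_sizes[a] * math.ceil(modality_sizes[b] / batch_size)
    return total

# Hypothetical catalogue shaped like the 160-attribute experiment:
sizes = {"brand": 60, "audience": 60, "platform": 10, "geo": 30}
print(brute_force_calls(sizes))
```

Which modality serves as primary for each pair type changes the exact total, so this sketch may differ slightly from the measured call counts; the quadratic growth pattern is the same either way.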

    Using these as reference points, the projected call volumes for larger catalogues are:

    | Catalogue size | Est. total pairs | Brute-force est. LLM calls | Cluster-based est. LLM calls | Est. savings |
    |---|---|---|---|---|
    | 160 attributes (current) | 8,700 | 570 | 265 | 53% |
    | 320 attributes (2×) | ~34,800 | ~2,280 | ~750 | ~67% |
    | 800 attributes (5×) | ~217,500 | ~14,250 | ~2,960 | ~79% |
    | 1,600 attributes (10×) | ~870,000 | ~57,000 | ~8,400 | ~85% |

    7.3 Graph Density Analysis

    Before feeding the attribute graph into the dataset generator, a diagnostic pass measures the empirical distribution of edge types across randomly sampled subgraphs. This step is necessary because the Synthetic Dataset Generator needs explicit density constraints grounded in the actual structure of the graph rather than assumed values.

    This density evaluation randomly samples subgraphs using the same type quotas expected by the Synthetic Dataset Generator but with no edge-fraction constraints, then measures the fractions of positive, negative, and absent edges that the graph’s natural structure actually produces.
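A minimal version of this diagnostic can be sketched as follows, assuming the graph is represented as sets of frozenset node pairs per sign; the production version samples by per-type quotas rather than uniformly:

```python
import random
from itertools import combinations

def density_diagnostic(nodes, pos_edges, neg_edges,
                       subgraph_size=8, n_samples=1000, seed=0):
    """Sample random subgraphs and measure the natural fractions of
    positive, negative, and absent edges among all node pairs."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_samples):
        sample = rng.sample(nodes, subgraph_size)
        pairs = [frozenset(p) for p in combinations(sample, 2)]
        pos = sum(p in pos_edges for p in pairs)
        neg = sum(p in neg_edges for p in pairs)
        total = len(pairs)
        fractions.append((pos / total, neg / total, (total - pos - neg) / total))
    mean = [sum(f[i] for f in fractions) / n_samples for i in range(3)]
    return {"pos_frac": mean[0], "neg_frac": mean[1], "absent_frac": mean[2]}
```

Percentiles (p5, p25, and so on) follow the same pattern by sorting each fraction column instead of averaging it.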

    Results over the large catalogue graph, sampling 4,000 subgraphs (160 nodes, 4,092 positive edges, 557 negative edges), are:

    | Metric | Mean | p5 | p25 | p50 | p75 | p95 |
    |---|---|---|---|---|---|---|
    | pos_frac | 0.363 | 0.156 | 0.267 | 0.333 | 0.444 | 0.600 |
    | neg_frac | 0.036 | 0.000 | 0.000 | 0.022 | 0.056 | 0.133 |
    | absent_frac | 0.602 | 0.333 | 0.524 | 0.611 | 0.700 | 0.810 |

    The distribution reflects a structurally asymmetric graph: in a typical subgraph, roughly 36% of node pairs carry a positive edge, 60% carry no edge, and only 3.6% carry a negative edge.

    Based on the observed percentiles, the diagnostic suggests the following label_params for the three training label classes:

    pos label:  pos_frac_range: [0.33, 1.0],   neg_frac_range: [0.0,  0.02]
    neg label:  pos_frac_range: [0.0,  0.33],   neg_frac_range: [0.02, 1.0]
    avg label:  pos_frac_range: [0.27, 0.44],   neg_frac_range: [0.00, 0.06]
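The suggested bounds appear to follow a simple percentile rule: the medians split the pos and neg classes, and the interquartile range bounds the avg class. A hypothetical helper making that rule explicit (not necessarily the diagnostic's actual implementation):

```python
def suggest_label_params(pos_pct, neg_pct):
    """Derive label bounds from diagnostic percentiles: the median
    splits pos vs neg, the interquartile range defines avg."""
    return {
        "pos": {"pos_frac_range": [pos_pct["p50"], 1.0],
                "neg_frac_range": [0.0, neg_pct["p50"]]},
        "neg": {"pos_frac_range": [0.0, pos_pct["p50"]],
                "neg_frac_range": [neg_pct["p50"], 1.0]},
        "avg": {"pos_frac_range": [pos_pct["p25"], pos_pct["p75"]],
                "neg_frac_range": [neg_pct["p25"], neg_pct["p75"]]},
    }
```

Feeding in the percentiles measured above reproduces, after rounding, the ranges listed; the real-case diagnostic later in the report is consistent with the same rule.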

    8. Results — Synthetic Dataset Generator

    Building a reliable signed graph is non-trivial, and it is precisely the problem the graph generator in this study was designed to solve. Before this service was available, two graphs were assembled: an initial one built with direct LLM assistance, and a later iteration grounded in both LLM outputs and real data. Run against both, the generator produced 49 valid datasets. This section demonstrates one representative example end-to-end.

    8.1 Impact and Downstream Use

    With AI-ready real campaign data only becoming available late in the quarter, and still subject to significant coverage limitations, synthetic data was the primary available input for model training and architecture evaluation. The multiple datasets generated across both graphs unblocked multiple experiments that would otherwise have had to wait for real data to be ready.

    The following teams and experiments consumed datasets produced by the generator, each linking to the corresponding technical report where results are documented in detail:

    Campaign Performance Prediction Pod

    • Dataset: Two versions of V27 (13k and 90k rows) — schema: brand, interest-and-other-attributes, platform, geo, campaign objective, device, gender-generation, media-buy attributes
    • Use: Training and benchmarking ML models, spanning both traditional and deep learning architectures, to classify campaign performance as positive, average, or negative before any real budget is committed

    Multimodal Federated Learning Pod

    • Dataset: Multiple versions of V28 (baseline, inverse pairs, middle pairs, and different noise configurations) — schema: brand, audience, creative, platform, geo
    • Use: Benchmarking federated learning against centralised training for multimodal campaign outcome classification, across varying numbers of clients, noise levels, and cross-modal relationship complexity; synthetic data was essential here as raw campaign data cannot be shared across organisational boundaries by design

    LLM Finetuning Pod

    • Dataset: V15 (easy, imbalanced), V16 (medium, imbalanced), V17 (hard, imbalanced), V25–V26 (balanced) — schema: brand, audience, creative, platform, geo.
    • Use: Applying Google AlphaEvolve’s evolutionary search framework to automatically discover superior neural architectures, loss functions, and hyperparameters for the campaign performance prediction model, using the internal base model as seed — across synthetic datasets of varying difficulty levels and real data as final validation.

    Model leaderboard

    To consolidate results across the multiple teams and experiments consuming synthetic datasets, a shared Model Leaderboard UI was built as a Streamlit application. The app ingests evaluation logs produced by every model execution and surfaces them in a unified, filterable interface, ranked by evaluation score.

    The dataset version selector on the left panel allows direct comparison across synthetic dataset generations, making it straightforward to assess how model performance evolved as dataset quality, difficulty, and schema improved over time. This made it possible to track progress across teams working independently, identify which architectures and training configurations performed best on a given dataset version, and establish a common benchmark reference point without requiring any manual coordination between experiments.

    Figure 4 – Model Leaderboard UI

    8.2 Real Use Case Example

    Evaluating graph edge density to choose the fraction range constraints

    Following the diagnostic procedure described in Section 6.1, a density analysis was conducted on the purpose-built graph comprising 746 nodes, 36,420 positive edges, and 40,629 negative edges. Across 10,000 randomly sampled subgraphs drawn using the prescribed type quotas and no edge-fraction constraints, the following distribution was observed:

    | Metric | Mean | p5 | p25 | p50 | p75 | p95 |
    |---|---|---|---|---|---|---|
    | pos_frac | 0.175 | 0.090 | 0.133 | 0.170 | 0.210 | 0.275 |
    | neg_frac | 0.210 | 0.116 | 0.167 | 0.206 | 0.248 | 0.319 |
    | absent_frac | 0.615 | 0.513 | 0.582 | 0.625 | 0.654 | 0.690 |

    The results show a moderately sparse graph (roughly 61% of node pairs carry no edge), where positive and negative edge densities are naturally close to each other. This leaves limited room to position label boundaries without overlap. If the pos and neg label constraints overlap significantly, the generator produces structurally indistinguishable samples, undermining the dataset’s discriminative value.

    Based on the observed percentiles, the following label_params were derived:

    pos label:  pos_frac_range: [0.17, 1.0],   neg_frac_range: [0.0,  0.21]
    neg label:  pos_frac_range: [0.0,  0.17],   neg_frac_range: [0.21, 1.0]
    avg label:  pos_frac_range: [0.13, 0.21],   neg_frac_range: [0.17, 0.25]
    

    Input configuration

    | Parameter | Value |
    |---|---|
    | Version | V27_eirini-13 |
    | Brand quota | 1 |
    | Campaign-objective quota | 1 |
    | Device quota | 1 |
    | Gender-generation quota | 3–6 |
    | Geo quota | 1–3 |
    | Interest-and-other-attributes quota | 1–5 |
    | Media-buy-billing-event quota | 0–1 |
    | Brand-safety-content-filter-levels quota | 0–2 |
    | Campaign-buying-type quota | 0–1 |
    | Media-buy-cost-model quota | 0–1 |
    | Platform quota | 1 |
    | Total samples requested | 13,000 |
    | Samples per label class | 4,300 pos / 4,300 neg / 4,400 avg |
    | Simulated annealing: max iterations | 10,000 |
    | Simulated annealing: start temperature | 1.0 |
    | Simulated annealing: end temperature | 0.002 |
    | Simulated annealing: patience | 5,000 |

    The label bounds used in this run were set slightly wider than the diagnostic-suggested baseline to provide the sampler with more feasible solution space, particularly for the avg class where the diagnostic revealed a narrow overlap region between positive and negative fractions:

    pos label:  pos_frac_range: [0.2, 1.0],    neg_frac_range: [0.0, 0.3]
    neg label:  pos_frac_range: [0.0, 0.2],    neg_frac_range: [0.2, 1.0]
    avg label:  pos_frac_range: [0.1, 0.5],    neg_frac_range: [0.1, 0.5]
    

    Output quality

    | Label | Samples generated | Avg pos_frac | Avg neg_frac | Avg deviation | In-range rate |
    |---|---|---|---|---|---|
    | pos | 4,300 | 0.225 | 0.159 | 0.0 | 100% |
    | neg | 4,300 | 0.143 | 0.240 | 0.0 | 100% |
    | avg | 4,400 | 0.178 | 0.204 | 0.0 | 100% |

    The generation run achieved a 100% in-range rate across all three label classes, with zero average deviation from the requested bounds. Every one of the 13,000 generated samples satisfies its configured edge-fraction constraints, confirming that the calibrated label bounds derived from the density diagnostic were both feasible and well-matched to the graph’s structural properties. The sampler converged cleanly across all label classes, with no samples requiring rejection or resampling. A representative generated sample (label pos) is shown below:

    {
      "brand": ["aperol"],
      "campaign-objective": ["Video views"],
      "campaign-buying-type": ["Auction"],
      "platform": ["audience network rewarded video"],
      "device": ["mobile web iphone"],
      "geo": ["Costa Rica", "Gambia"],
      "gender-generation": [
        "Baby Boomers",
        "Gen X",
        "Gen Z",
        "Male",
        "Millennials",
        "Gen Alpha"
      ],
      "interest-and-other-attributes": [
        "Entrepreneurship and Sales",
        "Lingerie and Intimate Wear",
        "Luxury German Automobiles",
        "Sci-Fi and Space Exploration",
        "Weddings and Marriage"
      ],
      "media-buy-billing-event": ["Thruplay"],
      "media-buy-cost-model": ["Automatic objective"],
      "label": "pos"
    }
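The in-range rate reported above reduces to a per-sample fraction test. A sketch (hypothetical representation: the sample's attribute values are flattened into a list and their pairs looked up in the signed graph):

```python
from itertools import combinations

def sample_in_range(attributes, pos_edges, neg_edges, pos_range, neg_range):
    """Check whether a generated sample's positive and negative edge
    fractions fall inside its label's configured ranges."""
    pairs = [frozenset(p) for p in combinations(attributes, 2)]
    pos_frac = sum(p in pos_edges for p in pairs) / len(pairs)
    neg_frac = sum(p in neg_edges for p in pairs) / len(pairs)
    return (pos_range[0] <= pos_frac <= pos_range[1]
            and neg_range[0] <= neg_frac <= neg_range[1])
```

A sample counts toward the in-range rate only if both fraction checks pass for its label class.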

    9. Limitations and Open Issues

    The current system has several limitations that should be addressed in future work:

    Lack of empirical validation for generated edges

    The graph’s edges are assumed to encode reliable marketing compatibility signals, but this assumption is never empirically tested against real campaign outcomes. The controlled correctness check validates only a small set of manually curated obvious cases, covering clear positives and clear negatives. This confirms the LLM can handle unambiguous pairings, but says nothing about the accuracy of the thousands of borderline edges that make up the bulk of the graph.

    Graph over-connectivity in brute-force

    The brute-force approach frames every attribute pair as worth relating, introducing framing bias and producing denser graphs than the real compatibility space warrants. Low-confidence edges that a more selective approach would discard pass through classification, potentially propagating spurious marketing relationships into downstream datasets and models.

    HDBSCAN sensitivity

    The cluster-based approach is sensitive to embedding space density and HDBSCAN parameterisation. At small catalogue sizes, sparse embedding spaces produce fragile cluster geometries that may reflect noise rather than genuine semantic similarity. These parameters are not currently exposed as configurable inputs, but this can be addressed in future implementations.

    Synthetic dataset generator requires a dense graph

    The sampler depends on the input graph having sufficient positive and negative edges to satisfy configured fraction bounds. Sparse graphs (particularly in negative edges) directly constrain the feasible label configuration space and can cause generation to fail or degrade.


    10. Recommendations for Next Quarters

    Productionise the cluster-based approach with safe self-service access

    The cluster-based implementation has not yet been integrated into the production API. Productionising it should be accompanied by the guardrails necessary for safe self-service use: per-run call budgets, pre-flight cost estimation from catalogue size, input validation, and retry logic. Both steps should be treated as a single delivery, since opening the service to other teams without usage controls risks runaway API consumption regardless of which generation strategy is exposed.

    Alongside this, a set of currently hardcoded parameters should be exposed as service configuration. Clustering behaviour, such as HDBSCAN minimum cluster size and UMAP output dimensions, directly affects edge quality and should be tunable for teams with atypical catalogue structures. LLM customisation, such as model selection, temperature, prompts, and batch size, should also be enabled, allowing teams to trade cost against output quality based on their use case.

    Run the graph generator against the production catalogue

    The 49 datasets documented in this report were generated from manually assembled graphs. The next step is running the graph generator against the production attribute catalogue, which would constitute the first end-to-end test of the full system under real operational conditions.

    Grounded graph generation with a Knowledge Augmented Generation framework

    Rather than relying solely on the LLM’s pre-trained knowledge to score attribute pairs, a KAG framework would construct a structured knowledge base from real sources. This knowledge base would be exposed to the LLM through a context management layer at classification time, enabling grounded decisions. When scoring a pair like Brand X ↔ Audience Y, the LLM would receive retrieved context from multiple grounded sources: historical campaign performance, expert rules and market considerations, and BAV survey data capturing empirical brand-audience affinities. This directly addresses the most critical pipeline limitation: the lack of empirical validation for generated edges.

    Open source the synthetic data pipeline

    Open sourcing the pipeline, or at a minimum the graph generation framework, would invite external contributions, surface edge cases from diverse catalogue structures, and position the team as contributors to the broader ML and marketing intelligence community.


    11. References

    Van Can, A. T., Aydemir, F. B., & Dalpiaz, F. (2025). One size does not fit all: On the role of batch size in classifying requirements with LLMs. In Proceedings of the 2025 IEEE 33rd International Requirements Engineering Conference Workshops (REW 2025) (pp. 30–39). IEEE.

    Tam, Z. R., Wu, C.-K., Tsai, Y.-L., Lin, C.-Y., Lee, H.-Y., & Chen, Y.-N. (2024). Let me speak freely? A study on the impact of format restrictions on performance of large language models. arXiv:2408.02442. https://doi.org/10.48550/arXiv.2408.02442

    Delahaye, D., Chaimatanan, S., & Mongeau, M. (2019). Simulated annealing: From basics to applications. In M. Gendreau & J.-Y. Potvin (Eds.), Handbook of Metaheuristics (Vol. 272, pp. 1–35). Springer. https://doi.org/10.1007/978-3-319-91086-4_1


    Appendix A – Prompts for brute-force approach

    _CLASSIFICATION_SCALE = """CLASSIFICATION SCALE:
    
    - NEGATIVE SIGNAL (mismatch): The combination creates friction, wastes ad spend, targets conflicting demographics, or causes brand dissonance. 
    0 NEUTRAL or NO SIGNAL: No meaningful impact, standard baseline performance, or completely unrelated.
    + POSITIVE SIGNAL (synergy): The combination clearly enhances the campaign, has strong audience overlap, and drives higher ROI.
    
    EVALUATION MINDSET: 
    Do not invent creative edge cases to make a pairing work. 
    If a pairing requires mental gymnastics or an unconventional strategy to succeed, it is a mismatch (-). 
    Assign (+) only when the strategic fit is obvious and natural.
    Default to (0) if unsure.
    """
    BATCH_CLASSIFICATION_PROMPT = (
    "You are a social media campaign expert.\n\n"
    "TASK INTRODUCTION\n"
    "You will evaluate how a primary marketing campaign attribute relates to "
    "a set of other attributes from different categories.\n\n"
    "PRIMARY ATTRIBUTE: \"{primary_attr}\"\n"
    "CATEGORY: {primary_mod}\n\n"
    "For each numbered attribute below, decide whether pairing it with the primary attribute "
    "in a marketing campaign creates synergy (+), a mismatch (-), or neither (0).\n\n"
    "ATTRIBUTES TO EVALUATE:\n"
    "{attr_list}\n\n"
    + _CLASSIFICATION_SCALE
    + "\n\n"
    "RESPONSE FORMAT:\n"
    "Return ONLY a JSON object mapping each attribute's BRACKET NUMBER to its sign.\n"
    "Use the numbers shown in [brackets] before each attribute. Do NOT use attribute names as keys.\n\n"
    "Example — if attributes are \"[1] Attribute-B\", \"[2] Attribute-C\", \"[3] Attribute-D\":\n"
    "{{\"1\": \"+\", \"2\": \"-\", \"3\": \"0\"}}\n\n"
    "JSON OUTPUT:"
    )
    

    Appendix B – Prompts for cluster-based approach

    ENRICHMENT_PROMPTS = {
    "Brand": (
        "Describe {value} across these dimensions, being specific about "
        "what makes it DIFFERENT from other brands. Be concrete and avoid "
        "generic marketing language.\n\n"
        "1. Core identity: what it fundamentally stands for (not a tagline)\n"
        "2. Who would NEVER be its customer (anti-audience)\n"
        "3. Price tier vs direct competitors "
        "   (budget / mid / premium / luxury)\n"
        "4. Cultural associations, lifestyle, or subculture it belongs to\n"
        "5. Any restrictions, controversies, or advertiser safety concerns.\n\n"
        "Answer in 6-7 sentences total."
    ),
    "Audience": (
        "Describe this audience segment for marketing targeting purposes.\n"
        "Be specific and differentiating — assume this description will be used\n"
        "to separate this segment from all other segments in a clustering algorithm.\n\n"
        "Cover these dimensions:\n"
        "1. Demographics & income: age range, income tier, life stage\n"
        "2. Purchase behavior: price sensitivity, brand loyalty, research depth\n"
        "3. Core motivation: what fundamentally drives their purchases?\n"
        "4. Digital behavior: which platforms they actively use vs avoid\n"
        "5. What they are NOT: name 2 audience types they are commonly confused\n"
        "   with, and explain precisely what distinguishes them\n\n"
        "The last point is critical — explicitly contrast this segment\n"
        "against its nearest neighbours.\n\n"
        "Audience segment: {value}"
    ),
    "Geo": (
        "Describe {value} for digital advertising targeting purposes.\n"
        "Be balanced and specific — describe the FULL population, not just "
        "the most attractive or dominant segment.\n\n"
        "Cover these dimensions:\n"
        "1. Income distribution: what proportion of the online population is "
        "   budget/price-sensitive, mid-income, premium, and high-income? "
        "   Name specific segments that exist (e.g. deal-hunters, middle class, "
        "   luxury consumers) — do not flatten this into a single tier.\n"
        "2. Platform landscape: which platforms dominate, which are absent or "
        "   blocked, and which are growing.\n"
        "3. Consumer attitudes: cultural attitudes toward advertising, brand "
        "   trust, and price sensitivity across different segments.\n"
        "4. Language and regional fragmentation: languages spoken, regional "
        "   differences in purchasing behaviour.\n"
        "5. Regulatory or content restrictions advertisers face here.\n\n"
        "Answer in 5-6 sentences total. Do not focus only on high-value "
        "consumers — the description must reflect the full spectrum."
    ),
    "Platform": (
        "Describe {value} across these dimensions, focusing on what makes it "
        "DISTINCT from other digital platforms for advertisers.\n"
        "- Core user demographic (age, income, intent)\n"
        "- Content format and consumption behaviour (passive scroll vs active search)\n"
        "- Price tier of content that performs well: explicitly mention whether "
        "  budget/dupe/deal content thrives alongside premium content, or if the "
        "  platform skews exclusively toward one end.\n"
        "- Ad formats available and their typical performance characteristics\n"
        "- Categories of brands that perform well vs categories that underperform\n"
        "- Brand safety profile and content moderation reputation\n"
        "Answer in 4-5 sentences total."
    ),
    }
    ENRICHMENT_FALLBACK_PROMPT = (
    "Describe '{value}' (type: {attribute_type}) focusing on what makes it DISTINCT "
    "from similar entities: who it targets, who it excludes, price positioning, "
    "cultural associations, and any restrictions or controversies. "
    "Be concrete and avoid generic marketing language. Answer in 4-5 sentences."
    )
    CLUSTER_LABEL_PROMPT = """
    You are a marketing strategy expert.
    
    The following {modality} attributes were grouped together based on semantic similarity:
    {members}
    
    Provide in JSON with keys "label" (2-4 words), "summary" (one sentence),
    "characteristics" (array of exactly 3 strings).
    """
    CLUSTER_COMPAT_BATCH_PROMPT = """You are a media planning expert. Score how well a source cluster performs with each target cluster in an ad campaign.
    
    SOURCE — {modality_a}: {label_a}
    {desc_a}
    Members: {members_a}
    
    TARGET CLUSTERS:
    {targets}
    
    For each cluster target, return a compatibility score (0–10) or null if the combination is implausible, prohibited, or has no shared business context.
    - 0–3 : Negative — friction, conflicting demographics, brand dissonance
    - 4–6 : Neutral  — baseline / no meaningful impact
    - 7–10: Positive — clear synergy, strong audience overlap, higher ROI (only assign 10 if you believe all attributes in the target cluster would perform well with the source cluster)
    - null : Not a viable campaign combination (not the same as a low score)
    
    RESPONSE FORMAT:
    Return ONLY a JSON object mapping each cluster's BRACKET NUMBER to its compatibility score (integer 0–10) or null.
    Use the numbers shown in [brackets] before each cluster. Do NOT use cluster names as keys.
    
    Example — if clusters are "[1] Luxury Fashion Houses", "[2] Mass Market Brands", "[3] Sports Apparel":
    {{"1": 8, "2": 3, "3": null}}
    JSON OUTPUT:
    """
    
  • Using Synthetic Data to Train and Stress-Test Marketing Machine Learning Models

    Unlocking machine learning experiments across multiple teams with a synthetic data pipeline grounded in marketing knowledge

    Training Machine Learning (ML) models for marketing usually starts with a hard requirement: labelled data that links campaign settings and attributes to actual performance outcomes. You collect campaigns, look at what combinations of brand, audience, platform, and geography performed well, and train a model to learn from those patterns.

    In theory, that sounds straightforward, but in practice, real data is hard to clean and structure, arrives slowly, takes time to accumulate and only reflects combinations you’ve already run. If you’ve never targeted a certain audience on a certain platform for a certain brand, that example simply doesn’t exist in the dataset. And if multiple teams are waiting on that data before they can even begin experimenting, progress stalls fast.

    We ran into exactly that problem.

    We needed a way to start training and benchmarking marketing ML systems before AI-ready real campaign data was available at a useful scale. So instead of waiting for the data, we built a synthetic data pipeline that could generate realistic, labelled training data grounded in how marketing actually works.

    That pipeline ended up unblocking model experiments across multiple teams.

    The Problem With Random Synthetic Data

    Real campaign data is essentially rows of campaign attributes (brand, audience, location, platform, placement, creative, and many more) each labelled with how that combination performed. That’s what a model learns from.

    This kind of data is easy to fake badly. You can always create random combinations of attributes and assign them labels. But for marketing, random is worse than useless if it ignores real-world compatibility. A luxury brand paired with bargain-hunting audiences, or a B2B enterprise software brand matched with a fashion lifestyle platform, doesn’t help an ML model learn. It teaches the wrong lessons.

    So the challenge wasn’t just “generate fake data”. It was:

    1. Capture that marketing knowledge in a structured, machine-readable form
    2. Use that structure to generate realistic campaign configurations at scale

    What we needed was a structured way to encode that compatibility: given any combination of campaign settings, does it make sense or not?

    Encoding Marketing Knowledge as a Graph

    We chose a versatile structure: a graph.

    In a marketing knowledge graph:

    • Nodes represent attribute values for different modalities, such as brand, audience, platform, country, and any other factor that can influence the outcome of a campaign.
    • Edges represent compatibility between two attributes:
      • A positive edge (+) means the pair is expected to work well together within the same campaign.
      • A negative edge (-) means the pair is a bad fit, likely to damage the cohesiveness and the performance of the campaign.
      • No edge means there’s no meaningful signal. A neutral fit.

    That gives us a machine-readable map of marketing relationships.

    Some simple examples:

    • LinkedIn ↔ C-Suite Executives → positive
    • Luxury brand ↔ Budget shoppers → negative
    • Salesforce ↔ TikTok → negative
    • Adidas ↔ K-pop fans → positive

    This structure worked well for three reasons:

    • It naturally captures many-to-many relationships
    • It’s easy to extend with new brands, audiences, and platforms
    • It’s interpretable enough for humans to inspect and validate

    Once you have that graph, you can start generating synthetic campaign examples that are constrained by actual compatibility signals instead of randomness.

    The Bottleneck: Building the Graph was Expensive

    The obvious way to build this graph was to leverage the capabilities of Large Language Models to classify every possible pair of attributes from a catalogue of brands, audiences, geographies, and other marketing settings of interest.

    That approach can work for small catalogues, such as 20 brands, 50 audiences, 10 countries, and 5 platforms. But those are not especially useful in practice, since ML models need data that is both diverse and high-volume.

    As the catalogue grows, pairwise combinations quickly become a bottleneck. Even a moderately sized catalogue creates thousands of cross-modality pairs, and the number of possible pairs grows quadratically with the number of attributes. That made a brute-force approach too slow and too expensive for routine iteration. Even with batched calls (one primary attribute compared against a list of target attributes), the volume remained prohibitive.

    So we needed a way to build the graph without evaluating the entire space of possible combinations.

    But that creates an obvious dilemma: how do you find the important pairs without first checking them all?

    Two Ways We Approached Graph Generation

    To answer that question, we implemented and compared two graph generation strategies.

    1. Batched brute-force pair classification

    A truly naive strategy would have been to ask the LLM about every single attribute pair one by one, but we did not test that because it is clearly too inefficient to be practical.

    Instead, for each valid cross-modality combination, we selected one primary attribute and asked the LLM to classify its relationship to a batch of up to 25 target attributes as positive, negative, or neutral.

    The batch size of 25 was chosen deliberately:

    Prior work shows that batch size affects LLM classification quality: larger batches are more efficient, but can reduce consistency across judgments. We therefore set the batch size as a practical trade-off between efficiency and quality.

    This gave us a strong reference point: broad coverage with a simple implementation, useful for evaluating whether a more efficient method could preserve similar graph quality without the same cost.
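The mechanics of the batching are straightforward; a minimal sketch (hypothetical helper names) of splitting targets into 25-attribute batches and numbering them the way the prompt's response format expects:

```python
def build_batches(targets, batch_size=25):
    """Split target attributes into batches of up to `batch_size`."""
    return [targets[i:i + batch_size] for i in range(0, len(targets), batch_size)]

def format_attr_list(batch):
    """Number each attribute with the [bracket] index the LLM is asked
    to use as JSON keys in its response."""
    return "\n".join(f"[{i}] {attr}" for i, attr in enumerate(batch, start=1))
```

Keying the response on bracket numbers rather than attribute names keeps parsing robust when attribute strings contain punctuation or get paraphrased by the model.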

    2. Cluster-first graph generation

    The second approach was designed to reduce the search space before asking the LLM to score anything.

    Instead of classifying every attribute pair directly, we first:

    • embedded the attributes and applied UMAP for dimensionality reduction,
    • clustered them by modality using HDBSCAN,
    • asked the LLM to batch score compatibility between clusters,
    • discarded neutral cluster pairs and their attribute combinations,
    • automatically assigned scores to attribute combinations derived from high-confidence cluster pairs,
    • and asked the LLM to batch classify only the remaining attribute pairs.

    This turned a very large search space into a much smaller one, so the LLM spent time only where useful signals were more likely to exist.

    For small catalogues, the efficiency gains are smaller because many attributes end up as singleton clusters, but the same architecture still applies.
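The triage step in the middle of this pipeline can be sketched as follows. The thresholds are illustrative assumptions: the cluster-scoring prompt treats 4–6 as neutral and null as non-viable, but what counts as "high-confidence" here is our own choice for the sketch:

```python
def route_pairs(cluster_pair_scores, high_conf=9):
    """Triage cluster-pair scores (0-10 or None) into auto-assigned
    edges, discarded pairs, and pairs needing attribute-level scoring."""
    auto, discard, to_llm = [], [], []
    for (src, tgt), score in cluster_pair_scores.items():
        if score is None or 4 <= score <= 6:
            discard.append((src, tgt))        # implausible or neutral
        elif score >= high_conf or score <= 1:
            auto.append((src, tgt, "+" if score >= high_conf else "-"))
        else:
            to_llm.append((src, tgt))         # borderline: attribute-level pass
    return auto, discard, to_llm
```

Only the `to_llm` bucket generates further LLM calls, which is where the call-volume savings come from.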

    What Happened When We Compared Them

    On a larger catalogue of 160 attributes — 60 brands, 60 audiences, 10 platforms, and 30 countries — the cluster-based approach performed much better operationally.

    Compared with brute force, it delivered:

    • 53% fewer LLM calls
    • 50.5% less execution time
    • 90.6% of the total edge volume retained

    More importantly, where both methods produced an edge for the same pair, they agreed on the sign 98% of the time. This shows that the cluster-based approach is not systematically changing the meaning of the relationships it recovered.

    The main trade-off was coverage: some pairs found by brute force were filtered out before attribute-level scoring, likely around lower-signal or more borderline cases.

    In practice, this gave us a much cheaper way to generate the graph while preserving the compatibility signal that mattered most.

    The scaling advantage becomes even clearer when projected to larger catalogues:

    | Catalogue size | Total pairs* | Brute-force LLM calls* | Cluster-based LLM calls* |
    |---|---|---|---|
    | 160 attributes | 8,700 | 570 | 265 |
    | 320 attributes (2×) | ~34,800 | ~2,280 | ~750 |
    | 800 attributes (5×) | ~217,500 | ~14,250 | ~2,960 |
    | 1,600 attributes (10×) | ~870,000 | ~57,000 | ~8,400 |

    * These are directional estimates extrapolated from the 160-attribute experiment. Actual call volumes will vary with catalogue structure, clustering behavior, and graph densities.

    From Graph to Actual Training Data

    Once we had a signed graph, the next problem was turning it into an actual labelled campaign performance dataset.

    Each row in this dataset represents one synthetic campaign configuration (a combination of attributes drawn from the graph) along with a performance label: pos, neg, or avg. That label is the training target. It describes whether the overall campaign combination is expected to perform well, underperform, or land somewhere in between.

    Important note: the label is not the same as a graph edge. Edges score pairs of attributes; the label scores the whole configuration, aggregated across the signs of all its edges.

    Figure 1 – Example row from the campaign performance dataset

    This dataset is the output of the second service in the pipeline: the Synthetic Dataset Generator. Its job is to create synthetic campaign records from the graph while respecting configurable constraints such as:

    • how many attributes of each type should appear in each sample,
    • how many positive, negative, and average examples to produce,
    • and what proportion of positive vs. negative edges each label class should contain.

    For example, a positive sample might require a relatively high fraction of positive edges and a low fraction of negative ones. A negative sample would do the opposite, while an average sample would contain more balanced fractions of both.
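    A hypothetical configuration for such a run might look like the following (all field names and values are illustrative assumptions, not the generator's actual API):

```python
# Illustrative generator configuration: how many attributes of each type per
# sample, how many samples per label class, and what fraction of positive vs.
# negative edges each class should contain. Names are hypothetical.
sample_config = {
    "attributes_per_type": {"brand": 1, "audience": 1, "platform": 1, "geo": 2},
    "label_counts": {"pos": 5_000, "neg": 5_000, "avg": 2_000},
    "edge_fraction_targets": {
        "pos": {"positive_edges": 0.80, "negative_edges": 0.10},
        "neg": {"positive_edges": 0.10, "negative_edges": 0.80},
        "avg": {"positive_edges": 0.45, "negative_edges": 0.45},
    },
}
```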

    That gave us complete control of the dataset. The same graph could generate multiple datasets (with different class balances, difficulty levels, noise profiles, and schemas), just by changing configuration, not rebuilding the pipeline.

    Simulated annealing: searching the graph efficiently

    To efficiently find valid attribute combinations for each dataset row, we used a parallelized simulated-annealing sampler. The name comes from annealing in metallurgy, a process in which a material is heated and then cooled in a controlled way to reduce defects and settle into a more stable structure.

    Our algorithm follows the same idea. It starts in a “hot” state, exploring many possible campaign configurations and even accepting imperfect ones early on. As it cools, it becomes more selective, swapping attributes in and out until each sample settles into a configuration that satisfies the requested constraints.
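    The idea can be sketched in a few lines, assuming a signed graph stored as a mapping from attribute pairs to +1/-1 edge signs. This is a minimal single-threaded illustration, not the parallelized production sampler:

```python
import math
import random

def anneal_sample(attrs, edge_sign, size, target_pos_frac,
                  steps=2_000, t0=1.0, cooling=0.995, seed=0):
    """Search for `size` attributes whose fraction of positive edges
    approaches `target_pos_frac`. `edge_sign` maps frozenset({a, b}) -> +1/-1."""
    rng = random.Random(seed)

    def cost(sample):
        pairs = [(a, b) for i, a in enumerate(sample) for b in sample[i + 1:]]
        signs = [edge_sign[frozenset(p)] for p in pairs if frozenset(p) in edge_sign]
        if not signs:
            return 1.0  # no known edges: worst cost
        pos_frac = sum(s > 0 for s in signs) / len(signs)
        return abs(pos_frac - target_pos_frac)

    current = rng.sample(attrs, size)
    best, best_cost, t = list(current), cost(current), t0
    for _ in range(steps):
        # Propose a neighbour: swap one attribute for one outside the sample.
        candidate = list(current)
        candidate[rng.randrange(size)] = rng.choice(
            [a for a in attrs if a not in current])
        delta = cost(candidate) - cost(current)
        # Accept improvements always; accept worse moves with a probability
        # that shrinks as the temperature cools.
        if delta < 0 or rng.random() < math.exp(-delta / max(t, 1e-9)):
            current = candidate
        if cost(current) < best_cost:
            best, best_cost = list(current), cost(current)
        t *= cooling  # cool down: become more selective over time
    return best, best_cost
```

    Early on, the high temperature lets the sampler escape poor local configurations; as it cools, only swaps that move the sample toward the requested edge-fraction targets survive.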

    Downstream Impact and ML Experiments Unlocked

    This service was not just a technical exercise. Its purpose was to unblock machine learning workstreams while real campaign data was still limited, not ready, or missing key combinations. Without it, multiple experiments would have been blocked.

    The Synthetic Dataset Generator produced 49 synthetic datasets, built from multiple graph versions and configurations. Those datasets were used to both train and stress-test models across different teams and modelling approaches. Each dataset varied in class balance, difficulty, and noise to probe how models behaved under pressure. Experiments included:

    • Campaign performance prediction
    • Federated learning experiments
    • Architecture search and model benchmarking
    • Comparisons between fine-tuned LLMs and custom classifiers

    We also built a shared model leaderboard so teams could compare results across dataset versions and training approaches without manual coordination.

    That created a common experimental foundation before real data was fully ready.

    What Synthetic Data Did (and Didn’t) Solve

    Synthetic data was an accelerator, not a replacement for real data.

    It let us:

    • start ML experiments earlier,
    • benchmark model architectures,
    • explore dataset schemas,
    • test class balance and difficulty settings,
    • and support teams that otherwise would have had to wait

    But it also has several limitations:

    The biggest one is that graph edges are still inferred, not directly validated against large-scale real campaign outcomes. We verified obvious cases, but many of the more ambiguous relationships remain assumptions generated from LLM reasoning rather than empirical evidence.

    References

    Van Can, A. T., Aydemir, F. B., & Dalpiaz, F. (2025). One size does not fit all: On the role of batch size in classifying requirements with LLMs. In Proceedings of the 2025 IEEE 33rd International Requirements Engineering Conference Workshops (REW 2025) (pp. 30–39). IEEE.

    Tam, Z. R., Wu, C.-K., Tsai, Y.-L., Lin, C.-Y., Lee, H.-Y., & Chen, Y.-N. (2024). Let me speak freely? A study on the impact of format restrictions on performance of large language models. arXiv:2408.02442. https://doi.org/10.48550/arXiv.2408.02442

    Delahaye, D., Chaimatanan, S., & Mongeau, M. (2019). Simulated annealing: From basics to applications. In M. Gendreau & J.-Y. Potvin (Eds.), Handbook of Metaheuristics (Vol. 272, pp. 1–35). Springer. https://doi.org/10.1007/978-3-319-91086-4_1

  • Campaign Intelligence Dataset Pod

    Campaign Intelligence Dataset Pod

    Every ad campaign generates data. Many practitioners use this data to answer “how” questions: How did performance trend this week? How did this campaign compare with last quarter? But this data contains something deeper. It captures what makes some campaigns succeed and others fail through the specific combination of elements that define them: the audience, geography, platform, creative, and device. The Campaign Intelligence Dataset is designed to uncover that insight at scale. It collects, enriches, and harmonises campaign data from multiple marketing platforms into a unified, AI-ready schema, while addressing technical challenges and respecting hard client and legal constraints related to data access, isolation, and privacy. The result is a continuously growing asset with millions of data points across all major markets, used to power multiple pods across WPP Research.

    Turning Campaign Data into a Strategic AI Advantage

    1. Motivation: The opportunity hidden in campaign data


    Every ad campaign generates data that, beyond raw performance metrics, encodes something very valuable most organisations overlook: what worked and what didn’t, across audiences, geographies, creatives, and platforms.

    What creative approach resonates best with this audience?

    Which zip codes should a luxury watch brand target?

    How should spend shift when a platform’s algorithm changes?

    These aren’t hypothetical questions. They are answerable, but only if the underlying data is structured to answer them. This is exactly the motivation behind the Campaign Intelligence Dataset that we have built in WPP Research.

    The dataset achieves three things:

    • Collects campaign marketing data across platforms
    • Enriches it with derived fields useful for AI training
    • Harmonises it into a unified, AI-ready schema, while respecting all applicable legal and client constraints related to data access, isolation (as certain datasets can never be mixed with others), and privacy.

    We refer to every input that shapes a campaign setup as a modality: a single dimension that influences whether an ad performs well or poorly. Examples of campaign modalities tracked today include geography, audience, platform, placement, creative, device, and spend.

    A modality combination is one specific configuration drawn from each dimension. A simple example:

    • Geo: UK
    • Audience: Women 25–34
    • Platform & Placement: Instagram feed
    • Creative: 15-second sun-lit outdoor lifestyle video
    • Device: iPhone

    In practice, the values of each modality can vary significantly across campaigns. Geo can be encoded as a country or a granular set of zip codes. Audience can be expressed in terms of simple demographics or be a free-text, highly descriptive paragraph. Creatives can range from simple images and banners to short cinematic-style videos.

    Each combination is also attached to outcome metrics, such as impressions, clicks, conversions, and revenue, that measure how the given combination performed.
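    One way such a record could be shaped in code (the field names here are assumptions for illustration, not the actual schema):

```python
from dataclasses import dataclass

@dataclass
class CampaignRecord:
    """Illustrative shape of one modality combination plus outcome metrics."""
    geo: str
    audience: str
    platform_placement: str
    creative: str
    device: str
    impressions: int = 0
    clicks: int = 0
    conversions: int = 0
    revenue: float = 0.0

    @property
    def ctr(self) -> float:
        """Click-through rate derived from the raw counters."""
        return self.clicks / self.impressions if self.impressions else 0.0

row = CampaignRecord(
    geo="UK",
    audience="Women 25-34",
    platform_placement="Instagram feed",
    creative="15-second sun-lit outdoor lifestyle video",
    device="iPhone",
    impressions=120_000,
    clicks=3_600,
    conversions=240,
)
```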

    2. Challenges: From Access Hurdles to Privacy at Scale


    Enterprise campaign data doesn’t arrive clean, unified, or ready for training. Between raw platform tables and an AI-ready dataset sits a gauntlet of practical problems.

    Getting to the data is the first obstacle. Source data lives behind governance gates. Access requires building trust with stakeholders, making a clear case for the dataset’s value, and operating entirely within approved cloud environments and client security policies. There are no shortcuts here; this is relationship work as much as engineering work.

    Once you have access, finding the signal is its own challenge. Marketing platforms expose dozens of tables with hundreds of fields. Which ones are useful for model training? Which require feature engineering, such as extracting targeting attributes from raw specs, mapping age ranges to generational cohorts, and normalising geography hierarchies? Which joins introduce duplicates? Which campaigns should be excluded entirely? Thankfully, AI has enabled tools that help us tackle such challenges far more efficiently. WPP Research has multiple agentic initiatives to develop and leverage such tools. Examples include:

    • The Data Discovery Agent initiative, focused on autonomously exploring the vast and diverse space of data sources to identify the most relevant data for each task.
    • The Data Quality Assurance Agent project, focused on continuously monitoring the data we compile for both structural and logical quality issues.
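    As a sketch of one enrichment step mentioned above, mapping an age range to a generational cohort (the cohort boundaries and function name here are approximate assumptions, not the production rules):

```python
def cohort_for_age_range(low: int, high: int, year: int = 2025) -> str:
    """Map a targeting age range to a generational cohort label.

    Uses the midpoint of the range to estimate a birth year; cohort
    boundaries are approximate and illustrative.
    """
    mid_birth_year = year - (low + high) // 2
    if mid_birth_year >= 2013:
        return "Gen Alpha"
    if mid_birth_year >= 1997:
        return "Gen Z"
    if mid_birth_year >= 1981:
        return "Millennial"
    if mid_birth_year >= 1965:
        return "Gen X"
    return "Boomer+"
```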

    Finally, there is the tension between scale and data separation. As the dataset matures, the opportunity grows: more platforms, deeper histories, and greater granularity. Yet not all data can be, or is allowed to be, mixed. Our architecture is designed to respect those boundaries fully, applying common standards, schemas, and methodologies across assets while keeping the underlying data appropriately ring-fenced.

    3. How the data flows


    The diagram below illustrates the conceptual stages that a single client’s data passes through, from source platforms to AI-ready output. Each client’s data follows this path independently, within its own isolated environment.

    The pipeline is designed so that each client’s data is processed end-to-end within its own boundary. The shared element is the methodology and engineering, not the data itself. The key stages are:

    1. Data ingestion. Campaign data is pulled from marketing platform APIs (Meta, TikTok, etc.) and routed into a client-specific storage environment.
    2. Isolated client data store. Each client’s data resides in its own segregated environment, ensuring strict separation at every layer. No data is co-mingled or shared across clients.
    3. Transformation and enrichment. A standardised pipeline reads from the client’s own data store, enriches fields, applies cleaning rules, and engineers features. The pipeline runs on a regular cadence.
    4. Campaign Intelligence Dataset. The output: a compliant, harmonised, AI-ready dataset. Clean, structured, and securely ring-fenced.
    5. AI model training. WPP Research pods use the data to train and validate models for various research projects focused on performance forecasting, creative analysis, and more.
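    The per-client isolation described above can be sketched as a storage-layout convention, where every stage reads and writes only inside its own client prefix (bucket names and paths here are hypothetical, not the actual deployment):

```python
def client_paths(client_id: str) -> dict:
    """Illustrative per-client storage layout: every pipeline stage is
    confined to the client's own prefix, so data is never co-mingled."""
    root = f"s3://campaign-intel/{client_id}"  # hypothetical bucket layout
    return {
        "raw": f"{root}/raw",            # platform API exports land here
        "enriched": f"{root}/enriched",  # standardised pipeline output
        "dataset": f"{root}/dataset",    # AI-ready Campaign Intelligence slice
    }
```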

    4. Impact


    • A large and diverse dataset that compounds over time. The data covers an ever-increasing population of platforms, audiences, geographies, creatives, and other modalities, all attached to rich performance metrics.
    • AI-ready from day one. A common harmonised schema design captures key modalities alongside outcome metrics, purpose-built for model training. Multiple WPP Research pods already consume these datasets for forecasting, performance prediction, and creative analysis.
    • Privacy by design, enforced by architecture. Data isolation is not a policy overlay; it is embedded in the pipeline’s design. Access controls, segregation, and governance rules ensure that each client’s data is governed, processed, and analysed independently, while respecting all applicable constraints.

    5. Lessons learned and what’s next


    Building the Campaign Intelligence Dataset surfaced a few principles worth naming:

    • Centralise the methodology, not the data. The pipeline design, schema definitions, and engineering tooling are reused across engagements, but each data asset is processed and stored in complete isolation.
    • Domain knowledge isn’t optional. Understanding the business cases behind the data is what catches the edge cases and quality issues that pure engineering misses.
    • Treat data policy as architecture, not overhead. Vigilance about who can access the dataset, why, and under what conditions is a design constraint baked into every layer of the system.

    With a strong foundation set, the next priorities are:

    • Field enrichment. Continue exploring platform documentation for additional attributes that help understand what drives performance.
    • New platforms. Onboard additional social media and DSP platforms to uncover more nuanced patterns across a broader set of channels.