Author: Emmanouil Kritharakis

  • Training together, sharing nothing: The promise of Federated Learning

    Why Federated Learning now?

    In marketing, data is a competitive edge. The more audience signals, campaign performance data, and consumer behaviour a Machine Learning (ML) model can learn from, the sharper its predictions and the greater its business impact. Across the marketing ecosystem, spanning agencies, brands, and technology partners, organisations individually hold rich and valuable datasets. The potential to learn shared patterns across these assets, without ever exposing proprietary information, could unlock new capabilities: better audience targeting, smarter media spend, and faster creative optimisation at a global scale.

    But here’s the challenge. Across the marketing industry, the most impactful data is inherently distributed. Each organisation, whether agency, brand, or technology partner, holds a unique piece of the data puzzle. Client contracts, privacy regulations like GDPR, and the sheer sensitivity of consumer-level data mean this data has to stay within each organisation’s walls. This is a structural reality of the industry, not a limitation of any single company. The question is whether there is a way to learn collectively from this distributed knowledge without compromising the privacy boundaries that exist for good reason.

    The traditional solution, centralised ML, pools raw data from multiple sources into a single cloud environment to train a global model. But uploading terabytes of sensitive data to a central server creates severe network latency and exposes collaborators to data breaches and potential violations of privacy regulations.

    Distributed ML methods attempted to address this by splitting training across local worker nodes. Whilst this reduces latency and avoids centralising raw data, these architectures were designed for internal computing clusters, not secure collaboration between independent companies. Without cross-organisation coordination, each organisation’s models are limited to what their own data can teach them, with no mechanism to benefit from shared learning.

    Problem: The collaboration vs. privacy bottleneck

    Organisations face a fundamental tension: gaining the benefits of shared learning typically requires centralising sensitive data, which privacy and contractual obligations rightly prevent. Yet without a way to learn collectively, each organisation’s models are limited to their own data alone. Neither traditional architecture offers a path to collaborative model improvement whilst keeping private data strictly where it belongs.

    Federated Learning (FL) offers a way out of this dilemma by bringing the model to the data, rather than the other way around. To understand why this shift matters, let’s look at how FL actually works under the bonnet.


    How Federated Learning works

    Figure 1: Overview of the Federated Learning communication cycle between a central node and distributed client nodes.

    Federated Learning (FL) enables multiple organisations to collaboratively train a shared model without ever centralising raw data. Instead of moving data to the model, FL brings the model to the data. Training proceeds through iterative rounds:

    1. A central server sends the current global model to all participating nodes (Blue arrow).
    2. Each node trains the model locally on its own private data (Green arrow).
    3. Nodes send back only their model updates, never the underlying data (Pink arrow).
    4. The server aggregates these updates into an improved global model and starts the next round.

    Throughout this process, raw data never leaves its source. Only learned model representations are exchanged across the network.
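    The four-step round above can be sketched in plain NumPy. This is a toy sketch, not the production setup: `local_train` is a hypothetical stand-in for each node's real training loop, and model "parameters" are just arrays.

```python
import numpy as np

def local_train(global_params, private_data, lr=0.1):
    """Hypothetical local update: one gradient-like step on private data.
    Stands in for each node's real training procedure."""
    # Toy objective: pull the parameters towards the mean of the node's data.
    grad = global_params - private_data.mean(axis=0)
    return global_params - lr * grad

def federated_round(global_params, node_datasets):
    """One communication round: broadcast, train locally, aggregate."""
    updates, sizes = [], []
    for data in node_datasets:  # step 1: server sends the global model out
        updates.append(local_train(global_params, data))  # step 2: local training
        sizes.append(len(data))  # step 3: only updates (and sample counts) return
    weights = np.array(sizes) / sum(sizes)
    # Step 4: a weighted average of local models becomes the new global model.
    return sum(w * u for w, u in zip(weights, updates))

rng = np.random.default_rng(0)
nodes = [rng.normal(loc=1.0, size=(50, 3)) for _ in range(5)]
params = np.zeros(3)
for _ in range(20):  # iterate rounds; params drift towards the shared signal
    params = federated_round(params, nodes)
```

    No node ever ships its `private_data`; only the post-training parameters and the sample count cross the network, which is the whole point of the protocol.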

    The multimodal challenge

    Whilst the above privacy-preserving framework is valuable in its own right, modern marketing data adds another layer of complexity. Organisations do not just work with spreadsheets and numbers. They work with images, video, text, audio, and structured data, often all at once. A single campaign might involve visual brand assets, ad copy, audience segments, and performance metrics across channels. Training models that can reason across these different data types, known as multimodal learning, is already one of the most demanding challenges in ML.

    Now combine that with the constraints of federated learning. Each client may hold different combinations of modalities, in different formats and volumes. One partner might contribute rich visual data, another mostly text and tabular records. Coordinating a single global model that learns effectively from this fragmented, heterogeneous landscape, without ever seeing the raw data, pushes the problem to a new level of complexity.

    This is precisely what makes the intersection of FL and multimodal learning so important, and so hard. If it can be made to work, it unlocks collaborative intelligence across organisations at a scale that neither approach could achieve alone.


    Our objective: can Federated Learning deliver?

    The promise of FL is compelling, but before investing in real-world deployment, we need to answer a fundamental question:

    Does federated learning actually work well enough on multimodal marketing data to justify the tradeoff?

    Centralised training will always have an inherent advantage: it sees all the data at once. The question is not whether FL can beat centralised performance, but whether it can get close enough to make the privacy and collaboration benefits worthwhile. And beyond raw performance, we need to understand how FL behaves under realistic stress conditions: more partners joining, noisy data, and complex cross-modal relationships.

    To answer this, we designed a series of experiments around four key questions:

    Experiment 1 — Centralised vs. federated performance

    • How close can FL get to centralised performance? In a centralised setup, the model sees all the data at once, the ideal scenario for learning. FL, by design, fragments this data across nodes. The first question is whether this tradeoff costs us meaningful accuracy, or whether FL can match centralised results despite never accessing the full dataset.
    • What happens as more clients join? In practice, a federated network might involve a handful of partners or dozens. As the number of participants grows, each node holds a smaller, potentially less representative slice of the overall data. We tested how model performance scales as we increase the number of nodes.

    Experiment 2 — Resilience to noisy data

    • How robust are centralised and federated models to noisy data? Real-world datasets are messy: labels can be wrongly assigned, and data quality varies across partners. We deliberately introduced noise into the multimodal dataset to simulate these imperfections and measure how much degradation the model can tolerate before performance breaks down.

    Experiment 3 — Cross-modal relationships

    • How sensitive are centralised and federated models to underlying cross-modal patterns? Multimodal models learn by finding connections between different types of data. For example, a luxury brand might target a high-income audience through a premium creative tone on a specific platform. Some of these connections appear frequently in the data, whilst others are rare. We tested whether emphasising the most frequent cross-modal patterns in our synthetic data improves performance compared to emphasising the least frequent ones, helping us understand how much the model benefits from common, naturally occurring relationships versus rare, atypical ones.

    The data

    For our experiments, we used a multimodal synthetic dataset generated by our own well-tested synthetic data generator, designed to mirror real-world marketing dynamics. The generator allows us to customise various elements of the data and design targeted datasets that stress-test our model architecture under controlled conditions, giving us full visibility into the factors that drive campaign performance.

    Each campaign in the dataset is described using five key modalities:

    • Audience – the consumer segment being targeted
    • Brand – the positioning and perception of the brand
    • Creative – the tone and message of the campaign
    • Platform – where the campaign runs
    • Geography – the markets being targeted

    Each sample in the dataset is assigned a target label, Positive (overperforming), Negative (underperforming), or Average (performing as expected), indicating whether that particular combination of modalities would lead to a successful, underperforming, or average campaign outcome.

    Experimental results

    All federated experiments are implemented using Flower, a widely adopted open-source framework for federated learning research and deployment. Flower allows us to simulate multi-client federated setups in a controlled environment, making it possible to rigorously test different configurations before moving to a fully distributed architecture.

    To ensure a fair comparison between centralised and federated setups, we kept the playing field level. Both setups use the exact same model architecture, so any performance differences come from how the model is trained, not what is being trained. In the federated setup, data is split equally across nodes, so that each partner sees a representative sample. This way, when we increase the number of nodes, any change in performance can be attributed to the scaling itself, not to differences in what each node’s data looks like.

    Experiment 1: How much accuracy do we trade for privacy?

    Figure 2: Impact of increasing node fragmentation on Federated Learning performance. Performance clearly degrades as the number of nodes increases from 5 to 15, compared to the baseline centralised model version.

    The centralised model sets the performance ceiling at 79.67%. This is expected: when a single model has direct access to all the data at once, it has the best possible conditions to learn. No information is lost to partitioning, and no coordination overhead is introduced. It’s the ideal scenario, and the benchmark everything else is measured against.

    The federated results tell a clear story: as we add more nodes, performance gradually declines. With 5 nodes, the model reaches 76.23%, a modest drop from the centralised baseline. But as we scale to 10 and then 15 nodes, scores fall to 70.29% and 67.65% respectively. The same pattern holds across all metrics, with the sharpest drops in the model’s ability to correctly identify both positive and negative cases.

    Why does this happen? As more nodes join, the total dataset gets divided into smaller slices. Each node sees less data, which means each node’s local training produces a less reliable picture of the overall patterns. When the server combines these local updates, the differences between them make it harder to converge on a strong global model, an effect we call the “aggregation penalty.”

    Lesson learned: FL with 5 nodes comes remarkably close to centralised performance, showing that federated collaboration is viable with minimal accuracy loss. However, as the number of nodes grows, it becomes progressively harder for the global model to match centralised results.

    Experiment 2: How robust are centralised and federated models to noisy data? 

    In practice, marketing data is never perfectly clean. Campaign outcomes don’t fall neatly into “this worked” or “this didn’t.” Was a campaign that slightly exceeded expectations truly a success, or just average? Was a modest underperformance a failure, or noise in the measurement? Different teams may label the same outcome differently, tracking systems introduce inconsistencies, and the line between a “positive” and “average” campaign is often blurry.

    To simulate this reality, we deliberately introduced noise into our synthetic dataset by blurring the boundaries between performance classes. With no noise, the labels are clean — positive, negative, and neutral outcomes are clearly separated. As we increase the noise level from low, to medium, and then to high, the boundaries between these classes increasingly overlap, making it harder for the model to tell them apart. Think of it like gradually turning up the fog: the underlying patterns are still there, but they become harder to see. The federated learning simulation for this experiment was configured with 5 participating clients, consistent with the best-performing federated setup identified in Experiment 1.
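    The exact noising mechanism isn't reproduced here, but one simple way to blur class boundaries is to re-label a fraction of samples to a neighbouring class on the performance scale, with the fraction set by the noise level. The sketch below is purely illustrative; function and class names are our own.

```python
import numpy as np

# Ordered classes so "neighbouring" is well defined: negative < average < positive.
CLASSES = ["negative", "average", "positive"]

def blur_labels(labels, noise_level, rng):
    """Re-label a `noise_level` fraction of samples to an adjacent class,
    simulating ambiguous campaign outcomes near the class boundaries."""
    labels = np.asarray(labels).copy()
    n_candidates = int(noise_level * len(labels))
    for i in rng.choice(len(labels), size=n_candidates, replace=False):
        idx = CLASSES.index(labels[i])
        # Move one step towards a neighbouring class; steps off either end
        # of the scale are clamped, so those samples keep their label.
        idx = min(max(idx + rng.choice([-1, 1]), 0), len(CLASSES) - 1)
        labels[i] = CLASSES[idx]
    return labels

rng = np.random.default_rng(42)
clean = rng.choice(CLASSES, size=1000)
noisy = blur_labels(clean, noise_level=0.2, rng=rng)
```

    Turning `noise_level` up from 0 towards higher values is the "fog" described above: the generating process is unchanged, but the labels near the class boundaries become progressively less trustworthy.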

    Figure 3: Performance comparison of centralised (left) and federated learning (right) configurations across increasing noise levels. Both paradigms degrade gradually, with Positive F1 and Negative F1 most affected, whilst the performance gap between the two remains approximately constant across all conditions.

    As expected, both models perform best on clean data and gradually decline as noise increases. At high noise:

    • The centralised model’s score drops from 82.05% to 78.11%
    • The FL model’s score drops from 80.74% to 76.09%

    The good news: neither model collapses. Even at the highest noise level both models still perform reasonably well. The overall accuracy dips, and the models struggle most with distinguishing clearly positive or clearly negative campaigns, which makes sense, since those are exactly the boundaries we blurred. However, their ability to capture general patterns across the dataset remains stable throughout.

    As in Experiment 1, the centralised model maintains a consistent edge over the federated setup at every noise level, but the gap between them stays roughly the same. This means that FL doesn’t become more fragile in noisy conditions; it handles data messiness about as well as its centralised counterpart.

    Lesson learned: Real-world data is inherently noisy, and any viable model must be able to handle that. Both centralised and FL models show strong resilience — performance declines gradually rather than breaking down, even when the data is heavily corrupted. Importantly, FL’s relative performance holds steady across noise levels, suggesting it is no more vulnerable to messy data than centralised training.

    Experiment 3: How sensitive are centralised & federated models to underlying cross-modal patterns?

    Our synthetic data generator creates campaign data based on a graph of relationships between five key factors: Audience, Brand, Creative, Platform, and Geography. Each relationship captures whether a particular combination of these factors tends to drive strong or weak campaign performance. Some of these relationships are common and obvious — they show up frequently and reflect well-known marketing dynamics. Others are rare and subtle — unusual combinations that don’t appear often but may carry uniquely valuable signal about what makes a campaign succeed or fail.

    Understanding how these different types of patterns affect learning is important for both training paradigms. If the nature of the underlying data patterns matters, we need to know whether centralised and federated models respond to them in the same way — or whether one setup handles certain patterns better than the other. To investigate this, we generated three versions of our dataset, keeping everything else the same:

    • Common-first: The generator focuses on the most frequently occurring combinations and downplays the rarest ones. This gives us a dataset dominated by typical, familiar marketing patterns.
    • Rare-first: The opposite — the generator prioritises the rarest combinations and downplays the most common. This fills the dataset with unusual, less obvious patterns.
    • Middle-ground: The generator focuses on combinations that fall in the middle of the frequency spectrum, neither the most common nor the rarest.

    As in Experiment 2, the federated learning simulation was run with 5 participating nodes, and performance was compared against the centralised baseline across all three dataset versions.

    Figure 4: Impact of cross-modal relationships on model performance. Prioritising rare feature combinations (Rare-first) substantially improves accuracy compared with focusing on common patterns, showing that atypical relationships provide a stronger learning signal for both centralised and federated learning paradigms.

    The results were striking. The Rare-first configuration dramatically outperformed the other two, achieving peak scores of 94.41% (Centralised) and 93.43% (FL), compared to scores in the 86–88% range for the Common-first and Middle-ground setups.

    This tells us something counterintuitive: the model learns far more from unusual feature combinations than from common ones. The typical, frequently seen patterns are in some sense “easy”: they don’t give the model much new information. But rare combinations force the model to learn more nuanced and distinctive boundaries between what makes a campaign succeed or fail.

    As in previous experiments, the centralised model maintains a small edge over FL, but the ranking between dataset strategies stays the same in both setups. Whether training is centralised or federated, prioritising rare patterns is the winning strategy.

    Lesson learned: Not all data is equally valuable. Prioritising rare, atypical feature combinations produces significantly better models than focusing mostly on common patterns. This has direct implications for how we design synthetic datasets: rather than mimicking the most typical marketing dynamics, we should deliberately include uncommon combinations to give the model a richer and more discriminative learning signal.

    The impact and looking ahead

    This work is just the initial spark for our federated learning efforts. Showing that performance degrades only slightly relative to our centralised ML models, for a reasonable number of participating nodes, opens the discussion about delivering ML solutions to clients who are reluctant to share data yet face a common industry problem. The FL approach allows companies to securely train a shared global model on their own datasets without the risk of data leakage throughout the training process.

    Although Federated Learning has been an established collaborative learning method since 2017, it remains a highly active research domain in academia and a strategic priority for industrial implementation.

    Our next objective is to stress-test and further expand our federated learning (FL) infrastructure by enabling learning across nodes that hold highly heterogeneous data, with substantially different shapes, feature spaces, and underlying distributions. This introduces a number of technical challenges, including how to align representations, aggregate knowledge effectively, and maintain stable performance when local data varies significantly from node to node. Overcoming these challenges will unlock deeper insights into the robustness and scalability of our FL framework, and will allow our models to learn more effectively in realistic, decentralised settings where data heterogeneity is the norm rather than the exception.

    Ready to explore the specifics? Read our full technical deep dive into Multimodal Federated Learning for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.

  • Multimodal Federated Learning Pod

    Multimodal Federated Learning Pod

    Building strong marketing Machine Learning (ML) models requires diverse data spanning multiple facets of campaign performance, but the most valuable data is distributed across organisations and, for good reason, stays firmly within their walls. This structural reality limits the industry’s ability to build truly intelligent campaign optimisation at scale. Federated Learning (FL) offers a way forward: each participant trains on their own data locally, sharing only model updates, never a single row of raw data. In our experiments on synthetic multimodal marketing data, FL with a small number of participants came remarkably close to centralised performance, proved resilient to noisy data, and benefited equally from rare feature combinations, a surprisingly powerful driver of model quality. The findings lay the groundwork for deploying privacy-preserving ML solutions that enable collaborative learning whilst ensuring every organisation’s data remains strictly under its own control.

    If you don’t care about the technical details, read our blog post instead. The code repository will also be released soon.

    Multimodal Federated Learning for marketing outcome prediction: A deep-dive Flower-based simulation analysis

    Training high-capacity models for marketing prediction is often constrained not by algorithmic complexity but by data locality. Across the marketing industry, informative signals naturally reside within individual organisations, and raw data is typically non-transferable due to privacy and governance restrictions. Conventional centralised training, pooling all data to train a single model, is therefore rarely feasible, whilst standard distributed training within a single trust domain does not address the fundamental challenge of improving models when relevant knowledge is held independently by separate entities.

    The multimodal challenge

    The problem is further complicated by the inherently multimodal nature of modern marketing data. A single campaign may span visual brand assets, textual ad copy, structured audience segments, and cross-channel performance metrics. Multimodal learning, training models to reason jointly across such heterogeneous inputs, is already among the most demanding areas in machine learning. Under collaborative constraints, the challenge intensifies: each participating organisation may hold different modality combinations in varying formats and volumes, and a shared model must learn effectively from this fragmented landscape without any raw data ever leaving its source.

    Why use Federated Learning?

    Federated Learning (FL) offers a principled alternative by keeping data at the edge and exchanging only model updates. In the standard cross-silo formulation, each participant trains locally on its own private data, a central coordinator aggregates the resulting model parameters, and the process repeats over communication rounds until convergence. This report studies that pipeline in a multimodal classification setting, where each sample is a structured composition of contextual modalities and the task is ternary outcome prediction.

    The objective

    A fundamental question must be answered before committing to real-world deployment:

    Does federated learning perform well enough on multimodal marketing data to justify its tradeoffs?

    Centralised training retains an intrinsic advantage: full visibility into the data distribution at every optimisation step. The goal of this work is not to show that FL surpasses centralised performance, but to determine whether it approximates it closely enough for the privacy and collaboration benefits to be practically worthwhile. We further examine FL robustness under realistic stress conditions: scaling the number of participating nodes, introducing data noise, and investigating the complexity of relationships among different modalities. If this intersection of federated and multimodal learning can be made viable, it enables collaborative intelligence across organisations at a scale neither approach could achieve alone.

    To ensure a controlled, reproducible analysis, we generate multimodal datasets with a configurable synthetic generator and assign them to virtual nodes using independent and identically distributed (IID)-like, homogeneous partitioning to simulate collaborative training. We implement the full FL loop in simulation mode using Flower, a state-of-the-art federated learning framework. The experiments that follow quantify how federation affects predictive quality relative to a centralised baseline, and how performance varies with (i) the number of nodes, (ii) dataset noise levels, and (iii) the prioritisation of cross-modal relationships in the synthetic data generation based on their frequencies.


    Handling the data

    Synthetic data generator

    For our experiments, we used proprietary multimodal synthetic datasets generated with an internally developed framework. The synthetic data was designed to replicate real-world marketing dynamics, where campaign outcomes are shaped by the interplay of multiple contextual factors (modalities). The framework provides precise control over the composition of each data instance, enabling rigorous evaluation of the proposed model under known conditions and full transparency into the factors driving campaign performance. The specific hyperparameters governing each dataset’s generation are detailed in the experimental sections below, where they are adjusted according to each experiment’s objectives.

    Each campaign instance within the dataset is characterised by five distinct modalities:

    • Audience – the target consumer segment,
    • Brand – the strategic positioning and perceptual attributes of the brand,
    • Creative – the tonal and messaging characteristics of the campaign,
    • Platform – the distribution channel on which the campaign is deployed, and
    • Geography – the market region(s) in which the campaign is activated.

    Each sample is assigned a categorical performance label — Positive (Overperforming), Negative (Underperforming), or Average (Performing within expected bounds) — indicating the projected campaign outcome for a given multimodal configuration. This ternary classification scheme enables the model to learn discriminative representations across the full spectrum of campaign effectiveness.

    Data partitioning

    After generating the training and test datasets with the synthetic data generator, the training data must be partitioned across federated clients to simulate a real-world scenario in which multiple companies collaboratively train the model, each holding its own proprietary data. In practice, this partitioning step would not be necessary, as each company would naturally possess its own local dataset. However, in our simulated environment, the codebase provides the partition-dataset script, which splits the training dataset across a user-defined number of nodes using either a homogeneous or heterogeneous partitioning strategy.

    Under homogeneous partitioning, the script groups training samples by label and divides each label’s indices into equal-sized segments across the specified number of clients. This ensures each node receives a statistically representative subset of the data with approximately uniform class proportions. Under heterogeneous partitioning, the script samples from a Dirichlet distribution with concentration parameter alpha to determine the proportion of each class assigned to each node, producing non-IID label skew of configurable severity. Lower alpha values yield more extreme label imbalance across clients, whilst higher values produce distributions that progressively approach the homogeneous case. For our proof-of-concept research project, we focus our experiments on a homogeneous data split.
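    A minimal NumPy sketch of the two strategies follows. This is illustrative only; the actual partition-dataset script is not reproduced here, and the function names are our own.

```python
import numpy as np

def homogeneous_partition(labels, num_clients, rng):
    """Group indices by label and deal each label's indices evenly across
    clients, so every node gets approximately uniform class proportions."""
    parts = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        for c, chunk in enumerate(np.array_split(idx, num_clients)):
            parts[c].extend(chunk.tolist())
    return [np.array(p) for p in parts]

def dirichlet_partition(labels, num_clients, alpha, rng):
    """Sample per-class client proportions from Dirichlet(alpha); lower
    alpha produces more severe non-IID label skew across clients."""
    parts = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        props = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for c, chunk in enumerate(np.split(idx, cuts)):
            parts[c].extend(chunk.tolist())
    return [np.array(p) for p in parts]

rng = np.random.default_rng(42)
labels = rng.choice([0, 1, 2], size=3000)
homo = homogeneous_partition(labels, num_clients=5, rng=rng)
skewed = dirichlet_partition(labels, num_clients=5, alpha=0.1, rng=rng)
```

    With `alpha=0.1` some clients end up dominated by a single class, which is exactly the label skew a heat map like Figure 1 would reveal; the homogeneous split keeps every client's class mix close to the global one.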

    Upon completion, the script outputs individual training files per node, a shared server test set, and a heat map visualising the label distribution across all partitions. An example of a homogeneous partition heat map across 5 nodes is shown below:

    Figure 1: Heat map showing homogeneous data partitioning across 5 clients in the federated learning scenario

    With the dataset now generated and partitioned into per-node training files, we can move from data preparation to the federated training setup. In the next section, we introduce Flower, the framework we use to orchestrate node selection, parameter exchange, and aggregation over these partitions.


    Flower: The Federated Learning framework

    What is Flower?

    Flower is an open-source, framework-agnostic federated learning framework that enables collaborative model training across decentralised data holders without requiring raw data to leave its source. It supports both real-world distributed deployment over gRPC and local simulation through a Virtual Client Engine (VCE) backed by the Ray distributed runtime. In simulation mode, nodes are virtualised as ephemeral objects instantiated on demand, allowing researchers to simulate federations of arbitrary size on a single machine with fine-grained resource control.

    Simulation cycle

    The simulation follows a centralised client-server architecture executed over iterative communication rounds. Each round proceeds through five sequential stages:

    Figure 2: Breakdown of a federated learning communication round into steps under the Flower framework.


    The above diagram illustrates a complete FL communication round orchestrated by the Flower framework. The process begins with Client Selection, where the Flower Server selects a subset of participants from a broader pool. The server then distributes the current global model parameters to all selected clients simultaneously. Each client performs Local Training in parallel using Ray workers, training on its own private dataset to produce a locally updated model. These Local Updates are sent back to the server, which performs Federated Averaging (FedAvg). This is a weighted average, so clients with more data have proportionally more influence on the updated global model. A readable way to write the FedAvg update is:

    \theta_{\text{global}} = \sum_{k \in \mathcal{K}} \frac{n_k}{N} \, \theta_k

    Where:

    • \theta_{\text{global}} – the updated global model parameters after aggregation.
    • \mathcal{K} – the set of clients selected to participate in this round.
    • \theta_k – the model parameters after client k finishes local training.
    • n_k – the number of training samples held by client k.
    • N = \sum_{k \in \mathcal{K}} n_k – the total number of samples across the participating clients.

    Finally, a Federated Evaluation step assesses the updated global model across all clients and reports per-client accuracy metrics. This pipeline repeats across successive FL rounds until convergence.
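    Applied layer by layer to each client's parameter list, the FedAvg update can be sketched in plain NumPy. This mirrors what a FedAvg strategy does internally; it is not Flower's actual implementation.

```python
import numpy as np

def fedavg(client_params, client_sizes):
    """Weighted average of per-client parameter lists:
    theta_global = sum_k (n_k / N) * theta_k,
    where n_k is client k's local sample count and N is their total."""
    total = sum(client_sizes)
    num_layers = len(client_params[0])
    aggregated = []
    for layer in range(num_layers):
        weighted = sum(
            (n / total) * params[layer]
            for params, n in zip(client_params, client_sizes)
        )
        aggregated.append(weighted)
    return aggregated

# Three clients, each holding two "layers" of parameters.
clients = [
    [np.full((2, 2), 1.0), np.full(3, 1.0)],
    [np.full((2, 2), 2.0), np.full(3, 2.0)],
    [np.full((2, 2), 4.0), np.full(3, 4.0)],
]
sizes = [100, 100, 200]  # the third client holds twice the data
global_params = fedavg(clients, sizes)
# Weighted mean per entry: (1*100 + 2*100 + 4*200) / 400 = 2.75
```

    Because the average is weighted by `n_k`, the data-rich third client pulls the global parameters towards its own, which is the behaviour the equation above prescribes.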

    A key architectural detail of Flower’s VirtualClientEngine underpins this workflow: each virtual client is instantiated at the start of its task, runs training or evaluation, returns results, and is immediately destroyed. This enables all clients in the configuration to run concurrently without persistent memory allocation, making the system highly scalable even on resource-constrained hardware. By keeping raw data decentralised at the edge and exchanging only model parameters, this design preserves data privacy by default whilst still enabling collaborative model improvement across heterogeneous environments.

    Configuration reference

    The simulation is governed by a YAML configuration organised into five sections that correspond to Flower’s core components. Here, we present the YAML file with its default values.

    server:
      strategy: Mean
      fraction_fit: 1.0
      fraction_eval: 1.0
      num_rounds: 4
      server_dataset_included: false
    client:
      num_clients: 5
    model:
      name: GeneralisedJointModel_3cls_feat_drop
      hidden_embed_dim: 256
      embed_dim: 256
      hidden_dim: 256
      dropout_prob: 0.3
      modality_dropout_prob: 0
      local_epochs: 5
      batch_size: 64
      lr: 1e-3
      weight_decay: 1e-4
      optuna_n_trials: 3
    general:
      use_wandb: true
      random_seed: 42
      text_embedding_task: classification
      data_version: V28_noise_0_percent
      data_split: homogeneous
      alpha: 0.1
      one_hot_modalities:
    backend:
      client_resources:
        num_cpus: 2.0
        num_gpus: 0.0
    

    To run an experiment, the user only needs to adjust the desired parameters in this YAML file and execute the simulation with:

    poetry run simulation
    

    The framework reads the configuration, initialises all components accordingly, and executes the full federated learning cycle without any additional setup. Below, we explain what each section’s default values represent.

    Server configuration

    Defines the central orchestrator responsible for client coordination, parameter distribution, and aggregation.

    | Parameter | Default Value | Description |
    | --- | --- | --- |
    | strategy | Mean | Aggregation strategy implementing Federated Averaging (FedAvg), where client model parameters are averaged weighted by each client’s local dataset size. |
    | fraction_fit | 1.0 | Fraction of clients selected for training each round. At 1.0, all clients participate in every round. |
    | fraction_eval | 1.0 | Fraction of clients selected for evaluation each round. At 1.0, all clients evaluate after each aggregation. |
    | num_rounds | 4 | Total number of federated communication rounds. Combined with 5 local epochs per round (default value), each data sample is exposed to 20 effective training epochs. |
    | server_dataset_included | false | The server holds no data partition and acts purely as a parameter aggregator. |

    Client configuration

    Defines the federated client pool. Each client represents an independent data silo with its own private partition.

    | Parameter | Default Value | Description |
    | --- | --- | --- |
    | num_clients | 5 | Total number of virtual clients in the federation. With full participation, this constitutes a cross-silo setting with a small number of reliable, always-available participants. |

    Model configuration

    Defines the neural network architecture and the local training hyperparameters applied on each client.

    Architecture parameters

    | Parameter | Default Value | Description |
    | --- | --- | --- |
    | name | GeneralisedJointModel_3cls_feat_drop | A custom multimodal fusion model with a modality-agnostic joint embedding space, three classification heads for multi-task prediction, and feature-level dropout regularisation. |
    | hidden_embed_dim | 256 | Dimensionality of hidden layers within each modality-specific encoder, applied before the fusion stage. |
    | embed_dim | 256 | Dimensionality of the joint fused embedding space shared across all classification heads. |
    | hidden_dim | 256 | Dimensionality of hidden layers within each of the three classification heads. |
    | dropout_prob | 0.3 | Dropout probability applied in the classification heads and hidden layers. |
    | modality_dropout_prob | 0 | Probability of dropping entire modality branches during training. At 0, all modalities are always present (disabled). |

    Training parameters

    | Parameter | Default Value | Description |
    | --- | --- | --- |
    | local_epochs | 5 | Number of complete passes over a client’s local dataset per communication round. |
    | batch_size | 64 | Mini-batch size for local stochastic gradient descent. |
    | lr | 1e-3 | Learning rate for the local optimiser. |
    | weight_decay | 1e-4 | L2 regularisation coefficient penalising large weight magnitudes. |
    | optuna_n_trials | 3 | Number of Optuna Bayesian hyperparameter optimisation trials. |

    Data configuration

    Controls dataset versioning, partitioning strategy, and modality-specific preprocessing.

    | Parameter | Default Value | Description |
    | --- | --- | --- |
    | data_version | V28_noise_0_percent | Dataset version identifier pointing to a specific preprocessed dataset. |
    | data_split | homogeneous | Partitioning strategy across clients. Points to a homogeneous data distribution folder where data is split in an IID manner and each client receives a statistically representative partition with similar class distributions. The alternative heterogeneous points to a non-IID distribution folder where splits are governed by alpha. |
    | alpha | 0.1 | Dirichlet concentration parameter controlling non-IID severity when data_split points to a heterogeneous distribution folder. Currently inactive as the configuration uses the homogeneous split. When active, lower values produce more extreme label skew across clients. |
    | text_embedding_task | classification | Configures the text modality encoder for a classification objective, affecting pooling strategy and embedding optimisation. |
    | one_hot_modalities | null | Specifies modalities requiring one-hot encoding. Currently none. |

    The data_split and alpha parameters in the simulation YAML configuration must match the partitioning strategy used when running the partition-dataset script, because the simulation reads client data from the output directory, whose folder name encodes both the split type and the number of clients.
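    To make the heterogeneous option concrete, label-skewed splits of the kind alpha controls are commonly produced by drawing per-class client proportions from a Dirichlet distribution. The sketch below illustrates the idea; the function name and shapes are illustrative, not the partition-dataset script itself.

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha, seed=42):
    """Assign sample indices to clients with Dirichlet label skew.

    Lower alpha -> more extreme skew: each class concentrates on few clients.
    """
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        cls_idx = np.flatnonzero(labels == cls)
        rng.shuffle(cls_idx)
        # Proportion of this class assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Convert proportions into split points over the class indices.
        cuts = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for client, part in enumerate(np.split(cls_idx, cuts)):
            client_indices[client].extend(part.tolist())
    return client_indices

labels = np.array([0, 1, 2] * 100)  # 300 samples, 3 balanced classes
parts = dirichlet_partition(labels, num_clients=5, alpha=0.1)
```

    With alpha as low as 0.1, most clients end up dominated by one or two classes; a homogeneous split corresponds to the limit of very large alpha.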

    Simulation backend & experiment management

    Controls reproducibility, logging, and resource allocation for parallel execution of virtual clients through the Ray runtime.

    | Parameter | Default Value | Description |
    | --- | --- | --- |
    | use_wandb | true | Enables Weights & Biases experiment tracking for real-time metric logging and cross-experiment comparison. |
    | random_seed | 42 | Global seed ensuring reproducibility in our experiments. |
    | num_cpus | 2.0 | CPU cores reserved per virtual client task. Ray schedules a client only when the required cores are available. |
    | num_gpus | 0.0 | GPU allocation per client. At 0.0, all computation runs on CPU and concurrency is bounded solely by CPU availability. |

    Experiments

    The following section outlines three experiments that compare federated learning with centralised training under different conditions.

    • Experiment 1 establishes a baseline comparison and tests how FL performance changes as the number of clients increases.
    • Experiment 2 examines how robust both approaches are when controlled noise is added to the training data.
    • Experiment 3 evaluates whether emphasising common vs. rare cross-modal relationships during synthetic data generation affects model performance.

    Overall, these experiments measure both the absolute performance of each approach and whether the performance gap between centralised and federated training remains consistent as the setting becomes more challenging.

    To ensure a fair comparative evaluation, all experimental variables were held constant across centralised and federated configurations. Both setups employ an identical model architecture, ensuring that any observed performance differences are attributable solely to the training paradigm rather than structural variations in the model itself.

    The total computational training budget was also standardised: the centralised model undergoes 20 sequential training epochs, whilst the federated configuration distributes this across 4 global communication rounds with each client performing 5 local epochs per round, yielding an equivalent total of 20 epochs. Furthermore, the training data is partitioned uniformly across all participating clients under a homogeneous assumption, ensuring that each client’s local dataset is a representative subset of the global distribution. This controlled partitioning ensures that any performance degradation observed with increasing client counts can be attributed to the effects of federation and aggregation at scale, rather than to statistical heterogeneity across client data.

    Experiment 1: Centralised vs. federated performance

    The data

    We employed the V16 synthetic dataset, generated using the synthetic data generator, comprising 84K training samples and 36K test samples. For both centralised and federated training configurations, raw input features, specifically Audience, Brand, Creative, Platform, and Geography, were encoded into dense representations using Vertex AI embeddings as a preprocessing step.

    The results

    | Training Configuration | Score | Negative F1 | Positive F1 | Average F1 |
    | --- | --- | --- | --- | --- |
    | Centralised (Baseline) | 0.7967 | 0.7074 | 0.7305 | 0.9523 |
    | FL with 5 clients | 0.7623 | 0.6841 | 0.6571 | 0.9458 |
    | FL with 10 clients | 0.7029 | 0.5781 | 0.5902 | 0.9405 |
    | FL with 15 clients | 0.6765 | 0.5345 | 0.5581 | 0.9368 |

    The centralised training configuration establishes the upper performance bound at a score of 0.7967. This outcome is theoretically expected, as the model benefits from unrestricted access to the complete dataset, without information loss due to partitioning or coordination overhead inherent in distributed paradigms. It therefore serves as the reference benchmark for all federated configurations.

    The federated learning results reveal a consistent and monotonic degradation in performance as the number of participating clients increases. With 5 clients, the model achieves a score of 0.7623 — a modest decline of approximately 3.5 points from the centralised baseline. However, scaling to 10 and 15 clients yields more substantial reductions to 0.7029 and 0.6765, respectively. This pattern is uniformly reflected across all evaluation metrics; however, the decline is most pronounced in the Negative F1 and Positive F1 scores, which degrade at a markedly steeper rate than Average F1. This suggests that class-specific discriminative performance is more sensitive to data partitioning than overall classification ability.

    This observed degradation is primarily attributable to the aggregation penalty. As the number of clients grows, the training corpus is divided into progressively smaller subsets, resulting in local model updates that are less representative of the global data distribution. The increased variance among these updates introduces noise during server-side aggregation, impeding convergence toward a robust global model.

    Lesson learned: FL with 5 clients comes remarkably close to centralised performance, showing that federated collaboration is viable with minimal accuracy loss. However, a growing number of clients makes it progressively harder for the global model to match centralised results.

    Experiment 2: Resilience to noisy data

    The data

    To conduct this experiment, it was necessary to generate synthetic data with controlled levels of noise. To understand what noise means in this context, it is important to first describe how the synthetic data is generated.

    The data generation process is grounded in a predefined graph structure. In this graph, nodes represent distinct values for each modality, namely Audience, Brand, Creative, Platform, and Geography, whilst edges encode the pairwise relationships between these values. Each edge carries a label of either Positive (indicating an overperforming campaign) or Negative (indicating an underperforming campaign).

    The generator samples from this graph to produce a user-defined number of data points, subject to a set of hard constraints governed by configurable hyperparameters. Specifically, the user defines the desired number of samples for each target label: Positive, Negative, and Average. The generation of a single data sample proceeds as follows:

    1. Value Selection: One or more unique values are selected for each modality.
    2. Pairwise Evaluation: All pairwise combinations among the selected values are evaluated against the graph. Each combination is classified as positive, negative, or missing — the latter indicating that no edge exists between the two values in the graph.
    3. Proportion Calculation: The proportions of positive, negative, and missing combinations are computed relative to the total number of pairwise combinations.
    4. Label Assignment: These proportions are then compared against predefined acceptable ranges specified in the hyperparameters for each target label. If the proportions fall within the range defined for Positive, Negative, or Average, the sample is assigned the corresponding label. If the proportions do not satisfy any of the defined ranges, the sample is discarded and the generation process is repeated.
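    Steps 2–4 can be sketched as a proportion check against per-label ranges. The range values below are placeholders for illustration, not the tuned hyperparameters of the actual generator.

```python
from itertools import combinations

# Hypothetical acceptable ranges per target label, as (min, max) proportions
# of positive and negative pairwise combinations. Placeholder values only.
RANGES = {
    "Positive": {"positive": (0.40, 1.00), "negative": (0.00, 0.10)},
    "Negative": {"positive": (0.00, 0.10), "negative": (0.40, 1.00)},
    "Average":  {"positive": (0.10, 0.40), "negative": (0.10, 0.40)},
}

def assign_label(values, edge_labels):
    """Classify a candidate sample, or return None to discard it.

    values:      the modality values selected for this sample.
    edge_labels: dict mapping a frozenset pair to "positive"/"negative";
                 pairs absent from the dict count as "missing".
    """
    pairs = list(combinations(values, 2))
    counts = {"positive": 0, "negative": 0, "missing": 0}
    for a, b in pairs:
        counts[edge_labels.get(frozenset((a, b)), "missing")] += 1
    props = {kind: n / len(pairs) for kind, n in counts.items()}
    for label, bounds in RANGES.items():
        if all(lo <= props[kind] <= hi for kind, (lo, hi) in bounds.items()):
            return label
    return None  # outside every range: discard and resample

graph = {frozenset(("A", "B")): "positive", frozenset(("A", "C")): "positive"}
label = assign_label(["A", "B", "C"], graph)  # 2/3 positive, 0 negative
```

    Returning None models the discard-and-retry behaviour described in step 4.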

    A key question that arises from this process is: How are the predefined acceptable ranges for each target label determined?

    To address this, we conducted the following preliminary experiment. We randomly sampled 10,000 subgraphs, each comprising 1,000 edges, from the initial graph. For each subgraph, we computed the proportions of positive, negative, and missing pairwise combinations. From these 10,000 samples, we derived the mean and standard deviation of the positive, negative, and missing proportions. These statistics were then used to define the acceptable range for the Average target label, representing the typical composition of a randomly sampled subgraph.

    The acceptable ranges for the Positive and Negative target labels were subsequently defined by shifting the boundaries of the Average range along the respective axes. Specifically, the Positive range requires the proportion of positive combinations to exceed the Average upper bound by at least 5 standard deviations, and similarly, the Negative range requires the proportion of negative combinations to exceed the Average upper bound by the same margin. This ensures a clear statistical separation between the three label categories, such that samples assigned to the Positive or Negative class exhibit meaningfully distinct distributional characteristics from those labelled as Average.
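    The calibration above can be sketched as follows. The 5-standard-deviation shift for the Positive bound follows the text; the mean ± one standard deviation band for the Average label and all numeric proportions are assumptions of this sketch, since the post does not state the band’s exact width.

```python
import statistics

def derive_ranges(positive_props, num_std=5):
    """Derive an Average band from sampled subgraph proportions, then place
    the Positive lower bound num_std standard deviations above its top."""
    mean = statistics.mean(positive_props)
    std = statistics.stdev(positive_props)
    average_range = (mean - std, mean + std)  # typical subgraph composition
    positive_min = average_range[1] + num_std * std  # clear statistical separation
    return average_range, positive_min

# Illustrative proportions of positive combinations in sampled subgraphs.
props = [0.30, 0.32, 0.29, 0.31, 0.30, 0.28, 0.33, 0.30, 0.31, 0.29]
avg_range, pos_min = derive_ranges(props)
```

    The same construction, applied along the negative axis, yields the Negative range.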

    Based on the above methodology, we established the appropriate acceptable ranges for each target label. This, however, raises a subsequent question: What constitutes noise in this context?

    Figure 3: Impact of additive noise on the acceptable ranges for each target label in the synthetic data generator. As noise increases from zero to high, the opposing acceptable range for each sample’s target label progressively widens. This increases the acceptable proportion of negative combinations for the Positive label, positive combinations for the Negative label, and both equally for the Average label, thereby reducing the distributional separation between label categories.

    In our framework, noise is defined as the relaxation of the opposing acceptable range for a given target label. Specifically, introducing noise to the Positive target label corresponds to increasing its acceptable proportion of negative combinations — effectively reducing the degree of “positiveness” required for a sample to be classified as Positive. Conversely, adding noise to the Negative target label increases its acceptable proportion of positive combinations. For the Average target label, the additive noise is distributed equally across both the Positive and Negative acceptable ranges.

    This noise mechanism is applied at three levels of intervention — low, medium, and high — each progressively widening the acceptable range of the opposing value for a given target label. The figure above illustrates how the acceptable ranges for each target label are impacted under each level of intervention.

    To support this experiment, four synthetic datasets were generated, each comprising 84K training samples and 36K test samples:

    • Clean: No noise intervention applied.
    • Low Noise: Low-level relaxation of the opposing acceptable ranges.
    • Medium Noise: Medium-level relaxation of the opposing acceptable ranges.
    • High Noise: High-level relaxation of the opposing acceptable ranges.

    The federated learning simulation was configured with five participating clients, and performance was evaluated against the centralised baseline across all four dataset conditions.

    The Results

    | Training Configuration | Score | Negative F1 | Average F1 | Positive F1 |
    | --- | --- | --- | --- | --- |
    | FL with no noise | 0.8074 | 0.8029 | 0.8633 | 0.7559 |
    | Centralised with no noise | 0.8205 | 0.8165 | 0.8692 | 0.7760 |
    | FL with low noise | 0.7923 | 0.7902 | 0.8555 | 0.7313 |
    | Centralised with low noise | 0.8159 | 0.8039 | 0.8750 | 0.7686 |
    | FL with medium noise | 0.7826 | 0.7718 | 0.8570 | 0.7189 |
    | Centralised with medium noise | 0.8017 | 0.7876 | 0.8643 | 0.7533 |
    | FL with high noise | 0.7609 | 0.7344 | 0.8625 | 0.6859 |
    | Centralised with high noise | 0.7811 | 0.7551 | 0.8659 | 0.7221 |

    As anticipated, both the centralised and federated models achieve their highest performance on clean data and exhibit a gradual decline as noise levels increase. At the highest noise intervention, the centralised model’s score decreases from 0.8205 to 0.7811, whilst the federated model’s score declines from 0.8074 to 0.7609 — representing drops of approximately 3.9 and 4.7 percentage points, respectively.

    Notably, neither model exhibits catastrophic degradation under any noise condition. Even at the highest level of intervention, both configurations maintain reasonable performance. The most pronounced declines are observed in the Positive F1 and Negative F1 scores, which is consistent with the noise injection methodology described above: since noise is introduced by relaxing the opposing acceptable range for each target label, the boundaries between Positive and Negative classes become increasingly blurred, making these the most challenging distinctions for the model. In contrast, the Average F1 remains remarkably stable across all noise levels for both configurations, indicating that the models’ capacity to capture general distributional patterns is largely unaffected by the introduced noise.

    Consistent with the findings from Experiment 1, the centralised model maintains a performance advantage over the federated configuration at every noise level. However, the magnitude of this gap remains approximately constant across all noise conditions. This observation is significant: it indicates that the federated setup does not exhibit increased sensitivity to noisy data relative to its centralised counterpart. The performance differential between the two paradigms is attributable to the aggregation penalty discussed in Experiment 1, rather than to any compounding effect of noise on the federated training process.

    Lesson learned: Real-world data is inherently noisy, and any viable model must be able to handle that. Both centralised and FL models show strong resilience: performance declines gradually rather than breaking down, even when the data is heavily corrupted. Importantly, FL’s relative performance holds steady across noise levels, suggesting it is no more vulnerable to messy data than centralised training.

    Experiment 3: Impact of cross-modal relationships under synthetic data generation

    The data

    Leveraging the synthetic data generator enables the investigation of additional structural characteristics of the initial marketing graph — specifically, the edge types representing pairwise relationships between modalities. Understanding which modality pairs (e.g., Audience–Brand or Creative–Geography) are most influential on model performance is of particular interest.

    To this end, we conducted a preliminary analysis: 10,000 subgraphs, each comprising 1,000 edges, were sampled from the initial multimodal graph, and the mean and standard deviation of the observed edge-type frequencies were computed. The table below presents the modality relationships ranked by frequency, from the most common to the most rare.

    | Modality Relationship | Mean Frequency |
    | --- | --- |
    | Brand to Content | 8.48 |
    | Audience to Content | 7.29 |
    | Content to Geography | 6.57 |
    | Audience to Brand | 5.90 |
    | Audience to Geography | 5.84 |
    | Brand to Geography | 5.55 |
    | Content to Content | 3.24 |
    | Brand to Brand | 3.18 |
    | Content to Platform | 1.88 |
    | Brand to Platform | 1.52 |
    | Audience to Audience | 1.18 |
    | Audience to Platform | 1.16 |

    This frequency distribution informed the design of a subsequent performance-based experiment. The synthetic data generator exposes two relevant hyperparameters: a high pair preference, which increases the likelihood of sampling edges from the specified modality relationships, and a low pair preference, which suppresses them. Using these controls, three synthetic datasets were generated under distinct configurations:

    • Common First: The two highest-frequency modality relationships are assigned as the high pair, and the two lowest-frequency relationships as the low pair.
    • Rare First: The inverse configuration, where the two lowest-frequency relationships are assigned as the high pair and the two highest as the low pair.
    • Middle Ground: The four middle-ranked relationships from the frequency table are assigned to the high and low pairs accordingly.

    All remaining hyperparameters were held constant across the three configurations: noise levels were set to zero, and the federated learning simulation was conducted with five participating clients, consistent with the setup described in prior experiments. The training set comprises 84K samples and the test set 36K, as in the previous experiments.
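    The high/low pair preference can be pictured as a biased edge sampler: edges whose modality pair is boosted receive a larger sampling weight, suppressed pairs a smaller one. The pair names and weight values below are hypothetical, not the generator’s actual settings.

```python
import random

# Hypothetical "Rare First" configuration: boost the two rarest pairs,
# suppress the two most common ones.
HIGH_PAIRS = {("Audience", "Platform"), ("Audience", "Audience")}
LOW_PAIRS = {("Brand", "Content"), ("Audience", "Content")}

def edge_weight(pair, high_boost=3.0, low_penalty=0.2):
    """Sampling weight for an edge given its modality pair (order-sensitive
    in this sketch)."""
    if pair in HIGH_PAIRS:
        return high_boost
    if pair in LOW_PAIRS:
        return low_penalty
    return 1.0

edges = [
    ("Audience", "Platform"),
    ("Brand", "Content"),
    ("Brand", "Geography"),
]
weights = [edge_weight(e) for e in edges]
# Biased sampling: boosted "Audience-Platform" edges are drawn 15x more
# often than suppressed "Brand-Content" edges (3.0 / 0.2).
sample = random.choices(edges, weights=weights, k=100)
```

    Swapping the contents of HIGH_PAIRS and LOW_PAIRS would give the inverse "Common First" configuration.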

    The Results

    | Training Configuration | Score | Negative F1 | Average F1 | Positive F1 |
    | --- | --- | --- | --- | --- |
    | Centralised Common First | 0.8803 | 0.8856 | 0.9245 | 0.8308 |
    | FL Common First | 0.8748 | 0.8775 | 0.9150 | 0.8320 |
    | Centralised Rare First | 0.9441 | 0.9503 | 0.9579 | 0.9242 |
    | FL Rare First | 0.9343 | 0.9470 | 0.9505 | 0.9054 |
    | Centralised Middle Ground | 0.8813 | 0.8744 | 0.9143 | 0.8551 |
    | FL Middle Ground | 0.8625 | 0.8677 | 0.9009 | 0.8188 |

    The results reveal a notable disparity in performance across the three dataset configurations. The Rare First configuration substantially outperforms the other two, achieving scores of 0.9441 (Centralised) and 0.9343 (FL) — a margin of approximately 6–8 percentage points over the Common First and Middle Ground configurations, which yield scores in the 0.86–0.88 range. This performance advantage is consistently reflected across all evaluation metrics, with particularly pronounced gains in Positive F1, where the Rare First configuration achieves 0.9242 (Centralised) and 0.9054 (FL), compared to values in the 0.81–0.85 range for the alternative configurations.

    This finding is counterintuitive yet theoretically interpretable. Frequently occurring modality combinations, by virtue of their prevalence, contribute comparatively less discriminative information to the learning process — the decision boundaries they define are, in effect, already well-represented and easily separable. In contrast, rare combinations compel the model to learn more nuanced and distinctive feature interactions, resulting in richer decision boundaries between Positive and Negative campaign outcomes. The learning signal provided by atypical patterns is therefore disproportionately more informative per sample.

    Consistent with findings from prior experiments, the centralised model maintains a modest performance advantage over the federated configuration across all three dataset strategies. Crucially, however, the relative ranking of dataset configurations remains identical under both training paradigms: Rare First consistently outperforms Common First and Middle Ground, regardless of whether training is conducted centrally or in a federated manner.

    Lesson learned: Not all data is equally valuable. Prioritising rare, atypical feature combinations produces significantly better models than focusing mostly on common patterns. This has direct implications for how we design synthetic datasets: rather than mimicking the most typical marketing dynamics, we should deliberately include uncommon combinations to give the model a richer and more discriminative learning signal.


    Impact and future directions

    This work represents an initial investigation into the viability of federated learning within our operational context. The finding that federated performance degrades only marginally from the centralised baseline with a reasonable number of participating clients opens a promising avenue for delivering machine learning solutions that address shared industry challenges among organisations reluctant to pool their data. The federated learning paradigm enables multiple entities to collaboratively train a shared global model on their respective proprietary datasets, without exposing raw data at any stage of the training process, thereby mitigating the risk of data leakage.


    Although Federated Learning has been an established collaborative learning paradigm since its introduction in 2017, it remains a highly active area of research in academia and a strategic priority for industrial adoption. Our initial findings establish the foundation for continued exploration, with future work organised around the following directions:

    1. Privacy guarantees in adversarial federated environments

    Whilst FL enables organisations to collaborate on a shared model without moving or centralising raw data, it’s important to be clear about the remaining risk surface: the exchanged model updates must be handled securely. Extensive literature shows that, in adversarial settings, updates can be targeted by malicious participants or exposed by a compromised coordinator. In practice, this is addressed with a defence baseline, e.g. secure aggregation, privacy protections, and integrity monitoring, so partners can benefit from FL’s collaboration gains whilst maintaining strong privacy and trust throughout training.

    2. Evaluation under advanced and realistic federated scenarios

    Whilst simulating collaborative training with uniformly distributed data provides a valuable baseline for foundational FL research, it does not fully capture the complexities inherent in real-world deployments. Future work will extend our preliminary investigations into data heterogeneity, building upon the noise-injection experiments conducted on synthetic datasets in this study. Additionally, we intend to evaluate the efficacy of maintaining a shared synthetic dataset on the central server as a reference benchmark for assessing the integrity of incoming model updates and detecting potentially malicious contributions. Finally, we plan to transition from the simulated FL environment currently facilitated by the Flower framework to a fully distributed architecture. By deploying distinct computational nodes to represent separate organisational entities, we aim to empirically investigate and address the communication bottlenecks inherent in practical federated deployments.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.