Training together, sharing nothing: The promise of Federated Learning

Why Federated Learning now?

In marketing, data is a competitive edge. The more audience signals, campaign performance data, and consumer behaviour a Machine Learning (ML) model can learn from, the sharper its predictions and the greater its business impact. Across the marketing ecosystem, spanning agencies, brands, and technology partners, organisations individually hold rich and valuable datasets. The potential to learn shared patterns across these assets, without ever exposing proprietary information, could unlock new capabilities: better audience targeting, smarter media spend, and faster creative optimisation at a global scale.

But here’s the challenge. Across the marketing industry, the most impactful data is inherently distributed. Each organisation, whether agency, brand, or technology partner, holds a unique piece of the data puzzle. Client contracts, privacy regulations like GDPR, and the sheer sensitivity of consumer-level data mean this data has to stay within each organisation’s walls. This is a structural reality of the industry, not a limitation of any single company. The question is whether there is a way to learn collectively from this distributed knowledge without compromising the privacy boundaries that exist for good reason.

The traditional solution, centralised ML, pools raw data from multiple sources into a single cloud to train a global model. But uploading terabytes of sensitive data to a central server creates severe network latency and exposes collaborators to data breaches and potential violations of privacy regulations.

Distributed ML methods attempted to address this by splitting training across local worker nodes. Whilst this reduces latency and avoids centralising raw data, these architectures were designed for internal computing clusters, not secure collaboration between independent companies. Without cross-organisation coordination, each organisation’s models are limited to what their own data can teach them, with no mechanism to benefit from shared learning.

Problem: The collaboration vs. privacy bottleneck

Organisations face a fundamental tension: gaining the benefits of shared learning typically requires centralising sensitive data, which privacy and contractual obligations rightly prevent. Yet without a way to learn collectively, each organisation’s models are limited to their own data alone. Neither traditional architecture offers a path to collaborative model improvement whilst keeping private data strictly where it belongs.

Federated Learning (FL) offers a way out of this dilemma by bringing the model to the data, rather than the other way around. To understand why this shift matters, let’s look at how FL actually works under the bonnet.


How Federated Learning works

Figure 1: Overview of the Federated Learning communication cycle between a central node and distributed client nodes.

Federated Learning (FL) enables multiple organisations to collaboratively train a shared model without ever centralising raw data. Instead of moving data to the model, FL brings the model to the data. Training proceeds through iterative rounds:

  1. A central server sends the current global model to all participating nodes (Blue arrow).
  2. Each node trains the model locally on its own private data (Green arrow).
  3. Nodes send back only their model updates, never the underlying data (Pink arrow).
  4. The server aggregates these updates into an improved global model and starts the next round.

Throughout this process, raw data never leaves its source. Only learned model representations are exchanged across the network.
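The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not production FL code: the function names (`local_train`, `aggregate`, `federated_round`) are our own, model "weights" are simple lists of floats, and `local_train` is a stand-in for each node's private training step.

```python
# Minimal sketch of one Federated Learning round. Raw data stays inside
# local_train; only model weights cross the network.

def local_train(global_weights, local_data):
    # Stand-in for local training: nudge each weight towards the mean
    # of the node's private data (purely illustrative, not real SGD).
    target = sum(local_data) / len(local_data)
    return [w + 0.1 * (target - w) for w in global_weights]

def aggregate(updates):
    # Server-side step: average the nodes' model updates element-wise.
    n = len(updates)
    return [sum(ws) / n for ws in zip(*updates)]

def federated_round(global_weights, nodes_data):
    # 1. Server sends the current global model to all nodes.
    # 2. Each node trains locally on its own private data.
    updates = [local_train(global_weights, data) for data in nodes_data]
    # 3-4. Nodes return only their updates; the server aggregates them
    #      into an improved global model for the next round.
    return aggregate(updates)

nodes = [[1.0, 2.0, 3.0], [2.0, 4.0], [0.0, 1.0]]  # private per-node data
weights = [0.0, 0.0]
for _ in range(10):
    weights = federated_round(weights, nodes)
```

Each pass through the loop is one communication round: the global weights drift towards a consensus across all nodes, even though no node ever sees another node's data.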

The multimodal challenge

Whilst the above privacy-preserving framework is valuable in its own right, modern marketing data adds another layer of complexity. Organisations do not just work with spreadsheets and numbers. They work with images, video, text, audio, and structured data, often all at once. A single campaign might involve visual brand assets, ad copy, audience segments, and performance metrics across channels. Training models that can reason across these different data types, known as multimodal learning, is already one of the most demanding challenges in ML.

Now combine that with the constraints of federated learning. Each client may hold different combinations of modalities, in different formats and volumes. One partner might contribute rich visual data, another mostly text and tabular records. Coordinating a single global model that learns effectively from this fragmented, heterogeneous landscape, without ever seeing the raw data, pushes the problem to a new level of complexity.

This is precisely what makes the intersection of FL and multimodal learning so important, and so hard. If it can be made to work, it unlocks collaborative intelligence across organisations at a scale that neither approach could achieve alone.


Our objective: can Federated Learning deliver?

The promise of FL is compelling, but before investing in real-world deployment, we need to answer a fundamental question:

Does federated learning actually work well enough on multimodal marketing data to justify the tradeoff?

Centralised training will always have an inherent advantage: it sees all the data at once. The question is not whether FL can beat centralised performance, but whether it can get close enough to make the privacy and collaboration benefits worthwhile. And beyond raw performance, we need to understand how FL behaves under realistic stress conditions: more partners joining, noisy data, and complex cross-modal relationships.

To answer this, we designed a series of experiments around four key questions:

Experiment 1 — Centralised vs. federated performance

  • How close can FL get to centralised performance? In a centralised setup, the model sees all the data at once, the ideal scenario for learning. FL, by design, fragments this data across nodes. The first question is whether this tradeoff costs us meaningful accuracy, or whether FL can match centralised results despite never accessing the full dataset.
  • What happens as more clients join? In practice, a federated network might involve a handful of partners or dozens. As the number of participants grows, each node holds a smaller, potentially less representative slice of the overall data. We tested how model performance scales as we increase the number of nodes.

Experiment 2 — Resilience to noisy data

  • How robust are centralised and federated models to noisy data? Real-world datasets are messy: labels can be wrongly assigned, and data quality varies across partners. We deliberately introduced noise into the multimodal dataset to simulate these imperfections and measure how much degradation the model can tolerate before performance breaks down.

Experiment 3 — Cross-modal relationships

  • How sensitive are centralised and federated models to underlying cross-modal patterns? Multimodal models learn by finding connections between different types of data. For example, a luxury brand might target a high-income audience through a premium creative tone on a specific platform. Some of these connections appear frequently in the data, whilst others are rare. We tested whether emphasising the most frequent cross-modal patterns in our synthetic data improves performance compared to emphasising the least frequent ones, helping us understand how much the model benefits from common, naturally occurring relationships versus rare, atypical ones.

The data

For our experiments, we used a multimodal synthetic dataset generated by our own well-tested synthetic data generator, designed to mirror real-world marketing dynamics. The generator allows us to customise various elements of the data and design targeted datasets that stress-test our model architecture under controlled conditions, giving us full visibility into the factors that drive campaign performance.

Each campaign in the dataset is described using five key modalities:

  • Audience – the consumer segment being targeted
  • Brand – the positioning and perception of the brand
  • Creative – the tone and message of the campaign
  • Platform – where the campaign runs
  • Geography – the markets being targeted

Each sample in the dataset is assigned a target label, Positive (over-performing), Negative (under-performing), or Average (average performance), indicating whether that particular combination of modalities would lead to a successful, underperforming, or average campaign outcome.
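Concretely, one synthetic sample can be pictured as a record pairing the five modalities with a performance label. The field values below are invented placeholders for illustration, not output from the actual generator:

```python
# Illustrative shape of one synthetic campaign sample; every field value
# here is an invented placeholder, not drawn from the real generator.
sample = {
    "audience":  "young-professionals",  # consumer segment being targeted
    "brand":     "premium-positioning",  # positioning and perception
    "creative":  "aspirational-tone",    # tone and message of the campaign
    "platform":  "social-video",         # where the campaign runs
    "geography": "uk-and-ireland",       # markets being targeted
    "label":     "Positive",             # Positive / Negative / Average
}

VALID_LABELS = {"Positive", "Negative", "Average"}
assert sample["label"] in VALID_LABELS
```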

Experimental results

All federated experiments are implemented using Flower, a widely adopted open-source framework for federated learning research and deployment. Flower allows us to simulate multi-client federated setups in a controlled environment, making it possible to rigorously test different configurations before moving to a fully distributed architecture.

To ensure a fair comparison between centralised and federated setups, we kept the playing field level. Both setups use the exact same model architecture, so any performance differences come from how the model is trained, not what is being trained. In the federated setup, data is split equally across nodes, so that each partner sees a representative sample. This way, when we increase the number of nodes, any change in performance can be attributed to the scaling itself, not to differences in what each node’s data looks like.
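The equal, representative split described above can be sketched as a simple shuffle-and-deal partition. This is an assumed illustration of the idea, not the exact partitioning code used in our experiments:

```python
import random

def partition_equally(samples, num_nodes, seed=42):
    # Shuffle once, then deal samples round-robin so that every node
    # receives an equally sized, statistically representative slice.
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    return [shuffled[i::num_nodes] for i in range(num_nodes)]

dataset = list(range(100))       # stand-in for 100 campaign samples
nodes = partition_equally(dataset, 5)
sizes = [len(node) for node in nodes]   # every node gets 20 samples
```

Because the split is uniform, any performance change observed when moving from 5 to 10 to 15 nodes can be attributed to the fragmentation itself rather than to skew in what each node sees.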

Experiment 1: How much accuracy do we trade for privacy?

Figure 2: Impact of increasing node fragmentation on Federated Learning performance. Performance clearly degrades as the number of nodes increases from 5 to 15, compared to the baseline centralised model version.

The centralised model sets the performance ceiling at 79.67%. This is expected: when a single model has direct access to all the data at once, it has the best possible conditions to learn. No information is lost to partitioning, and no coordination overhead is introduced. It’s the ideal scenario, and the benchmark everything else is measured against.

The federated results tell a clear story: as we add more nodes, performance gradually declines. With 5 nodes, the model reaches 76.23%, a modest drop from the centralised baseline. But as we scale to 10 and then 15 nodes, scores fall to 70.29% and 67.65% respectively. The same pattern holds across all metrics, with the sharpest drops in the model’s ability to correctly identify both positive and negative cases.

Why does this happen? As more nodes join, the total dataset gets divided into smaller slices. Each node sees less data, which means each node’s local training produces a less reliable picture of the overall patterns. When the server combines these local updates, the differences between them make it harder to converge on a strong global model, an effect we call the “aggregation penalty.”

Lesson learned: FL with 5 nodes comes remarkably close to centralised performance, showing that federated collaboration is viable with minimal accuracy loss. However, as the number of nodes grows, it becomes progressively harder for the global model to match centralised results.

Experiment 2: How robust are centralised and federated models to noisy data? 

In practice, marketing data is never perfectly clean. Campaign outcomes don’t fall neatly into “this worked” or “this didn’t.” Was a campaign that slightly exceeded expectations truly a success, or just average? Was a modest underperformance a failure, or noise in the measurement? Different teams may label the same outcome differently, tracking systems introduce inconsistencies, and the line between a “positive” and “average” campaign is often blurry.

To simulate this reality, we deliberately introduced noise into our synthetic dataset by blurring the boundaries between performance classes. With no noise, the labels are clean — positive, negative, and neutral outcomes are clearly separated. As we increase the noise level from low, to medium, and then to high, the boundaries between these classes increasingly overlap, making it harder for the model to tell them apart. Think of it like gradually turning up the fog: the underlying patterns are still there, but they become harder to see. The federated learning simulation for this experiment was configured with 5 participating clients, consistent with the best-performing federated setup identified in Experiment 1.
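One simple way to blur class boundaries, and a rough stand-in for however the generator actually injects noise, is to randomly reassign a fraction of labels to an adjacent performance class. The function below is our own hedged illustration, not the generator's implementation:

```python
import random

LABELS = ["Negative", "Average", "Positive"]  # ordered performance scale

def blur_labels(labels, noise_level, seed=0):
    # With probability noise_level, move a label one step along the
    # ordered scale Negative <-> Average <-> Positive, so adjacent
    # classes increasingly overlap as noise_level rises.
    rng = random.Random(seed)
    noisy = []
    for lab in labels:
        if rng.random() < noise_level:
            i = LABELS.index(lab)
            neighbours = [j for j in (i - 1, i + 1) if 0 <= j < len(LABELS)]
            lab = LABELS[rng.choice(neighbours)]
        noisy.append(lab)
    return noisy

clean = ["Positive"] * 500 + ["Average"] * 500 + ["Negative"] * 500
noisy = blur_labels(clean, noise_level=0.2)
flipped = sum(c != n for c, n in zip(clean, noisy))  # roughly 20% flipped
```

Turning `noise_level` up from low to medium to high corresponds to the "fog" described above: the underlying class structure survives, but its boundaries become progressively harder to see.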

Figure 3: Performance comparison of centralised (left) and federated learning (right) configurations across increasing noise levels. Both paradigms degrade gradually, with Positive F1 and Negative F1 most affected, whilst the performance gap between the two remains approximately constant across all conditions.

As expected, both models perform best on clean data and gradually decline as noise increases. At high noise:

  • The centralised model’s score drops from 82.05% to 78.11%
  • The FL model’s score drops from 80.74% to 76.09%

The good news: neither model collapses. Even at the highest noise level both models still perform reasonably well. The overall accuracy dips, and the models struggle most with distinguishing clearly positive or clearly negative campaigns, which makes sense, since those are exactly the boundaries we blurred. However, their ability to capture general patterns across the dataset remains stable throughout.

As in Experiment 1, the centralised model maintains a consistent edge over the federated setup at every noise level, but the gap between them stays roughly the same. This means that FL doesn’t become more fragile in noisy conditions; it handles data messiness about as well as its centralised counterpart.

Lesson learned: Real-world data is inherently noisy, and any viable model must be able to handle that. Both centralised and FL models show strong resilience — performance declines gradually rather than breaking down, even when the data is heavily corrupted. Importantly, FL’s relative performance holds steady across noise levels, suggesting it is no more vulnerable to messy data than centralised training.

Experiment 3: How sensitive are centralised & federated models to underlying cross-modal patterns?

Our synthetic data generator creates campaign data based on a graph of relationships between five key factors: Audience, Brand, Creative, Platform, and Geography. Each relationship captures whether a particular combination of these factors tends to drive strong or weak campaign performance. Some of these relationships are common and obvious — they show up frequently and reflect well-known marketing dynamics. Others are rare and subtle — unusual combinations that don’t appear often but may carry uniquely valuable signal about what makes a campaign succeed or fail.

Understanding how these different types of patterns affect learning is important for both training paradigms. If the nature of the underlying data patterns matters, we need to know whether centralised and federated models respond to them in the same way — or whether one setup handles certain patterns better than the other. To investigate this, we generated three versions of our dataset, keeping everything else the same:

  • Common-first: The generator focuses on the most frequently occurring combinations and downplays the rarest ones. This gives us a dataset dominated by typical, familiar marketing patterns.
  • Rare-first: The opposite — the generator prioritises the rarest combinations and downplays the most common. This fills the dataset with unusual, less obvious patterns.
  • Middle-ground: The generator focuses on combinations that fall in the middle of the frequency spectrum, neither the most common nor the rarest.
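The three strategies amount to ranking the distinct cross-modal combinations by frequency and emphasising a different region of that ranking. A minimal sketch of the idea, with an invented helper (`prioritise`) and toy (audience, platform) pairs rather than real generator output:

```python
from collections import Counter

def prioritise(combinations, strategy, k):
    # Rank distinct cross-modal combinations by how often they occur,
    # then keep the k combinations the chosen strategy emphasises.
    ranked = [combo for combo, _ in Counter(combinations).most_common()]
    if strategy == "common-first":
        return ranked[:k]            # most frequent combinations
    if strategy == "rare-first":
        return ranked[-k:]           # least frequent combinations
    if strategy == "middle-ground":
        start = len(ranked) // 2 - k // 2
        return ranked[start:start + k]
    raise ValueError(f"unknown strategy: {strategy}")

# Toy (audience, platform) pairs with deliberately skewed frequencies.
combos = ([("young", "social")] * 6
          + [("senior", "tv")] * 3
          + [("teen", "audio")] * 1)
common = prioritise(combos, "common-first", 1)  # -> [("young", "social")]
rare = prioritise(combos, "rare-first", 1)      # -> [("teen", "audio")]
```

In the real generator the same selection happens over the full relationship graph across all five modalities, but the frequency-ranking principle is the same.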

As in Experiment 2, the federated learning simulation was run with 5 participating nodes, and performance was compared against the centralised baseline across all three dataset versions.

Figure 4: Impact of cross-modal relationships on model performance. Prioritising rare feature combinations (Rare-first) substantially improves accuracy compared with focusing on common patterns, showing that atypical relationships provide a stronger learning signal for both centralised and federated learning paradigms.

The results were striking. The Rare-first configuration dramatically outperformed the other two, achieving peak scores of 94.41% (Centralised) and 93.43% (FL), compared to scores in the 86–88% range for the Common-first and Middle-ground setups.

This tells us something counterintuitive: the model learns far more from unusual feature combinations than from common ones. The typical, frequently seen patterns are in some sense “easy”, they don’t give the model much new information. But rare combinations force the model to learn more nuanced and distinctive boundaries between what makes a campaign succeed or fail.

As in previous experiments, the centralised model maintains a small edge over FL, but the ranking between dataset strategies stays the same in both setups. Whether training centrally or federally, prioritising rare patterns is the winning strategy.

Lesson learned: Not all data is equally valuable. Prioritising rare, atypical feature combinations produces significantly better models than focusing mostly on common patterns. This has direct implications for how we design synthetic datasets: rather than mimicking the most typical marketing dynamics, we should deliberately include uncommon combinations to give the model a richer and more discriminative learning signal.

The impact and looking ahead

This work is just the initial spark for our federated learning efforts. Showing that federated training degrades only slightly relative to our centralised ML models under a reasonable number of participating nodes opens the discussion about delivering ML solutions to clients who face a shared industry problem but are reluctant to pool their data. The FL approach allows companies to securely train a shared global model on their own datasets without the risk of data leakage at any point during training.

Although Federated Learning has been an established collaborative learning method since 2017, it remains a highly active research domain in academia and a strategic priority for industrial implementation.

Our next objective is to stress-test and further expand our federated learning (FL) infrastructure by enabling learning across nodes that hold highly heterogeneous data, with substantially different shapes, feature spaces, and underlying distributions. This introduces a number of technical challenges, including how to align representations, aggregate knowledge effectively, and maintain stable performance when local data varies significantly from node to node. Overcoming these challenges will unlock deeper insights into the robustness and scalability of our FL framework, and will allow our models to learn more effectively in realistic, decentralised settings where data heterogeneity is the norm rather than the exception.

Ready to explore the specifics? Read our full technical deep dive into Multimodal Federated Learning for a closer look at our methodology.

Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.

Author

  • Emmanouil (Manos) Kritharakis is a Data Scientist at Satalia, working in the Research Lab on Graph Machine Learning and Federated Learning Systems. His research focuses on scalable graph-based methods and privacy-preserving distributed learning, with publications at leading venues including VLDB and ECML-PKDD. He brings hands-on experience building production ML systems end-to-end, from data preprocessing to deployment, bridging the gap between research and real-world applications.
