Introduction
Influencer marketing has matured into a multi-billion-dollar channel, and the industry’s investment in measurement, audience intelligence, and brand-safety tooling has grown alongside it. Even so, one specific challenge remains difficult to solve at scale: forecasting the engagement of an individual post before it is published. An underperforming placement can mean missed momentum and a creative team restarting a cycle that better foresight could have shortened.
Existing predictive approaches have made meaningful progress by leveraging structured signals such as hashtag usage, visual composition, and posting cadence, and they perform well for the decisions those features can inform. Where they reach a ceiling is in capturing the cross-modal, contextual qualities that separate adequate content from high-performing content: why a particular tone lands with a particular audience, or why a product placement feels native rather than intrusive. These are judgments that experienced strategists make fluently but that conventional feature pipelines were not designed to encode.
The remaining gap is less about computational power than about representational depth. What makes a post resonate is something a skilled strategist can often articulate after the fact: the influencer’s tone felt effortless, the product placement didn’t interrupt the narrative, the caption hit a cultural nerve. These judgments integrate context, intent, and audience understanding in ways that go beyond pixels and metadata alone.
This is the problem the Prediction Optimisation Agent was designed to address. Building on the structured signals that current pipelines already capture, the agent adds a layer of contextual interpretation: it examines the image, the caption, and the influencer’s history, then writes a structured natural-language description of the factors most likely to drive the post’s performance. A creative director can read this description, challenge it, and act on it, complementing quantitative scores with the kind of reasoning that makes those scores actionable.
The intuition is simple: the description that best predicts performance is, by definition, the description that best explains it. The agent iteratively refines this description by diagnosing its own errors, identifying what the previous descriptions failed to capture, and rewriting its own instructions to self-improve. Over successive iterations it converges on the specific qualities that actually drive engagement, not from hand-crafted rules but from the systematic minimisation of its own predictive error, and it surfaces those qualities in a form that teams can inspect, debate, and build strategy around.
The anatomy of a viral post
Predicting the performance of an ad or a social media post before publishing remains a primary objective for marketers and influencers alike. But how do you distill something as complex and subjective as a social media post into a single prediction?
Consider a typical Instagram post. It is never just a picture. It’s a complex combination of different data types working together simultaneously. Take the influencer post shown in Figure 1. To truly understand why this post succeeds or fails, you need to consider:
- The image itself — composition, lighting, color palette, subjects, products, and setting.
- The caption — where the influencer might share a discount code, crack a joke, or strike an emotional chord.
- The influencer’s identity — their bio, follower count, niche credibility, and historical performance.
- The metadata — the time of day, geographic location, hashtags, and platform-specific context.
Each of these dimensions carries signal. None of them tells the full story alone. The magic, and the difficulty, lies in how they interact.

Figure 1: A typical influencer post. Traditional analytics struggle to measure the combined impact of the visual aesthetic, the caption’s tone, and the underlying metadata. To accurately predict engagement for a post like this, our system analyzes the image, caption and influencer statistics together as a single cohesive unit. All persons, brands and products depicted in this image are AI generated.
Where current approaches reach a ceiling
The prevailing approach to content-level prediction decomposes a post into its component signals and processes each through a dedicated model. This modular architecture has clear engineering advantages. Each component can be developed, validated, and updated independently, and the approach performs well for the decisions those individual signals can inform:
- Computer Vision Models: Isolated image-recognition algorithms scan the visual to detect objects, people, or products. Separate models handle face detection and emotion recognition. The output is a list of labels: “person detected,” “beverage detected,” “outdoor setting.”
- Text Analysers & OCR: NLP tools parse the caption, counting hashtags, flagging emojis, scoring sentiment. Meanwhile, optical character recognition (OCR) software reads any text visible within the image itself.
- Tabular Metadata Algorithms: An algorithm ingests structured fields like follower count, posting time, engagement history, and produces its own independent prediction.
Engineers then attempt to fuse these outputs into a single forecast. Each module performs its own task well, but because the signals are extracted independently, the fused representation inherits a structural limitation: it has difficulty capturing meaning that emerges from the interaction between modalities, qualities that exist not in the image or the caption alone, but in the relationship between them.
Consider a concrete example. Imagine a fitness influencer posts a photo of herself laughing mid-sip from an energy drink, with the caption: “My face when someone says they don’t need pre-workout 😂.”
A computer vision model would tag this as: “person detected,” “beverage detected,” “outdoor setting,” “positive facial expression.” A text analyser would count the hashtags and flag the emoji. But the joke, the caption reframing the laugh as a reaction shot, turning a standard product image into a relatable meme, lives in the interplay between image and text. It is not a property of either signal individually, and a pipeline that processes them separately has no natural place to represent it.
Similarly, the fact that this influencer is a certified nutritionist (meaning her credentials paired with an energy drink carry implicit credibility that a fashion influencer holding the same product would not), is a cross-modal inference that requires linking metadata (professional background) with visual content (product in hand) and audience expectation. This is the kind of contextual reasoning that falls outside the scope of independently trained modules.
Humour, irony and credibility through context are the cross-modal qualities that often separate high-performing content from competent content. They are also the qualities that a modular, signal-by-signal architecture was not designed to represent. Closing this gap requires a fundamentally different representational strategy, one that reasons over all modalities jointly from the outset.
Our approach: Unifying multimodal data through semantic translation
To address the above, we developed the Prediction Optimisation Agent, a self-improving AI agent that unifies all available data into a single format it can reason about: natural language.
The agent’s core mechanism is straightforward. It takes complex, multimodal data (numerical metrics, images, video, and text captions) and converts everything into a single natural-language paragraph that holistically describes the post’s content, aesthetic, tone, and context. By projecting these distinct formats into readable text, the agent normalises heterogeneous data into a structure that a language model can process as a unified whole.
Instead of treating image and text as separate inputs, the agent uses a single prompt to digest all available information at once. Multimodal LLMs serve as one of the agent’s tools, acting as universal feature extractors that capture the abstract, human-centric concepts that traditional pipelines structurally cannot.
But the agent does not simply produce any description and hope it is useful. It is driven by a feedback loop grounded in predictive error: the descriptions it generates are used to forecast engagement, those forecasts are compared against real outcomes, and the resulting errors tell the agent exactly how much predictive value its current descriptions are capturing and how much they are missing. Through successive rounds of this loop, the agent autonomously rewrites the instructions that govern how descriptions are composed, converging on the paragraph structure that maximises predictive accuracy.
This error-driven process has a profound consequence for explainability. The description the agent converges on is not a generic summary. It is the description that the agent has discovered, through empirical optimisation, to be the most predictive of real engagement outcomes. In other words, the features highlighted in the final description are there because they matter, because including them reduced prediction error. When the optimised description of a high-performing post calls out “candid humour,” “golden-hour lighting,” and “influencer credibility,” those aren’t arbitrary observations. They are the factors the agent learned to pay attention to because they measurably improved its ability to predict what performs well.
How the Prediction Optimisation Agent works
The Prediction Optimisation Agent orchestrates three internal stages in a continuous feedback loop: it observes a post, describes it, predicts its performance, measures how far off it was, and then rewrites its own instructions to produce better descriptions next time — closing the loop and getting measurably better with every iteration, without any human intervention.

Figure 2: The Prediction Optimisation Agent architecture. Raw media, metadata, and an initial prompt are fed into Stage 1 (Semantic Translation), which produces a natural-language description of the post. Stage 2 (Engagement Predictor) reads that description and predicts engagement. Prediction errors are then passed to Stage 3 (Self-Optimiser), which autonomously analyses what went wrong and rewrites the Stage 1 prompt, closing the feedback loop and improving the system’s accuracy with every iteration.
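The closed loop in Figure 2 can be sketched in a few lines of Python. Everything below is a deliberately simplified stand-in, not the production system: `describe`, `predict`, and `rewrite_prompt` are toy placeholders for the multimodal LLM, the engagement predictor, and the self-optimiser described in the three stages that follow, and the "prompt" is reduced to a list of content dimensions the description is asked to cover.

```python
# Toy sketch of the describe -> predict -> evaluate -> rewrite loop.
# All components are simplified stand-ins for the real LLM-based stages.

# Each "post" is a dict of hidden content qualities plus its true engagement.
POSTS = [
    {"humour": 2.0, "cohesion": 1.0, "credibility": 3.0, "engagement": 6.0},
    {"humour": 0.0, "cohesion": 2.0, "credibility": 1.0, "engagement": 3.0},
    {"humour": 1.0, "cohesion": 0.0, "credibility": 2.0, "engagement": 3.0},
]
ALL_DIMENSIONS = ["humour", "cohesion", "credibility"]

def describe(post, prompt_dims):
    """Stage 1 stand-in: the 'description' only covers what the prompt asks for."""
    return {d: post[d] for d in prompt_dims}

def predict(description):
    """Stage 2 stand-in: engagement forecast from the description alone."""
    return sum(description.values())

def mean_abs_error(prompt_dims):
    """Evaluate the current prompt against actual engagement outcomes."""
    errors = [abs(predict(describe(p, prompt_dims)) - p["engagement"]) for p in POSTS]
    return sum(errors) / len(POSTS)

def rewrite_prompt(prompt_dims):
    """Stage 3 stand-in: greedily add the dimension that most reduces error."""
    candidates = [d for d in ALL_DIMENSIONS if d not in prompt_dims]
    if not candidates:
        return prompt_dims
    best = min(candidates, key=lambda d: mean_abs_error(prompt_dims + [d]))
    return prompt_dims + [best]

prompt = ["cohesion"]  # Round 1: a deliberately narrow starting prompt
history = []
for round_no in range(3):
    history.append((list(prompt), mean_abs_error(prompt)))
    prompt = rewrite_prompt(prompt)

for dims, mae in history:
    print(dims, round(mae, 2))
```

Because the true engagement in this toy setup is the sum of all three hidden qualities, each rewrite that adds a missing dimension measurably lowers the error, mirroring the compounding improvement curve the real agent exhibits.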
Stage 1: Semantic Translation
The agent begins by ingesting the raw post — the image or video file, the caption text, and all available metadata (follower count, posting time, influencer bio, etc.). Using a multimodal LLM as its translation tool and guided by a detailed set of internal instructions (its prompt), it produces a single, rich natural-language paragraph that captures not just what is in the post, but what the post means: the visual mood, the emotional tone, the relationship between caption and image, and the brand alignment.
The quality and focus of this description is entirely governed by the prompt and as we will see, it is the prompt that the agent learns to optimise.
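As a rough illustration of how the raw inputs and the optimisable prompt might be packed into a single multimodal request, here is a minimal sketch; the function name, field names, and request shape are illustrative assumptions, since the actual LLM-calling layer is not specified in this article.

```python
import json

def build_stage1_request(prompt, caption, metadata, image_path):
    """Assemble one multimodal request for the semantic-translation step.

    `prompt` is the agent's current (optimisable) instruction set; the image
    is referenced by path here and would be attached as a separate modality
    when calling a multimodal LLM.
    """
    context = json.dumps(metadata, indent=2, sort_keys=True)
    text = (
        f"{prompt}\n\n"
        f"Caption:\n{caption}\n\n"
        f"Influencer metadata:\n{context}\n\n"
        "Write a single rich paragraph describing what the post means, "
        "not just what it contains."
    )
    return {"text": text, "image": image_path}

request = build_stage1_request(
    prompt="Analyse this Instagram post and the provided metadata.",
    caption="Focus your energy. Unlock your potential.",
    metadata={"follower_count": 120_000, "niche": "fitness", "posted_at": "18:30"},
    image_path="aura_energy_post.jpg",
)
print(request["text"])
```

The key design point is that the instruction text at the top of the request is the only part the self-optimiser rewrites; caption, metadata, and image are passed through unchanged every round.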
Stage 2: Engagement Predictor
The agent passes the semantic paragraph to its prediction tool, a model that evaluates the post’s potential performance based entirely on the natural-language description from Stage 1.
The predictor can be any machine-learning model able to interpret text paragraphs: tree-based ensembles, deep-learning architectures, or even a fine-tuned LLM upskilled for predictions in a specific domain. The agent is compatible with all of these options.
In all cases, the quality of the prediction is only as good as the quality of the semantic description. A richer, more targeted paragraph gives the predictor more signal to work with, which is precisely why improving the description is the single most important action the agent can take.
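The "any model that understands text" property can be made concrete with a small interface sketch. The names below are hypothetical, and the keyword-weight predictor is a toy stand-in for the real model; the point is that trees, neural networks, or a fine-tuned LLM could all sit behind the same contract.

```python
from typing import Protocol

class EngagementPredictor(Protocol):
    """Any model that maps a semantic description to an engagement forecast."""
    def predict(self, description: str) -> float: ...

class KeywordPredictor:
    """Toy predictor: scores a description by weighted keyword presence.

    A real deployment would swap in a tree ensemble, a neural model, or a
    fine-tuned LLM behind the same `predict` interface.
    """
    def __init__(self, weights):
        self.weights = weights

    def predict(self, description: str) -> float:
        text = description.lower()
        return sum(w for kw, w in self.weights.items() if kw in text)

model: EngagementPredictor = KeywordPredictor(
    {"humour": 2.5, "credibility": 3.0, "golden-hour": 1.5}
)
score = model.predict(
    "Candid humour and strong influencer credibility anchor this post."
)
print(score)  # 2.5 + 3.0
```

Note how the toy model rewards exactly the kind of features the optimised descriptions surface: a richer paragraph mentioning more of the qualities that matter gives the predictor more signal, which is the mechanism the article describes.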
Stage 3: Self-Optimiser
This is the core of the Prediction Optimisation Agent’s intelligence — and what makes it a true agent rather than a static model.
After predictions are generated for a batch of posts and compared against actual engagement outcomes, the agent enters a self-evaluation cycle. The agent is presented with:
- The prompt it used to generate descriptions in the current round.
- The per-post prediction errors. For each post, the predicted engagement versus the actual engagement, and the magnitude and direction of the miss.
- The aggregate error metrics. Overall accuracy statistics such as mean absolute error, broken down by content category, audience size, or post type, revealing systematic blind spots.
- The complete history of prior prompts and their corresponding error metrics, so the agent can assess how each previous prompt version performed, which changes improved accuracy, and which did not.
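The per-category error breakdown in the second and third bullets is straightforward to compute; here is a stdlib sketch in which the result tuple shape and field names are illustrative assumptions.

```python
from collections import defaultdict

def error_breakdown(results):
    """Mean absolute error per content category, exposing systematic blind spots.

    Each result is a (category, predicted_engagement, actual_engagement) tuple.
    """
    totals = defaultdict(lambda: [0.0, 0])  # category -> [error sum, count]
    for category, predicted, actual in results:
        totals[category][0] += abs(predicted - actual)
        totals[category][1] += 1
    return {cat: s / n for cat, (s, n) in totals.items()}

results = [
    ("fitness", 4.0, 6.0),   # under-predicted by 2
    ("fitness", 5.0, 5.0),   # exact
    ("fashion", 3.0, 3.5),   # under-predicted by 0.5
]
print(error_breakdown(results))  # {'fitness': 1.0, 'fashion': 0.5}
```

A breakdown like this is what lets the agent see, for example, that fitness content is being systematically under-predicted while fashion content is nearly on target.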
Armed with this structured feedback, the agent does not guess at improvements. It reasons about them. It asks: Which types of posts am I consistently getting wrong? What do those posts have in common? What dimensions of the content is my current prompt failing to capture?
Based on this analysis, the agent autonomously rewrites the prompt instructions used in Stage 1. The rewritten prompt is not a minor tweak. It can add entirely new analytical dimensions, remove instructions that proved unhelpful, or restructure the description format to foreground the features most correlated with engagement.
The critical insight is that the predictive error is the signal that drives the entire system. Every improvement the agent makes to its prompt is motivated by a concrete, quantifiable gap between what it predicted and what actually happened. The agent isn’t optimising descriptions to be “better” in some abstract sense, it is optimising them to be more predictive, and the error metrics tell it exactly where it is falling short.
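One way the structured feedback described above might be assembled into the self-optimiser's input is sketched below; the function name and the exact wording of the instruction are hypothetical, standing in for whatever meta-prompt the real system uses.

```python
def build_optimiser_input(prompt_history, current_errors):
    """Format the feedback the self-optimiser reasons over.

    `prompt_history` is a list of (prompt, mean_abs_error) pairs from prior
    rounds; `current_errors` maps each content category to its current MAE.
    """
    lines = ["Previous prompts and their overall error:"]
    for i, (prompt, mae) in enumerate(prompt_history, start=1):
        lines.append(f"  Round {i} (MAE {mae:.2f}): {prompt}")
    lines.append("Current error by content category:")
    for category, mae in sorted(current_errors.items()):
        lines.append(f"  {category}: {mae:.2f}")
    lines.append(
        "Identify the content dimensions the current prompt fails to capture "
        "and rewrite it to reduce these errors."
    )
    return "\n".join(lines)

feedback = build_optimiser_input(
    prompt_history=[("Describe the visual elements and mood.", 2.40)],
    current_errors={"fitness": 3.10, "fashion": 1.20},
)
print(feedback)
```

Giving the optimiser the full prompt history, rather than only the latest round, is what lets it judge which past changes helped and which did not, instead of rediscovering the same dead ends.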
The agent in action: Watching the prompt evolve
To make this concrete, let’s follow the agent through three optimization rounds on a single post: the Aura Energy ad featuring two young models in blue athletic wear, photographed from a dramatic low angle against a vivid blue sky lined with palm trees.

Figure 3: Example Instagram post used for semantic translation. All persons, brands and products depicted in this image are AI generated.
Round 1: The baseline
The agent starts with a deliberately simple, generic prompt:
Prompt (Round 1): “Analyse this Instagram post and the provided metadata. Describe the visual elements of the image, the style, and the mood. Then, explain how the image relates to the brand or influencer described in the metadata.”
The agent’s description:
“This Instagram post is a high-energy lifestyle advertisement for Aura Energy, utilizing a vibrant, ‘streetwear-meets-summer’ aesthetic to target a young, active demographic. The image uses a low-angle shot, which makes the subjects appear dominant and ‘larger than life.’ Two young models are featured in athletic/leisure wear. The female subject in the foreground holds the product directly toward the lens and wears reflective ‘matrix-style’ sunglasses. The male subject in the background wears a blue Jordan hoodie. The post aligns perfectly with the brand’s identity as described in the metadata. The visual style is polished and fashionable, positioning the drink as a lifestyle accessory rather than just a functional caffeine source. The caption emphasises ‘Zero Sugar’ and staying ‘sharp’ and ‘unstoppable.’”
This is already more nuanced than what a single-signal visual model would typically produce. However, the predictor, working from this description, significantly underestimates the post’s engagement.
Round 2: The agent adapts
The self-optimiser reviews the Round 1 errors and identifies the gap: the prompt never asked the agent to evaluate visual cohesion between talent styling and brand identity, or the psychological impact of compositional techniques like camera angle. It rewrites the prompt:
Prompt (Round 2): “Analyse this Instagram post and the provided metadata. Describe the visual elements, style, mood, and composition. Assess the visual cohesion between the talent’s styling (wardrobe, accessories, posture) and the brand’s visual identity. Evaluate how compositional choices — such as camera angle, framing, and subject placement — contribute to the aspirational or emotional tone of the post. Explain how the image, caption, and influencer profile work together to create a unified message.”
The agent’s description now reads:
“This Instagram post is a high-energy lifestyle advertisement for Aura Energy that demonstrates strong visual-brand cohesion. The coordinated blue wardrobe across both models — the female subject’s athletic top and the male subject’s Jordan hoodie — creates a seamless extension of the brand’s blue-and-white visual identity, embedding the product within the scene rather than presenting it as a foreign element. The dramatic low-angle shot elevates the subjects into an aspirational, dominant position, making the viewer look up at them — a compositional choice that reinforces the caption’s message of being ‘unstoppable.’ The female subject holds the product at the camera’s focal point with deliberate confidence, with her sunglasses only adding to this effect, while the male subject’s relaxed posture adds an effortless credibility. Together, the image, caption, and influencer positioning frame the energy drink not as a beverage but as a lifestyle symbol for a young, fashion-conscious audience.”
The prediction for this post improves noticeably. Across the batch, the error on brand-partnership posts drops. But the agent notices it is still missing on posts where influencer credibility plays a key role — posts where who the influencer is matters as much as what’s in the frame.
Round 3: The agent sharpens further
The self-optimiser reviews Rounds 1 and 2 side by side, seeing that the Round 2 prompt improved brand-cohesion analysis but still lacks attention to influencer authority and audience trust signals. It rewrites again:
Prompt (Round 3): “Analyse this Instagram post and the provided metadata. Describe the visual elements, style, mood, and composition. Assess the visual cohesion between the talent’s styling and the brand’s visual identity. Evaluate how compositional choices contribute to the aspirational or emotional tone. Analyse the influencer’s credibility and authority relative to the product category — consider their professional background, niche expertise, audience demographics, and how these factors amplify or undermine the authenticity of the endorsement. Assess the emotional resonance of the caption: does it use humour, aspiration, relatability, or urgency, and how does this tone interact with the visual content? Explain how all elements — image, caption, influencer identity, and audience context — work together to create a unified, persuasive message.”
The agent’s description:
“This Instagram post is a high-energy lifestyle advertisement for Aura Energy that achieves strong visual-brand cohesion and leverages influencer-audience alignment to maximise credibility. The coordinated blue wardrobe across both models creates a seamless extension of the brand’s visual identity. The dramatic low-angle shot places the subjects in an aspirational, dominant position, reinforcing the caption’s message of being ‘unstoppable.’ The female subject’s direct-to-camera product hold at the image’s focal point, together with her reflective sunglasses, conveys deliberate confidence, while the male subject’s relaxed posture adds authenticity. Critically, the influencer’s established presence in the fitness and lifestyle niche — with a predominantly young, health-conscious following — makes the ‘Zero Sugar’ energy drink endorsement feel credible rather than transactional. The caption’s tone (‘Focus your energy. Unlock your potential’) uses aspirational brevity that mirrors the visual intensity, creating a tight loop between what the viewer sees and what they read. The overall effect positions the product as an identity marker for an active, fashion-forward audience, rather than a commodity — a framing that this influencer’s audience is primed to respond to.”
This description now captures visual strategy, brand cohesion, compositional psychology, influencer credibility, audience alignment, and caption-image interaction, all in a single readable paragraph. The prediction error drops significantly. A marketing team reading this description can immediately see why the post works: the credibility is authentic, the visual identity is seamless, and the caption reinforces the image rather than competing with it.
The agent arrived at this level of descriptive richness not because someone hand-engineered the prompt, but because the predictive errors from Rounds 1 and 2 told it exactly what it was missing.
Real-world application: Lessons and implications
To validate the Prediction Optimisation Agent in a real-world scenario, it was tested using the dataset compiled by Kim et al. as part of their WSDM 2021 paper, “Discovering Undisclosed Paid Partnership on Social Media via Aspect-Attentive Sponsored Post Learning”, which is available for research purposes upon request from the authors. The dataset contains approximately 10.18 million posts spanning a diverse range of content categories and audience sizes. The results revealed key insights about both the agent’s learning dynamics and the practical implications for marketing teams.
The agent learns what matters autonomously
By processing its own historical error rates, the Prediction Optimisation Agent autonomously learned to rewrite its prompts, producing richer, more targeted post descriptions with every iteration, which in turn drove increasingly accurate predictions.

Figure 4: Autonomous Learning: The chart tracks the agent’s predictive performance (y-axis) across successive optimisation rounds (x-axis). Each point represents a full cycle of the agent’s loop: describe → predict → evaluate → rewrite. The trend demonstrates that as the agent iteratively refined its own prompt, guided by quantitative error metrics from prior rounds, forecast accuracy improved consistently and autonomously, without any human prompt engineering.
The agent’s optimisation works by feeding it the complete history of prior prompts alongside rigorous, quantitative error breakdowns from every previous round. Armed with this granular self-knowledge, the agent identifies precisely which content dimensions it has been under-analysing (e.g. production quality, humour style, credibility signals, visual-brand cohesion) and meticulously updates its prompt to capture those features in subsequent iterations. Each round’s prompt is built on the empirical lessons of every round before it, producing a compounding improvement curve where the descriptions become progressively more predictive and, as a direct consequence, more explanatory.
Conclusions
The Prediction Optimisation Agent demonstrates something that extends well beyond social media: natural-language prompts can be treated as tunable parameters, optimised autonomously by the AI itself. By allowing the agent to refine its own instructions through predictive error, the system progressively discovers what drives human engagement and expresses that knowledge in plain language.
For marketing teams, this represents a meaningful step beyond predictive tools that surface a score without exposing the reasoning behind it. When a team wants to understand why one campaign outperforms another, they do not need to interpret abstract model coefficients. They can compare the text profiles of a high-performing post and a low-performing one, side by side, and immediately see the differences the AI picked up on: one might highlight “authentic, candid composition with humour-driven caption and strong influencer-niche alignment,” while the other notes “generic studio shot with formulaic promotional language and weak audience-brand fit.” The patterns reveal themselves in plain English, and they are the right patterns, because the agent discovered them by optimising for predictive accuracy.
In practice, this means teams can run draft campaign concepts through the system before committing production and media budgets, getting a readable assessment of how the AI interprets the creative. Designers and copywriters can test variations of a post and compare descriptions side by side to see, in their own language, which direction resonates more strongly. And by normalizing visual and written media into a unified, readable format, brands can pair creative intuition with precise forecasting, treating creative assets as predictable drivers of revenue.
The same architectural pattern, semantic translation, error-driven prediction, and autonomous self-optimisation, is not limited to social media. Any domain where success depends on understanding the interplay of qualitative and quantitative signals, from political messaging to product design to entertainment, stands to benefit from systems that can read, reason, reflect, and improve on their own. The question is no longer whether AI can predict what resonates with people. It is how effectively we can build systems that refine that understanding autonomously, with human judgment guiding the outcome.
Ready to explore the specifics? Read our full technical deep dive into Self-Improving Performance Agent Pod for a closer look at our methodology.
Disclaimer: This content was created with AI assistance. All research and conclusions are the work of WPP Research.