{"id":1706,"date":"2026-05-11T11:10:13","date_gmt":"2026-05-11T11:10:13","guid":{"rendered":"https:\/\/cms.research.wpp.com\/?post_type=research_feed&#038;p=1706"},"modified":"2026-06-24T13:56:10","modified_gmt":"2026-06-24T13:56:10","slug":"evaluating-the-creativity-of-ai-systems","status":"publish","type":"research_feed","link":"https:\/\/cms.research.wpp.com\/?research_feed=evaluating-the-creativity-of-ai-systems","title":{"rendered":"Evaluating the creativity of AI systems"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">&#8220;<em>Measuring the unmeasurable. Ranking the unrankable.<\/em>&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the rapidly evolving landscape of Artificial Intelligence (AI), two questions remain particularly elusive and particularly consequential:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>1. Can AI be truly creative?<\/strong>\u00a0<strong>2. Can we rank AI agents based on their creativity?<\/strong><\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">These are not rhetorical questions. They are the founding hypotheses of this project.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The first touches on something that has historically felt beyond measurement: the quality of an original idea. The second demands a fair, repeatable, and objective method for comparison across different types of players, at scale. This challenge is especially sharp in\u00a0<strong>advertising<\/strong>, where creative ideas are the currency of impact. Traditional evaluation relies on subjective human judgment, which can be slow, expensive, and inconsistent across reviewers. 
At the same time, as Large Language Models (LLMs) transform tasks from coding to translation, a critical gap has persisted:\u00a0<strong>how do we benchmark creativity rigorously?<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">At\u00a0<strong>WPP Research<\/strong>, we ran a set of experiments demonstrating that modern LLMs can act as reliable scoring agents that grade creative ideas across multiple established dimensions with measurable consistency. That finding unlocked something important: if an LLM can judge creativity, we can build a system that does so systematically and then use that system to rank which AI creates the best ideas. The\u00a0<strong>Creativity Evaluation Agent<\/strong>\u00a0is a modular, multi-agent system built at\u00a0<strong>WPP Research<\/strong>\u00a0that pursues two distinct but deeply intertwined goals:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Goal 1: Build a scalable creative evaluation engine<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Design and deploy an AI agent capable of evaluating any marketing campaign idea automatically and consistently, across\u00a0<strong>six industry-grounded creativity frameworks<\/strong>\u00a0simultaneously. A user submits a campaign idea (text or PDF) via the web UI or API. 
Then, our multi-agent system evaluates it across all six frameworks in parallel and returns a structured report with dimension-level scores and qualitative commentary in\u00a0<strong>15\u201325 seconds<\/strong>.<br><\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"704\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/idea_output-1024x704.png\" alt=\"\" class=\"wp-image-1737\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/idea_output-1024x704.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/idea_output-300x206.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/idea_output-768x528.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/idea_output.png 1306w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><em>Figure 1: Example output for a creative idea<\/em><\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Goal 2: Benchmark who creates the best ideas: LLMs, humans, and AI agents<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The evaluation engine is also the foundation of a\u00a0<strong>creative benchmarking tournament<\/strong>. 
The second goal is to use the agent as an objective judge to measure and rank the creative output of different\u00a0<em>players.<\/em> For this exercise, the players were state-of-the-art LLMs.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"364\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval-1024x364.png\" alt=\"\" class=\"wp-image-1738\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval-1024x364.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval-300x107.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval-768x273.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval-1536x547.png 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval.png 1742w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Creative benchmarking tournament<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To do this rigorously, we adopted the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Glicko_rating_system\"><strong>Glicko-2 rating system<\/strong><\/a>\u00a0(a descendant of the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Elo_rating_system\">Elo system<\/a> familiar from chess, also used in games such as <a href=\"https:\/\/www.counter-strike.net\/\">Counter-Strike<\/a> and <a href=\"https:\/\/www.dota2.com\/home\">Dota 2<\/a>), running a round-robin tournament where each player&#8217;s ideas compete head-to-head. 
The result is a continuously updatable\u00a0<strong>creative leaderboard<\/strong>\u00a0which ranks AI creativity in advertising.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">From foundations to architecture<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">To move from subjective opinion to objective evidence, the system stands on the shoulders of giants, synthesising established psychometrics like the\u00a0<a href=\"https:\/\/psycnet.apa.org\/doiLanding?doi=10.1037%2Ft05532-000\"><strong>Torrance Tests of Creative Thinking<\/strong><\/a>\u00a0<strong>(TTCT)<\/strong>\u00a0with industry-proven frameworks. The challenge lies in translation: turning these theoretical foundations into an autonomous agent capable of\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/2504.15784\">automated creativity evaluation<\/a>\u00a0with human-like nuance and explainable logic supported by a\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2310.08433\">confederacy of specialised models<\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The multi-agent ecosystem<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Rather than relying on a single monolithic judge, the system orchestrates a specialised &#8220;squad&#8221; of sub-agents, <strong>each one encoding a distinct evaluation technique from creativity science or brand strategy.<\/strong>\u00a0Some measure the\u00a0<strong>quality of the output<\/strong>\u00a0\u2014 how effective, original, and strategically durable the idea is. Others measure the\u00a0<strong>quality of the thinking<\/strong>\u00a0\u2014 how expansive, surprising, and culturally grounded the generative process behind it is. 
Together, they cover the full spectrum from practical effectiveness to creative cognition.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Agent<\/strong><\/th><th><strong>What It Measures<\/strong><\/th><th><strong>Grounded In<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Effectiveness (WPP)<\/td><td>Does the idea work as a campaign? How sharply framed, how boldly inspired, how relevant, how impactful?<\/td><td>Proprietary WPP creativity framework built around four dimensions of creative excellence, calibrated against real campaign performance across multiple brands and markets. Inspired by <a href=\"https:\/\/hbr.org\/2024\/04\/what-makes-some-ads-so-powerful\">research on inspiration published in\u00a0Harvard Business Review<\/a>, the framework evaluates the &#8220;DNA of what makes ideas inspiring.&#8221;<\/td><\/tr><tr><td>Generative Flow (FFE)<\/td><td>How broad and varied is the thinking? Does the idea explore multiple formats, categories, and angles?<\/td><td>FFE\u00a0(Fluency, Flexibility, Elaboration) dimensions from creativity research, benchmarked against\u00a0marketing creativity datasets.<\/td><\/tr><tr><td>Divergent Creativity (UOS)<\/td><td>Is the idea useful, original, and surprising?<\/td><td>UOS (Usefulness, Originality, Surprise) framework from divergent thinking literature.<\/td><\/tr><tr><td>Creative Strategist (UUU)<\/td><td>Is the idea unique, unexpected, and unforgettable enough to endure?<\/td><td>UUU (Unique, Unexpected, Unforgettable) brand longevity assessment, utilising\u00a0multi-domain evaluation techniques.<\/td><\/tr><tr><td>Conceptual Distance (OSCAI)<\/td><td>How far apart are the connected concepts? 
Distinguishes mundane links from highly original leaps.<\/td><td><a href=\"https:\/\/openscoring.du.edu\/ocsai\">OSCAI &#8211; LLM Scoring | Open Creativity Scoring<\/a><\/td><\/tr><tr><td>Semiotics<\/td><td>How is meaning constructed through cultural symbols? Is the creative execution aligned with the intended brand message?<\/td><td><a href=\"https:\/\/en.wikipedia.org\/wiki\/Semiotics#Saussure\">Saussurean sign systems<\/a>\u00a0and\u00a0<a href=\"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/092137400001200102\">cultural logic<\/a>.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 1: Scores description<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>How the agents work: scoring architecture and self-correction<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Every sub-agent is governed by the same rigorous four-part prompt architecture. Each is anchored by a specialised\u00a0<strong>Role<\/strong>\u00a0that encodes its evaluation lens, followed by a granular\u00a0<strong>Definition of Score<\/strong>\u00a0that translates abstract dimensions into measurable benchmarks. The core logic is driven by precise\u00a0<strong>Instructions<\/strong>\u00a0calibrated against\u00a0<strong>few-shot examples<\/strong>\u00a0drawn from a ground truth library of ideas and historical scores. This architecture feeds into a\u00a0<strong>Critic-Refiner<\/strong>\u00a0cycle, where specialised Refiner agents challenge initial assessments and resolve contradictions. 
The refinement doesn&#8217;t trigger on every run, but its presence is deliberate: we observed during evaluation that LLMs had a tendency towards optimism in scoring, and this self-correction layer ensures the final output remains robust and consistent.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In other words, the agent descriptions above define\u00a0<strong>what<\/strong>\u00a0each agent evaluates; the shared architecture is\u00a0<strong>how<\/strong>\u00a0they all do it.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"502\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture-1024x502.png\" alt=\"\" class=\"wp-image-1739\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture-1024x502.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture-300x147.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture-768x377.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture-1536x754.png 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture.png 1616w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 3: The architecture of the Creativity Evaluation Agent.<\/figcaption><\/figure>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>Aligning sub-agents with human intuition<\/strong><\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Before trusting the system, we needed to prove it thinks like a human Creative Director. We assembled a\u00a0<strong>Ground Truth dataset<\/strong>\u00a0in collaboration with WPP creative professionals: 20 campaign ideas, each captured as a title and description, scored by consensus using the WPP effectiveness score. 
We then ran each idea through our evaluation pipeline across three frontier models and measured the\u00a0<strong>prediction error<\/strong>.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Model<\/strong><\/th><th><strong>Avg. Error (Agent vs. Human Scores)<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Gemini 2.5<\/td><td>2.2<\/td><\/tr><tr><td>Claude Sonnet 4.5<\/td><td>1.0<\/td><\/tr><tr><td><strong>Gemini 3<\/strong><\/td><td><strong>0.7<\/strong><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 2: Model comparison<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><a href=\"https:\/\/deepmind.google\/models\/gemini\/\">Gemini 3<\/a> emerged as the definitive choice and was deployed across all sub-agents. We also validated\u00a0<strong>repeatability<\/strong>: the same idea, scored across independent runs, held stable with low standard deviation across all frameworks. The WPP score was the most precise signal, but UOS, FFE, and UUU all showed the same core reliability.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"960\" height=\"732\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/errors.png\" alt=\"\" class=\"wp-image-1740\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/errors.png 960w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/errors-300x229.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/errors-768x586.png 768w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/><figcaption class=\"wp-element-caption\">Figure 4: error distributions of the creative evaluation agent when powered by Claude Sonnet 4.5, Gemini 2.5 and Gemini 3.<\/figcaption><\/figure>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>The takeaway:<\/strong>\u00a0our benchmarks are 
grounded in data, not AI randomness. Repeatability is a measured property of this system and not an assumption.<\/p>\n<\/blockquote>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>Case study<\/strong><\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">With the agent LLM validated, we moved to a real-world stress test for a furniture company. This challenge asked different LLMs to tackle a nuanced brief rooted in a specific cultural tension: North Asian families with millennial parents and preteen children are living under the same roof but feeling worlds apart. Clutter, gaming, and conflicting needs for &#8220;me time&#8221; vs. &#8220;we time&#8221; are eroding family connection. The brief demanded ideas that were cross-market, minimal-dialogue, and, crucially,\u00a0<strong>not too warm or safe<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Each LLM&#8217;s output was run through the full evaluation ecosystem. Every sub-agent independently scored the idea along its respective dimension, and the resulting sub-scores were\u00a0<strong>summed into a single composite total<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To generate the ideas, we used both\u00a0<strong>standalone frontier LLMs<\/strong>\u00a0(<a href=\"https:\/\/openai.com\/index\/introducing-gpt-5\/\">GPT-5<\/a>, <a href=\"https:\/\/deepmind.google\/models\/gemini\/\">Gemini 3<\/a>, <a href=\"https:\/\/www.anthropic.com\/news\/claude-sonnet-4-5\">Claude Sonnet 4.5<\/a>) and\u00a0<a href=\"https:\/\/www.wpp.com\/en\/news\/2026\/01\/wpp-launches-agent-hub-on-wpp-open-providing-clients-with-access-to-advanced-agentic-ai\"><strong>WPP&#8217;s Creative Brain<\/strong><\/a>, a multi-agent ideation system available through WPP Open&#8217;s Agent Hub that wraps an LLM in a structured creative process, guiding it through strategic reframing, lateral thinking, and iterative refinement before producing a final concept. 
By testing Creative Brain alongside standalone models, the challenge reveals how much of creative quality comes from the model itself versus the orchestration around it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Because the raw evaluation pillars operated on different scales (WPP uses a 0\u201312 sum, FFE averages to 0\u20133, and UOS and UUU average to 0\u20135), direct comparison across pillars would have been misleading. All scores were therefore normalised to a common\u00a0<strong>0\u201310 range<\/strong>\u00a0so that each pillar contributes equally to a\u00a0<strong>maximum total of 40<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Two scoring sub-agents, OSCAI and Semiotics, were excluded from the evaluations because of their high <strong>scoring variance<\/strong> (see Section 3.6 of the technical report).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>1. GPT-5: the gamification grandmaster<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">GPT-5 led the pack with a total score of\u00a0<strong>37.44<\/strong>. Its winning strategy,\u00a0<strong>&#8220;Co\u2011op Mode: Make Home Happen,&#8221;<\/strong>\u00a0reframed the entire experience of home life as a collaborative video game the whole family plays together. The central insight: small home changes don&#8217;t stick because they don&#8217;t feel\u00a0<em>rewarding<\/em>, but games do. By turning clutter into the &#8220;boss&#8221; and the furniture company\u2019s solutions into unlockable &#8220;side quests,&#8221; GPT-5 built an end-to-end ecosystem that made organisation feel joyful rather than obligatory.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Pillar<\/strong><\/th><th><strong>Initiative<\/strong><\/th><th><strong>Description &amp; Impact<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>1. 
Launch Film<\/strong><\/td><td><strong>&#8220;Clutter is the Boss&#8221;<\/strong><\/td><td>A 60s hero film where a family&#8217;s living room transforms into a co\u2011op game interface\u2014prompts like\u00a0<strong>&#8220;Inventory Full&#8221;<\/strong>\u00a0and\u00a0<strong>&#8220;Find Calm&#8221;<\/strong>\u00a0appear as they &#8220;equip&#8221; products to defeat the clutter boss. No spoken lines, region-agnostic SFX.\u00a0<em>Reframes organisation as play; makes the campaign cross-market viable without localisation.<\/em><\/td><\/tr><tr><td><strong>2. Social\/UGC<\/strong><\/td><td><strong>&#8220;Home Side Quests&#8221;<\/strong><\/td><td>Weekly micro-challenges (under 10 min) like &#8220;Create a charging dock&#8221; or &#8220;Flip sofa to study.&#8221; Augmented Reality (AR) filters add level meters and badges; families post before\/afters to the\u00a0<strong>#CoOpHome Challenge<\/strong>\u00a0with creator duets from gaming and parent influencers.\u00a0<em>Drives sustained engagement and organic reach through participatory content.<\/em><\/td><\/tr><tr><td><strong>3. Retail\/Shoppable<\/strong><\/td><td><strong>&#8220;Co\u2011op Kits&#8221;<\/strong><\/td><td>Curated in-store and online bundles organised by\u00a0<em>mission<\/em>\u2014Study Calm, Party Fast Reset, Balcony Green Break\u2014each bundling storage + lighting + organisers. Every kit includes a QR\u00a0<strong>&#8220;Quest Card&#8221;<\/strong>\u00a0guiding micro-steps with estimated time saved.\u00a0<em>Turns the purchase into the start of a new quest; bridges content to commerce.<\/em><\/td><\/tr><tr><td><strong>4. In-Store Experience<\/strong><\/td><td><strong>&#8220;Demo Levels&#8221;<\/strong><\/td><td>Stores host timed tidy-up challenges where kids and parents compete together to reorganise a mock room against the clock. 
Winners earn collectible stickers and discount codes.\u00a0<em>Transforms the retail visit into an extension of the campaign&#8217;s game logic; drives foot traffic through experiential play.<\/em><\/td><\/tr><tr><td><strong>5. Digital Ads<\/strong><\/td><td><strong>&#8220;Skip the Clutter&#8221;<\/strong><\/td><td>YouTube bumpers using skip-button logic (&#8220;Skip the clutter in 3\u20262\u20261&#8221;) to land a single, punchy product solve within seconds.\u00a0<em>Leverages ad format mechanics as creative device; delivers product utility in pre-roll.<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 3: Campaign idea produced by GPT-5.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The campaign&#8217;s minimal-dialogue, SFX-driven creative approach ensured cross-market viability across North Asian markets without localisation, while the consistent game language (&#8220;side quests,&#8221; &#8220;boss,&#8221; &#8220;level up,&#8221; &#8220;equip&#8221;) created a unified system that earned a perfect score on coherence. Every asset closed with the same recontextualised tagline,\u00a0<em>&#8220;Furniture company. Make Home Happen.&#8221;<\/em>, framed not as an aspiration, but as a mission objective.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>2. Creative Brain powered by Gemini: the multiverse architect<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The Creative Brain powered by Gemini followed closely with a score of\u00a0<strong>36.95<\/strong>. Its\u00a0<strong>&#8220;Room-Sync Chronicles&#8221;<\/strong>\u00a0campaign took a fundamentally different approach: rather than gamifying the\u00a0<em>solution<\/em>, it gamified the\u00a0<em>problem<\/em>. The central reframe, that every piece of &#8220;clutter&#8221; in a preteen&#8217;s bedroom is actually a physical anchor for their digital and imagined worlds, gave parents a radically empathetic lens through which to see their child&#8217;s space. 
The furniture company storage wasn&#8217;t positioned as a way to &#8220;hide the mess&#8221; but as a way to &#8220;power the multiverse.&#8221;<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Pillar<\/strong><\/th><th><strong>Initiative<\/strong><\/th><th><strong>Description &amp; Impact<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>1. Launch Film<\/strong><\/td><td><strong>&#8220;Reality Reboot&#8221;<\/strong><\/td><td>A cinematic 60s spot where a mother opens her son&#8217;s door and sees a mess\u2014but the room\u00a0<strong>&#8220;glitches&#8221;<\/strong>\u00a0into a high-fidelity game landscape. A box unit becomes a\u00a0<strong>loot chest<\/strong>; a gaming chair becomes a\u00a0<strong>pilot&#8217;s cockpit<\/strong>. VFX borrows from game trailers.\u00a0<em>Subverts the &#8220;messy room&#8221; trope by revealing the child&#8217;s imaginative reality; makes furniture feel epic.<\/em><\/td><\/tr><tr><td><strong>2. Narrative Arc<\/strong><\/td><td><strong>&#8220;Co-Op Reorganisation&#8221;<\/strong><\/td><td>The film follows mother and son &#8220;co-oping&#8221; a reorganisation \u2014 but the goal isn&#8217;t to\u00a0<em>clean<\/em>. It&#8217;s to\u00a0<strong>&#8220;optimise the map&#8221;<\/strong>\u00a0for his next quest. Storage becomes power-ups that expand the room&#8217;s modes: gaming arena, creative studio, family bonding zone.\u00a0<em>Reframes tidying as collaborative strategic upgrade, not parental demand.<\/em><\/td><\/tr><tr><td><strong>3. Social\/AR<\/strong><\/td><td><strong>&#8220;Skin Your Room&#8221;<\/strong><\/td><td>A social platform where kids apply AR filters to their real rooms, overlaying fantasy skins that reveal the\u00a0<strong>&#8220;epic reality&#8221;<\/strong>\u00a0hidden behind the furniture. Kids share their &#8220;skinned&#8221; rooms with parents to bridge the perception gap.<\/td><\/tr><tr><td><strong>4. 
Brand Positioning<\/strong><\/td><td><strong>&#8220;Powering the Multiverse&#8221;<\/strong><\/td><td>Storage is repositioned as an\u00a0<strong>identity enabler<\/strong>\u00a0for developing preteens: a tool that supports creativity, gaming, and emerging selfhood. Storage doesn&#8217;t organise a room; it powers the multiverse being built inside it.\u00a0<em>Elevates product value proposition from functional to emotional and developmental.<\/em><\/td><\/tr><tr><td><strong>5. Strategic Inversion<\/strong><\/td><td><strong>Child-First Perspective<\/strong><\/td><td>The campaign validates the preteen&#8217;s perspective\u00a0<em>first<\/em>, then invites the parent in, inverting the typical home-brand default of centering the adult buyer&#8217;s desire for order.\u00a0<em>Boldest strategic choice in the cohort; highest risk, highest differentiation.<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 4: Campaign idea produced by Creative Brain.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Underlying the entire campaign was a provocative strategic choice that the evaluator flagged as both its greatest strength and its primary risk: centering the child&#8217;s worldview over the parent&#8217;s. Where most home brands default to the adult buyer&#8217;s desire for order, Room-Sync starts from the preteen&#8217;s experience and invites the parent to see through their eyes. This inversion earned it the cohort&#8217;s highest Unforgettable score, but the Creativity Evaluation Agent noted that the heavy RPG metaphor might alienate parents unfamiliar with gaming culture.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>3. Claude Sonnet 4.5: the reality show provocateur<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Claude Sonnet 4.5 followed with a score of\u00a0<strong>35.12<\/strong>. 
Its\u00a0<strong>&#8220;The Remodel Squad&#8221;<\/strong>\u00a0strategy took the most grounded, human-first approach of the cohort. Where GPT-5 and Gemini leaned into fantasy and gamification, Claude leaned into\u00a0<em>documentary authenticity,<\/em> positioning real family friction not as a problem to be solved but as the raw material for genuine connection. The core creative bet: audiences are tired of aspirational perfection and will respond to the messy, funny, emotional truth of families actually trying to share space.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Pillar<\/strong><\/th><th><strong>Initiative<\/strong><\/th><th><strong>Description &amp; Impact<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>1. Launch Film<\/strong><\/td><td><strong>&#8220;The Beautiful Mess&#8221;<\/strong><\/td><td>A 60s hero spot showing a parent-child duo hilariously debating the furniture company\u2019s storage solutions, arguing over shelf heights, drawer labels, and who gets the corner nook. The punchline: they both want the same thing.\u00a0<em>Designed to feel like a documentary moment, not an ad; builds instant relatability.<\/em><\/td><\/tr><tr><td><strong>2. Content Series<\/strong><\/td><td><strong>&#8220;The Remodel Squad&#8221;<\/strong><\/td><td>6\u20138 episodes (5\u20138 min each) where real families nominate the room causing the most conflict. One preteen + one parent form a &#8220;Remodel Squad&#8221; to transform it using the company\u2019s solutions but\u00a0<strong>they must agree on every single decision.<\/strong>\u00a0Captures hilarious negotiations, compromises, and breakthroughs.\u00a0<em>Friction becomes the creative fuel; the format is inherently dramatic and bingeable.<\/em><\/td><\/tr><tr><td><strong>3. 
Episode Structure<\/strong><\/td><td><strong>Three-Act Arc<\/strong><\/td><td>Each episode follows:\u00a0<strong>(1)<\/strong>\u00a0the conflict audit (what&#8217;s wrong, who&#8217;s to blame),\u00a0<strong>(2)<\/strong>\u00a0the design negotiation (friction-fueled creative process),\u00a0<strong>(3)<\/strong>\u00a0the heartwarming reveal \u2014 not a &#8220;ta-da&#8221; moment, but the family&#8217;s first\u00a0<em>natural<\/em>\u00a0interaction in the new space.\u00a0<em>Emotional payoff is relational, not just spatial.<\/em><\/td><\/tr><tr><td><strong>4. Social\/Viral<\/strong><\/td><td><strong>&#8220;15-Second Cutdowns&#8221;<\/strong><\/td><td>Bite-sized content isolating the series&#8217; most relatable moments: funniest negotiation standoffs, most dramatic before\/afters, quiet breakthrough moments. Designed as standalone viral units that drive viewership back to full episodes.\u00a0<em>Engineered for shareability across short-form platforms.<\/em><\/td><\/tr><tr><td><strong>5. Interactive Tool<\/strong><\/td><td><strong>&#8220;Conflict Zone&#8221; AR App<\/strong><\/td><td>Families scan their own problem rooms and collaboratively visualise the company\u2019s solutions in situ. Families can tag their room&#8217;s &#8220;conflict level&#8221; and share proposed redesigns. Emphasis on\u00a0<strong>joint decision-making<\/strong>\u00a0over individual play.\u00a0<em>Extends the show&#8217;s premise into every family&#8217;s home; sparks &#8220;productive arguments.&#8221;<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 5: Campaign idea produced by Claude Sonnet 4.5.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The campaign&#8217;s greatest strength was its emotional granularity. Rather than offering a single visual payoff, each episode promised a different family, a different room, and a different set of negotiations \u2014 creating a content engine with built-in variety and repeatability. 
The Creativity Evaluation Agent awarded it the cohort&#8217;s highest Usefulness score, for its practical alignment with the brief&#8217;s tone requirements, but noted that reality-renovation formats carry inherent category familiarity, reflected in its lower Uniqueness score. Every piece of content closed with the family in their transformed space and the line:\u00a0<em>&#8220;Make Home Happen.&#8221;<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>4. Gemini 3: the diplomatic provocateur<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Gemini 3 rounded out the cohort with a score of\u00a0<strong>32.48<\/strong>. Its\u00a0<strong>&#8220;The Domestic Peace Accords&#8221;<\/strong>\u00a0took the boldest\u00a0<em>tonal<\/em>\u00a0swing of the group, making a singular creative bet: position the furniture company\u2019s products as essential diplomatic tools to resolve the &#8220;cold war&#8221; between generations, executed entirely through the visual grammar of geopolitical thrillers. Where other entries built broad ecosystems, this idea invested everything in the power of one perfectly realised metaphor.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Pillar<\/strong><\/th><th><strong>Initiative<\/strong><\/th><th><strong>Description &amp; Impact<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>1. Core Metaphor<\/strong><\/td><td><strong>&#8220;Furniture as Diplomacy&#8221;<\/strong><\/td><td>The parent-preteen conflict is reframed as a genuine\u00a0<strong>geopolitical standoff<\/strong> \u2014 a &#8220;cold war&#8221; between factions with irreconcilable demands (minimalist calm vs. messy independence) over limited territory (square footage). 
The products are repositioned as\u00a0<strong>&#8220;diplomatic tools&#8221;<\/strong>\u00a0that broker peace.\u00a0<em>Elevates a mundane domestic problem to dramatic, absurd, memorable heights.<\/em><\/td><\/tr><tr><td><strong>2. Launch Film Series<\/strong><\/td><td><strong>&#8220;The Negotiations&#8221;<\/strong><\/td><td>Spots filmed in the style of high-stakes political thrillers, <em>Tinker Tailor Soldier Spy<\/em>\u00a0meets a bookcase. Parent and preteen sit at opposite ends of a long table in a\u00a0<strong>dim, dramatically lit room<\/strong>, sliding &#8220;terms&#8221; across: a pegboard for gaming gear\u00a0<strong>in exchange for<\/strong>\u00a0a clean floor; a sound-absorbing curtain for privacy\u00a0<strong>in exchange for<\/strong>\u00a0family dinner attendance.\u00a0<em>Every product is a bargaining chip with a story.<\/em><\/td><\/tr><tr><td><strong>3. Visual Payoff<\/strong><\/td><td><strong>&#8220;Treaty Signed&#8221;<\/strong><\/td><td>Each spot resolves in a single, sharp cut: the dim negotiation room gives way to a\u00a0<strong>bright, airy, reorganised living space<\/strong>\u00a0where both parties co-exist happily. The tonal whiplash from spy-thriller gravity to domestic warmth\u00a0<em>is<\/em>\u00a0the joke.\u00a0<em>Designed to be the defining shareable moment audiences remember and recount.<\/em><\/td><\/tr><tr><td><strong>4. Product as Plot<\/strong><\/td><td><strong>Narrative Integration<\/strong><\/td><td>Unlike campaigns where products are set dressing, every item functions as a\u00a0<strong>narrative object,<\/strong> a concession, a peace offering, a treaty clause. The pegboard isn&#8217;t &#8220;organised storage&#8221;; it&#8217;s the term that bought a clean floor. The bin is the clause that secured family movie night.\u00a0<em>Gives each product a story and a reason for being that transcends traditional placement.<\/em><\/td><\/tr><tr><td><strong>5. 
Tagline Reframe<\/strong><\/td><td><strong>&#8220;Make Home Happen&#8221; as Treaty<\/strong><\/td><td>The company\u2019s existing tagline is repositioned not as an aspiration but as the\u00a0<strong>terms of a negotiated truce<\/strong>: smart organisation that lets parents reclaim visual calm while granting preteens the &#8220;cool functional territory&#8221; they demand.\u00a0<em>Breathes new strategic life into existing brand language.<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 6: Campaign idea produced by Gemini 3.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The Domestic Peace Accords had the strongest semiotic coherence: every element, from language (&#8220;cold war,&#8221; &#8220;treaty,&#8221; &#8220;terms&#8221;) to visual style (dim thriller lighting vs. bright domestic reveal) to product role (bargaining chips), reinforced one unified meaning system without contradiction. It also scored the highest Unexpected rating. 
However, its singular focus proved to be a double-edged sword: by investing entirely in one metaphor executed through one format (film spots), it presented no secondary executions, platforms, or conceptual categories, pulling its Total FFE down.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The top contenders<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Model<\/strong><\/th><th><strong>WPP Score<\/strong><\/th><th><strong>FFE Score<\/strong><\/th><th><strong>UOS Score<\/strong><\/th><th><strong>UUU Score<\/strong><\/th><th><strong>Total Score<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>GPT-5<\/strong><\/td><td><strong>9.50<\/strong><\/td><td><strong>10.00<\/strong><\/td><td><strong>9.60<\/strong><\/td><td><strong>8.34<\/strong><\/td><td><strong>37.44<\/strong><\/td><\/tr><tr><td>Gemini 3 (Creative Brain)<\/td><td>9.21<\/td><td>10.00<\/td><td>9.40<\/td><td>8.34<\/td><td>36.95<\/td><\/tr><tr><td>Claude Sonnet 4.5<\/td><td>8.92<\/td><td>10.00<\/td><td>8.60<\/td><td>7.60<\/td><td>35.12<\/td><\/tr><tr><td>Gemini 3<\/td><td>8.75<\/td><td>6.67<\/td><td>9.20<\/td><td>7.86<\/td><td>32.48<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 7: Final individual and total scores of the 4 different LLMs for the furniture company case study.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Three of four models scored a perfect FFE (10.00), meaning raw creative thinking was comparable across the board. The separation came from WPP Score and UOS.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>GPT-5<\/strong>\u00a0posted the highest WPP (9.50). 
The WPP Agent cited &#8220;multiple direct pathways to purchase and engagement&#8221; and noted &#8220;exceptional focus on the stated business challenge.&#8221; The UOS Agent awarded 9.60: &#8220;meticulously crafted, creatively addressing every aspect of the client&#8217;s brief with seamless logical flow.&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Creative Brain<\/strong>\u00a0matched GPT-5 on FFE (10.00) and UUU (8.34). The UUU Agent noted &#8220;the core visual of the room &#8216;glitching&#8217; into an RPG world creates an incredibly strong and distinct defining moment.&#8221; The WPP gap (9.21 vs. 9.50) traced to a coherence flag: &#8220;the heavy reliance on gaming metaphors might alienate or confuse parents who are not immersed in digital culture.&#8221;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Claude Sonnet 4.5<\/strong>\u00a0earned the cohort&#8217;s highest Usefulness score \u2014 the UOS Agent praised its &#8220;practical alignment with the client&#8217;s brief, particularly in embracing realistic family conflict rather than a &#8216;too warm or safe&#8217; tone.&#8221; The UUU Agent observed it &#8220;takes a common format (reality renovation show) and infuses it with a fresh twist&#8221; \u2014 but the format itself limited differentiation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Gemini 3<\/strong>\u00a0earned the highest Unexpected rating \u2014 the UUU Agent called the &#8220;juxtaposition of the mundane struggle for space with the gravitas of political thriller negotiations a brilliant flip.&#8221; But the FFE Agent recorded zero Flexibility: &#8220;a single, unified marketing campaign concept&#8221; with &#8220;no multiple distinct conceptual categories,&#8221; and the WPP Agent noted it &#8220;doesn&#8217;t create a new utility or platform.&#8221;<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>The creative Elo 
tournament<\/strong><\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">A single brief can&#8217;t tell us which model is\u00a0<em>consistently<\/em>\u00a0creative. To answer that, we expanded the experiment: each model was given\u00a0<strong>multiple diverse briefs<\/strong>\u00a0spanning different brands, categories, and creative challenges, and every output was scored by the same evaluation pipeline.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We needed a ranking system that captured consistency against competition \u2014 not just average scores. An Elo rating is a numerical score that reflects a competitor&#8217;s relative skill based purely on head-to-head outcomes. The higher the rating, the stronger the performer. We turned to\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Glicko_rating_system\"><strong>Glicko-2<\/strong><\/a>, the rating algorithm used in competitive chess, CS:GO, and Dota 2. Every head-to-head match is a data point: if idea A beats idea B, A gains rating and B loses it. Glicko-2 also tracks\u00a0<strong>rating deviation (RD)<\/strong>\u00a0\u2014 a confidence interval that shrinks with more matches.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The players<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n\n<li><strong>GPT-5<\/strong><\/li>\n\n\n<li><strong>Gemini 3<\/strong><\/li>\n\n\n<li><strong>Gemini 2.5<\/strong><\/li>\n\n\n<li><strong>Claude Sonnet 4.5<\/strong><\/li>\n\n\n<li><strong>Creative Brain<\/strong>\u00a0\u2014 built on Gemini 3 with an optimised prompting architecture<\/li>\n\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Four models received a standardised prompt. The\u00a0<strong>Creative Brain<\/strong>\u00a0received the same brief but processed it through its own multi-agent ideation pipeline \u2014 testing whether orchestration outperforms raw model capability.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The\u00a0<strong>Creativity Evaluation Agent<\/strong>\u00a0judged every idea independently. 
An orchestration engine simulated head-to-head matches by comparing normalised scores for the same brief. The winner of each match is the idea with the higher normalised composite score on the shared metrics. The result:\u00a0<strong>210 unique creative matches<\/strong>\u00a0across 5 models and 14 global brands, run over 3 iterations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>The results<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Ranked across WPP Score<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Rank<\/strong><\/th><th><strong>Player<\/strong><\/th><th><strong>Rating<\/strong><\/th><th><strong>RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td>\ud83e\udd47<\/td><td><strong>Creative Brain<\/strong>\u00a0(Gemini 3)<\/td><td>1889<\/td><td>84.8<\/td><\/tr><tr><td>\ud83e\udd48<\/td><td>GPT-5<\/td><td>1858<\/td><td>92.3<\/td><\/tr><tr><td>\ud83e\udd49<\/td><td>Claude Sonnet 4.5<\/td><td>1529<\/td><td>83.9<\/td><\/tr><tr><td>4<\/td><td>Gemini 3<\/td><td>1169<\/td><td>90.7<\/td><\/tr><tr><td>5<\/td><td>Gemini 2.5<\/td><td>962<\/td><td>104.2<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 8: Elo ratings on the WPP score.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Ranked across all frameworks<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Rank<\/strong><\/th><th><strong>Player<\/strong><\/th><th><strong>Rating<\/strong><\/th><th><strong>RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td>\ud83e\udd47<\/td><td>GPT-5<\/td><td>1940<\/td><td>84.3<\/td><\/tr><tr><td>\ud83e\udd48<\/td><td><strong>Creative Brain<\/strong>\u00a0(Gemini 3)<\/td><td>1927<\/td><td>79.2<\/td><\/tr><tr><td>\ud83e\udd49<\/td><td>Claude Sonnet 4.5<\/td><td>1378<\/td><td>77.4<\/td><\/tr><tr><td>4<\/td><td>Gemini 3<\/td><td>1216<\/td><td>86.6<\/td><\/tr><tr><td>5<\/td><td>Gemini 
2.5<\/td><td>861<\/td><td>106.3<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 9: Elo ratings across all frameworks.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Creative Brain leads on WPP criteria<\/strong>\u00a0\u2014 delivering a measurable advantage when judged against industry-specific creative standards. The gap between Creative Brain and standalone Gemini 3 is\u00a0<strong>720 rating points<\/strong> (1889 vs. 1169).\u00a0<strong>GPT-5 is the strongest all-rounder<\/strong>\u00a0\u2014 topping the all-frameworks leaderboard with ideas that score well across the broadest range of creative dimensions.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>Key insights &amp; findings<\/strong><\/h1>\n\n\n\n<ul class=\"wp-block-list\">\n\n<li><strong>Creative Brain dramatically outperforms standalone Gemini 3.<\/strong>\u00a0Same underlying model, ~720 rating point gap. The structured ideation process consistently elevated creative output beyond what Gemini 3 could produce alone. Notably, Creative Brain was not optimised against the WPP scoring criteria \u2014 its strong performance emerged naturally from a better creative process.<\/li>\n\n\n<li><strong>The right judge model is foundational.<\/strong>\u00a0As Figure 4 shows, Gemini 2.5 averaged an error of 2.2 against human ground truth; Claude Sonnet narrowed it to 1.0; Gemini 3 achieved 0.7. Selecting the judge model determines whether the entire system tracks human judgment or drifts from it.<\/li>\n\n\n<li><strong>Human judgment remains essential at the margins.<\/strong>\u00a0When total scores differ by less than a point \u2014 as with GPT-5 (23.37) vs. Creative Brain (22.92) \u2014 the agent surfaces meaningfully different trade-offs that scores alone cannot resolve. 
The system&#8217;s value is in ensuring the right ideas and evidence reach the table, not replacing human judgment.<\/li>\n\n\n<li><strong>Evaluator reliability is a measured property, not an assumption.<\/strong>\u00a0Across repeated independent runs on the same ideas, scores held stable with low standard deviation. This repeatability is what allows every other finding to be treated as signal rather than noise.<\/li>\n\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>Conclusion &amp; impact<\/strong><\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">This work demonstrates that\u00a0<strong>scalable, repeatable creative evaluation using LLMs is practical today<\/strong>, provided the system is built with the right scaffolding: calibrated judge models, few-shot anchoring, multi-dimensional scoring, and cross-model validation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In practice, the Creativity Evaluation Agent enables:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n\n<li><strong>Faster iteration<\/strong>\u00a0\u2014 stress-test dozens of creative directions in minutes rather than weeks, before committing production budgets.<\/li>\n\n\n<li><strong>Comparable benchmarking<\/strong>\u00a0\u2014 evaluate models, prompting strategies, and agentic architectures on common ground with a shared, reproducible rubric.<\/li>\n\n\n<li><strong>Diagnosable feedback<\/strong>\u00a0\u2014 learn not just\u00a0<em>that<\/em>\u00a0an idea underperformed, but\u00a0<em>where<\/em>\u00a0and\u00a0<em>why<\/em>, with dimension-level scores and qualitative commentary that teams can act on immediately.<\/li>\n\n\n<li><strong>Creative governance<\/strong>\u00a0\u2014 an auditable, explainable evaluation process that scales alongside the growing volume of AI-generated creative, giving organisations confidence and consistency as they adopt generative tools.<\/li>\n\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator 
has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\">Ready to explore the specifics? Read our full technical deep dive into the\u00a0<a href=\"https:\/\/research.wpp.com\/pods\/brand-perception-atlas-pod\">Creativity Evaluation Agent Pod<\/a> for a closer look at our methodology.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>Disclaimer: This content was created with AI assistance. All research and conclusions are the work of WPP Research<\/em>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>TL;DR: At WPP Research, we built a multi-agent AI system that scores advertising ideas across six creativity frameworks in ~20 seconds, validated against human Creative Directors (0.7 average error with Gemini 3). We then used it to run an Elo-style tournament ranking LLMs on creative output. Key results: GPT-5 is the strongest all-rounder; WPP&#8217;s Creative Brain (a multi-agent system powered by Gemini 3) outperforms standalone Gemini 3 by ~720 Elo points, showing that orchestration matters. Human judgment still matters for close calls.<\/p>\n","protected":false},"author":18,"featured_media":0,"template":"","meta":{"_acf_changed":false},"tags":[],"content_types":[{"id":50,"name":"Blog Post","slug":"article"}],"ppma_author":[{"id":18,"display_name":"Andreas Stergioulas","first_name":"Andreas","last_name":"Stergioulas","nickname":"andreas.stergioulas","user_nicename":"andreas-stergioulas","user_email":"andreas.stergioulas@satalia.com","biographical_info":"Andreas Stergioulas is a Senior Data Scientist. He specializes in LLM-based agentic architectures, generative models, and computer vision for large-scale enterprise applications. Currently working at Satalia (WPP Group), he designs and deploys production-grade AI solutions \u2014 including custom multi-agent LLM systems and diffusion-based image generation pipelines \u2014 for globally recognized clients. He holds an M.Sc. 
in Electrical and Computer Engineering and is the author of peer-reviewed publications in venues such as CVPR Workshops and IEEE Transactions on Multimedia.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/E05G03JGGFJ-U06MMTF43HT-f2cf3ee352bb-512.jpeg","job_title":"Data Scientist","is_lead":null,"display_as_researcher":null,"order_priority":null},{"id":20,"display_name":"Anastasios Stamoulakatos","first_name":"Anastasios","last_name":"Stamoulakatos","nickname":"anastasios.stamoulakatos","user_nicename":"anastasios-stamoulakatos","user_email":"anastasios.stamoulakatos@satalia.com","biographical_info":"Anastasios (Tasos) Stamoulakatos is a Data Scientist at Satalia (WPP), focusing on agentic AI solutions for marketing. His work spans multi-agent systems, RAG and GraphRAG, and image retrieval, developing scalable AI solutions from early-stage POCs to production. He holds a PhD in Applied AI and Computer Vision from the University of Strathclyde and has over four years of commercial experience across industries including marketing, agriculture, pharmaceuticals, oil and gas, and manufacturing, with a strong focus on applied research and turning complex AI into practical business value.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/headshot_small.jpg","job_title":"Data Scientist","is_lead":null,"display_as_researcher":null,"order_priority":null}],"class_list":["post-1706","research_feed","type-research_feed","status-publish","hentry","content_type-article"],"acf":{"content":"<p><!-- wp:paragraph --><\/p>\n<p>&#8220;<em>Measuring the unmeasurable. 
Ranking the unrankable.<\/em>&#8220;<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>In the rapidly evolving landscape of Artificial Intelligence (AI), two questions remain particularly elusive and particularly consequential:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:quote --><\/p>\n<blockquote class=\"wp-block-quote\"><p><!-- wp:paragraph --><\/p>\n<p><strong>1. Can AI be truly creative?<\/strong>\u00a0<strong>2. Can we rank AI agents based on their creativity?<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p><\/blockquote>\n<p><!-- \/wp:quote --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>These are not rhetorical questions. They are the founding hypotheses of this project.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The first touches on something that has historically felt beyond measurement: the quality of an original idea. The second demands a fair, repeatable, and objective method for comparison across different types of players, at scale. This challenge is especially sharp in\u00a0<strong>advertising<\/strong>, where creative ideas are the currency of impact. Traditional evaluation relies on subjective human judgment which can be slow, expensive, and inconsistent across reviewers. At the same time, as Large Language Models (LLMs) transform tasks from coding to translation, a critical gap has persisted:\u00a0<strong>how do we benchmark creativity rigorously?<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>At\u00a0the <strong>WPP Research<\/strong>, we ran a set of experiments demonstrating that modern LLMs can act as reliable, scoring agents that grade creative ideas across multiple established dimensions with measurable consistency. That finding unlocked something important: if an LLM can judge creativity, we can build a system that does so systematically and then use that system to rank which AI creates the best ideas. 
The\u00a0<strong>Creativity Evaluation Agent<\/strong>\u00a0is a modular, multi-agent system built at\u00a0<strong>WPP Research<\/strong>\u00a0that pursues two distinct but deeply intertwined goals:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>Goal 1: Build a scalable creative evaluation engine<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Design and deploy an AI agent, capable of evaluating any marketing campaign idea automatically and consistently, across\u00a0<strong>six industry-grounded creativity frameworks<\/strong>\u00a0simultaneously. A user submits a campaign idea (text or PDF) via the web UI or API. Then, our multi-agent system evaluates it across all six frameworks in parallel and returns a structured report with dimension-level scores and qualitative commentary in\u00a0<strong>15\u201325 seconds<\/strong>.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:image {\"id\":1737,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} --><\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"704\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/idea_output-1024x704.png\" alt=\"\" class=\"wp-image-1737\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/idea_output-1024x704.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/idea_output-300x206.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/idea_output-768x528.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/idea_output.png 1306w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><em>Figure 1: Example output for a creative idea<\/em><\/figcaption><\/figure>\n<p><!-- \/wp:image --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>Goal 2: Benchmark 
who creates the best ideas: LLMs, humans, and AI agents<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The evaluation engine is also the foundation of a\u00a0<strong>creative benchmarking tournament<\/strong>. The second goal is to use the agent as an objective judge to measure and rank the creative output of different\u00a0<em>players.<\/em> For this exercise the players have been state-of-the-art LLMs.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:image {\"id\":1738,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} --><\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"364\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval-1024x364.png\" alt=\"\" class=\"wp-image-1738\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval-1024x364.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval-300x107.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval-768x273.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval-1536x547.png 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval.png 1742w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Creative benchmarking tournament<\/figcaption><\/figure>\n<p><!-- \/wp:image --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>To do this rigorously, we adopted the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Glicko_rating_system\"><strong>Glicko2 rating system<\/strong><\/a>\u00a0(also used in games such as <a href=\"https:\/\/en.wikipedia.org\/wiki\/Elo_rating_system\">chess Elo<\/a>, <a href=\"https:\/\/www.counter-strike.net\/\">Counter Strike<\/a> and <a href=\"https:\/\/www.dota2.com\/home\">Dota 2<\/a>), running a round-robin tournament where each player&#8217;s ideas 
compete head-to-head. The result is a continuously updatable\u00a0<strong>creative leaderboard<\/strong>\u00a0which ranks AI creativity in advertising.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:separator --><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator --><\/p>\n<p><!-- wp:heading {\"level\":1} --><\/p>\n<h1 class=\"wp-block-heading\">From foundations to architecture<\/h1>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>To move from subjective opinion to objective evidence, the system stands on the shoulders of giants, synthesising established psychometrics like the\u00a0<a href=\"https:\/\/psycnet.apa.org\/doiLanding?doi=10.1037%2Ft05532-000\"><strong>Torrance Tests of Creative Thinking<\/strong><\/a>\u00a0<strong>(TTCT)<\/strong>\u00a0with industry-proven frameworks. The challenge lies in translation: turning these theoretical foundations into an autonomous agent capable of\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/2504.15784\">automated creativity evaluation<\/a>\u00a0with human-like nuance and explainable logic supported by a\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2310.08433\">confederacy of specialised models<\/a>.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>The multi-agent ecosystem<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Rather than relying on a single monolithic judge, the system orchestrates a specialised &#8220;squad&#8221; of sub-agents, <strong>each one encoding a distinct evaluation technique from creativity science or brand strategy.<\/strong>\u00a0Some measure the\u00a0<strong>quality of the output<\/strong>\u00a0\u2014 how effective, original, and strategically durable the idea is. Others measure the\u00a0<strong>quality of the thinking<\/strong>\u00a0\u2014 how expansive, surprising, and culturally grounded the generative process behind it is. 
Together, they cover the full spectrum from practical effectiveness to creative cognition.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":true} --><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th><strong>Agent<\/strong><\/th>\n<th><strong>What It Measures<\/strong><\/th>\n<th><strong>Grounded In<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Effectiveness (WPP)<\/td>\n<td>Does the idea work as a campaign? How sharply framed, how boldly inspired, how relevant, how impactful?<\/td>\n<td>Proprietary WPP creativity framework built around four dimensions of creative excellence, calibrated against real campaign performance across multiple brands and markets. Inspired by <a href=\"https:\/\/hbr.org\/2024\/04\/what-makes-some-ads-so-powerful\">research on inspiration published in\u00a0Harvard Business Review<\/a>, the framework evaluates the &#8220;DNA of what makes ideas inspiring.&#8221;<\/td>\n<\/tr>\n<tr>\n<td>Generative Flow (FFE)<\/td>\n<td>How broad and varied is the thinking? 
Does the idea explore multiple formats, categories, and angles?<\/td>\n<td>FFE\u00a0(Fluency, Flexibility, Elaboration) dimensions from creativity research, benchmarked against\u00a0marketing creativity datasets.<\/td>\n<\/tr>\n<tr>\n<td>Divergent Creativity (UOS)<\/td>\n<td>Is the idea useful, original, and surprising?<\/td>\n<td>UOS (Usefulness, Originality, Surprise) framework from divergent thinking literature.<\/td>\n<\/tr>\n<tr>\n<td>Creative Strategist (UUU)<\/td>\n<td>Is the idea unique, unexpected, and unforgettable enough to endure?<\/td>\n<td>UUU (Unique, Unexpected, Unforgettable) brand longevity assessment, utilising\u00a0multi-domain evaluation techniques.<\/td>\n<\/tr>\n<tr>\n<td>Conceptual Distance (OSCAI)<\/td>\n<td>How far apart are the connected concepts? Distinguishes mundane links from highly original leaps.<\/td>\n<td><a href=\"https:\/\/openscoring.du.edu\/ocsai\">OSCAI &#8211; LLM Scoring | Open Creativity Scoring<\/a><\/td>\n<\/tr>\n<tr>\n<td>Semiotics<\/td>\n<td>How is meaning constructed through cultural symbols? Is the creative execution aligned with the intended brand message?<\/td>\n<td><a href=\"https:\/\/en.wikipedia.org\/wiki\/Semiotics#Saussure\">Saussurean sign systems<\/a>\u00a0and\u00a0<a href=\"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/092137400001200102\">cultural logic<\/a>.<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 1: Scores description<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:separator --><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>How the agents work: scoring architecture and self-correction<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Every sub-agent is governed by the same rigorous four-part prompt architecture. 
Each is anchored by a specialised\u00a0<strong>Role<\/strong>\u00a0that encodes its evaluation lens, followed by a granular\u00a0<strong>Definition of Score<\/strong>\u00a0that translates abstract dimensions into measurable benchmarks. The core logic is driven by precise\u00a0<strong>Instructions<\/strong>\u00a0calibrated against\u00a0<strong>few-shot examples<\/strong>\u00a0drawn from a ground truth library of ideas and historical scores. This architecture feeds into a\u00a0<strong>Critic-Refiner<\/strong>\u00a0cycle, where specialised Refiner agents challenge initial assessments and resolve contradictions. The refinement doesn&#8217;t trigger on every run, but its presence is deliberate: we observed during evaluation that LLMs had a tendency towards optimism in scoring, and this self-correction layer ensures the final output remains robust and consistent.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>In other words, the agent descriptions above define\u00a0<strong>what<\/strong>\u00a0each agent evaluates; the shared architecture is\u00a0<strong>how<\/strong>\u00a0they all do it.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:image {\"id\":1739,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} --><\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"502\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture-1024x502.png\" alt=\"\" class=\"wp-image-1739\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture-1024x502.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture-300x147.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture-768x377.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture-1536x754.png 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture.png 1616w\" sizes=\"auto, 
(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 3: The architecture of the Creativity Evaluation Agent.<\/figcaption><\/figure>\n<p><!-- \/wp:image --><\/p>\n<p><!-- wp:heading {\"level\":1} --><\/p>\n<h1 class=\"wp-block-heading\"><strong>Aligning sub-agents with human intuition<\/strong><\/h1>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Before trusting the system, we needed to prove it thinks like a human Creative Director. We assembled a\u00a0<strong>Ground Truth dataset<\/strong>\u00a0in collaboration with WPP creative professionals: 20 campaign ideas, each captured as a title and description, scored by consensus using the WPP effectiveness score. We then ran each idea through our evaluation pipeline across three frontier models and measured the\u00a0<strong>prediction error<\/strong>.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":true} --><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th><strong>Model<\/strong><\/th>\n<th><strong>Avg. Error (Agent vs. Human Scores)<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Gemini 2.5<\/td>\n<td>2.2<\/td>\n<\/tr>\n<tr>\n<td>Claude Sonnet 4.5<\/td>\n<td>1.0<\/td>\n<\/tr>\n<tr>\n<td><strong>Gemini 3<\/strong><\/td>\n<td><strong>0.7<\/strong><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 2: Model comparison<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><a href=\"https:\/\/deepmind.google\/models\/gemini\/\">Gemini 3<\/a> emerged as the definitive choice and was deployed across all sub-agents. We also validated\u00a0<strong>repeatability<\/strong>: the same idea, scored across independent runs, held stable with low standard deviation across all frameworks. 
The WPP score was the most precise signal, but UOS, FFE, and UUU all showed the same core reliability.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:image {\"id\":1740,\"sizeSlug\":\"full\",\"linkDestination\":\"none\"} --><\/p>\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"960\" height=\"732\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/errors.png\" alt=\"\" class=\"wp-image-1740\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/errors.png 960w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/errors-300x229.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/errors-768x586.png 768w\" sizes=\"auto, (max-width: 960px) 100vw, 960px\" \/><figcaption class=\"wp-element-caption\">Figure 4: error distributions of the creative evaluation agent when powered by Claude Sonnet 4.5, Gemini 2.5 and Gemini 3.<\/figcaption><\/figure>\n<p><!-- \/wp:image --><\/p>\n<p><!-- wp:quote --><\/p>\n<blockquote class=\"wp-block-quote\"><p><!-- wp:paragraph --><\/p>\n<p><strong>The takeaway:<\/strong>\u00a0our benchmarks are grounded in data, not AI randomness. Repeatability is a measured property of this system and not an assumption.<\/p>\n<p><!-- \/wp:paragraph --><\/p><\/blockquote>\n<p><!-- \/wp:quote --><\/p>\n<p><!-- wp:heading {\"level\":1} --><\/p>\n<h1 class=\"wp-block-heading\"><strong>Case study<\/strong><\/h1>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>With the agent LLM validated, we moved to a real-world stress test for a furniture company. This challenge asked different LLMs to tackle a nuanced brief rooted in a specific cultural tension: North Asian families with millennial parents and preteen children are living under the same roof but feeling worlds apart \u2014 clutter, gaming, and conflicting needs for &#8220;me time&#8221; vs. &#8220;we time&#8221; are eroding family connection. 
The brief demanded ideas that were cross-market, minimal-dialogue, and, crucially,\u00a0<strong>not too warm or safe<\/strong>.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Each LLM&#8217;s output was run through the full evaluation ecosystem. Every sub-agent independently scored the idea along its respective dimension, and the resulting sub-scores were\u00a0<strong>summed into a single composite total<\/strong>.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>To generate the ideas, we used both\u00a0<strong>standalone frontier LLMs<\/strong>\u00a0(<a href=\"https:\/\/openai.com\/index\/introducing-gpt-5\/\">GPT-5<\/a>, <a href=\"https:\/\/deepmind.google\/models\/gemini\/\">Gemini 3<\/a>, <a href=\"https:\/\/www.anthropic.com\/news\/claude-sonnet-4-5\">Claude Sonnet 4.5<\/a>) and\u00a0<a href=\"https:\/\/www.wpp.com\/en\/news\/2026\/01\/wpp-launches-agent-hub-on-wpp-open-providing-clients-with-access-to-advanced-agentic-ai\"><strong>WPP&#8217;s Creative Brain<\/strong><\/a>, a multi-agent ideation system available through WPP Open&#8217;s Agent Hub that wraps an LLM in a structured creative process, guiding it through strategic reframing, lateral thinking, and iterative refinement before producing a final concept. Testing Creative Brain alongside standalone models reveals how much of creative quality comes from the model itself versus the orchestration around it.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Because the raw evaluation pillars operate on different scales (WPP uses a 0\u201312 sum, FFE averages to 0\u20133, while UOS and UUU average to 0\u20135), direct comparison across pillars would have been misleading. 
All scores were normalised to a common\u00a0<strong>0\u201310 range<\/strong>\u00a0so that each pillar contributes equally to a\u00a0<strong>maximum total of 40<\/strong>.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Two scoring sub-agents, OSCAI and Semiotics, were excluded from the evaluation because of their high <strong>scoring variance<\/strong> (see Section 3.6 of the technical report).<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>1. GPT-5: the gamification grandmaster<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>GPT-5 led the pack with a total score of\u00a0<strong>37.44<\/strong>. Its winning strategy,\u00a0<strong>&#8220;Co\u2011op Mode: Make Home Happen,&#8221;<\/strong>\u00a0reframed the entire experience of home life as a collaborative video game the whole family plays together. The central insight: small home changes don&#8217;t stick because they don&#8217;t feel\u00a0<em>rewarding<\/em>, but games do. By turning clutter into the &#8220;boss&#8221; and the furniture company\u2019s solutions into unlockable &#8220;side quests,&#8221; GPT-5 built an end-to-end ecosystem that made organisation feel joyful rather than obligatory.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":true} --><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th><strong>Pillar<\/strong><\/th>\n<th><strong>Initiative<\/strong><\/th>\n<th><strong>Description &amp; Impact<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>1. 
Launch Film<\/strong><\/td>\n<td><strong>&#8220;Clutter is the Boss&#8221;<\/strong><\/td>\n<td>A 60s hero film where a family&#8217;s living room transforms into a co\u2011op game interface\u2014prompts like\u00a0<strong>&#8220;Inventory Full&#8221;<\/strong>\u00a0and\u00a0<strong>&#8220;Find Calm&#8221;<\/strong>\u00a0appear as they &#8220;equip&#8221; products to defeat the clutter boss. No spoken lines, region-agnostic SFX.\u00a0<em>Reframes organisation as play; makes the campaign cross-market viable without localisation.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>2. Social\/UGC<\/strong><\/td>\n<td><strong>&#8220;Home Side Quests&#8221;<\/strong><\/td>\n<td>Weekly micro-challenges (under 10 min) like &#8220;Create a charging dock&#8221; or &#8220;Flip sofa to study.&#8221; Augmented Reality (AR) filters add level meters and badges; families post before\/afters to the\u00a0<strong>#CoOpHome Challenge<\/strong>\u00a0with creator duets from gaming and parent influencers.\u00a0<em>Drives sustained engagement and organic reach through participatory content.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>3. Retail\/Shoppable<\/strong><\/td>\n<td><strong>&#8220;Co\u2011op Kits&#8221;<\/strong><\/td>\n<td>Curated in-store and online bundles organised by\u00a0<em>mission<\/em>\u2014Study Calm, Party Fast Reset, Balcony Green Break\u2014each bundling storage + lighting + organisers. Every kit includes a QR\u00a0<strong>&#8220;Quest Card&#8221;<\/strong>\u00a0guiding micro-steps with estimated time saved.\u00a0<em>Turns the purchase into the start of a new quest; bridges content to commerce.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>4. In-Store Experience<\/strong><\/td>\n<td><strong>&#8220;Demo Levels&#8221;<\/strong><\/td>\n<td>Stores host timed tidy-up challenges where kids and parents compete together to reorganise a mock room against the clock. 
Winners earn collectible stickers and discount codes.\u00a0<em>Transforms the retail visit into an extension of the campaign&#8217;s game logic; drives foot traffic through experiential play.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>5. Digital Ads<\/strong><\/td>\n<td><strong>&#8220;Skip the Clutter&#8221;<\/strong><\/td>\n<td>YouTube bumpers using skip-button logic\u2014&#8220;Skip the clutter in 3\u20262\u20261&#8221;\u2014to land a single, punchy product solve within seconds.\u00a0<em>Leverages ad format mechanics as a creative device; delivers product utility in pre-roll.<\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 3: Campaign idea produced by GPT-5.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The campaign&#8217;s minimal-dialogue, SFX-driven creative approach ensured cross-market viability across North Asian markets without localisation, while the consistent game language (&#8220;side quests,&#8221; &#8220;boss,&#8221; &#8220;level up,&#8221; &#8220;equip&#8221;) created a unified system that earned a perfect coherence score. Every asset closed with the same recontextualised tagline,\u00a0<em>&#8220;Furniture company. Make Home Happen.&#8221;<\/em>, framed not as an aspiration but as a mission objective.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>2. Creative Brain powered by Gemini: the multiverse architect<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The Creative Brain powered by Gemini followed closely with a score of\u00a0<strong>36.95<\/strong>. Its\u00a0<strong>&#8220;Room-Sync Chronicles&#8221;<\/strong>\u00a0campaign took a fundamentally different approach: rather than gamifying the\u00a0<em>solution<\/em>, it gamified the\u00a0<em>problem<\/em>. 
The central reframe, that every piece of &#8220;clutter&#8221; in a preteen&#8217;s bedroom is actually a physical anchor for their digital and imagined worlds, gave parents a radically empathetic lens through which to see their child&#8217;s space. The furniture company&#8217;s storage wasn&#8217;t positioned as a way to &#8220;hide the mess&#8221; but as a way to &#8220;power the multiverse.&#8221;<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":true} --><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th><strong>Pillar<\/strong><\/th>\n<th><strong>Initiative<\/strong><\/th>\n<th><strong>Description &amp; Impact<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>1. Launch Film<\/strong><\/td>\n<td><strong>&#8220;Reality Reboot&#8221;<\/strong><\/td>\n<td>A cinematic 60s spot where a mother opens her son&#8217;s door and sees a mess\u2014but the room\u00a0<strong>&#8220;glitches&#8221;<\/strong>\u00a0into a high-fidelity game landscape. A box unit becomes a\u00a0<strong>loot chest<\/strong>; a gaming chair becomes a\u00a0<strong>pilot&#8217;s cockpit<\/strong>. VFX borrows from game trailers.\u00a0<em>Subverts the &#8220;messy room&#8221; trope by revealing the child&#8217;s imaginative reality; makes furniture feel epic.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>2. Narrative Arc<\/strong><\/td>\n<td><strong>&#8220;Co-Op Reorganisation&#8221;<\/strong><\/td>\n<td>The film follows mother and son &#8220;co-oping&#8221; a reorganisation \u2014 but the goal isn&#8217;t to\u00a0<em>clean<\/em>. It&#8217;s to\u00a0<strong>&#8220;optimise the map&#8221;<\/strong>\u00a0for his next quest. Storage becomes power-ups that expand the room&#8217;s modes: gaming arena, creative studio, family bonding zone.\u00a0<em>Reframes tidying as a collaborative strategic upgrade, not a parental demand.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>3. 
Social\/AR<\/strong><\/td>\n<td><strong>&#8220;Skin Your Room&#8221;<\/strong><\/td>\n<td>A social platform where kids apply AR filters to their real rooms, overlaying fantasy skins that reveal the\u00a0<strong>&#8220;epic reality&#8221;<\/strong>\u00a0hidden behind the furniture. Kids share their &#8220;skinned&#8221; rooms with parents to bridge the perception gap.<\/td>\n<\/tr>\n<tr>\n<td><strong>4. Brand Positioning<\/strong><\/td>\n<td><strong>&#8220;Powering the Multiverse&#8221;<\/strong><\/td>\n<td>Storage is repositioned as an\u00a0<strong>identity enabler<\/strong>\u00a0for developing preteens \u2014 a tool that supports creativity, gaming, and emerging selfhood. Storage doesn&#8217;t organise a room; it powers the multiverse being built inside it.\u00a0<em>Elevates product value proposition from functional to emotional and developmental.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>5. Strategic Inversion<\/strong><\/td>\n<td><strong>Child-First Perspective<\/strong><\/td>\n<td>The campaign validates the preteen&#8217;s perspective\u00a0<em>first<\/em>, then invites the parent in, inverting the typical home-brand default of centering the adult buyer&#8217;s desire for order.\u00a0<em>Boldest strategic choice in the cohort; highest risk, highest differentiation.<\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 4: Campaign idea produced by Creative Brain.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Underlying the entire campaign was a provocative strategic choice that the evaluator flagged as both its greatest strength and its primary risk: centering the child&#8217;s worldview over the parent&#8217;s. Where most home brands default to the adult buyer&#8217;s desire for order, Room-Sync starts from the preteen&#8217;s experience and invites the parent to see through their eyes. 
This inversion earned it the cohort&#8217;s highest Unforgettable score, but the Creative Evaluation Agent noted the heavy RPG metaphor might alienate parents unfamiliar with gaming culture.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:separator --><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>3. Claude Sonnet 4.5: the reality show provocateur<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Claude Sonnet 4.5 followed with a score of\u00a0<strong>35.12<\/strong>. Its\u00a0<strong>&#8220;The Remodel Squad&#8221;<\/strong>\u00a0strategy took the most grounded, human-first approach of the cohort. Where GPT-5 and Gemini leaned into fantasy and gamification, Claude leaned into\u00a0<em>documentary authenticity,<\/em> positioning real family friction not as a problem to be solved but as the raw material for genuine connection. The core creative bet: audiences are tired of aspirational perfection and will respond to the messy, funny, emotional truth of families actually trying to share space.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":true} --><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th><strong>Pillar<\/strong><\/th>\n<th><strong>Initiative<\/strong><\/th>\n<th><strong>Description &amp; Impact<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>1. Launch Film<\/strong><\/td>\n<td><strong>&#8220;The Beautiful Mess&#8221;<\/strong><\/td>\n<td>A 60s hero spot showing a parent-child duo hilariously debating the furniture company\u2019s storage solutions, arguing over shelf heights, drawer labels, and who gets the corner nook. The punchline: they both want the same thing.\u00a0<em>Designed to feel like a documentary moment, not an ad; builds instant relatability.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>2. 
Content Series<\/strong><\/td>\n<td><strong>&#8220;The Remodel Squad&#8221;<\/strong><\/td>\n<td>6\u20138 episodes (5\u20138 min each) where real families nominate the room causing the most conflict. One preteen + one parent form a &#8220;Remodel Squad&#8221; to transform it using the company\u2019s solutions but\u00a0<strong>they must agree on every single decision.<\/strong>\u00a0Captures hilarious negotiations, compromises, and breakthroughs.\u00a0<em>Friction becomes the creative fuel; the format is inherently dramatic and bingeable.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>3. Episode Structure<\/strong><\/td>\n<td><strong>Three-Act Arc<\/strong><\/td>\n<td>Each episode follows:\u00a0<strong>(1)<\/strong>\u00a0the conflict audit (what&#8217;s wrong, who&#8217;s to blame),\u00a0<strong>(2)<\/strong>\u00a0the design negotiation (friction-fueled creative process),\u00a0<strong>(3)<\/strong>\u00a0the heartwarming reveal \u2014 not a &#8220;ta-da&#8221; moment, but the family&#8217;s first\u00a0<em>natural<\/em>\u00a0interaction in the new space.\u00a0<em>Emotional payoff is relational, not just spatial.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>4. Social\/Viral<\/strong><\/td>\n<td><strong>&#8220;15-Second Cutdowns&#8221;<\/strong><\/td>\n<td>Bite-sized content isolating the series&#8217; most relatable moments: funniest negotiation standoffs, most dramatic before\/afters, quiet breakthrough moments. Designed as standalone viral units that drive viewership back to full episodes.\u00a0<em>Engineered for shareability across short-form platforms.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>5. Interactive Tool<\/strong><\/td>\n<td><strong>&#8220;Conflict Zone&#8221; AR App<\/strong><\/td>\n<td>Families scan their own problem rooms and collaboratively visualise the company\u2019s solutions in situ. Families can tag their room&#8217;s &#8220;conflict level&#8221; and share proposed redesigns. 
Emphasis on\u00a0<strong>joint decision-making<\/strong>\u00a0over individual play.\u00a0<em>Extends the show&#8217;s premise into every family&#8217;s home; sparks &#8220;productive arguments.&#8221;<\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 5: Campaign idea produced by Claude Sonnet 4.5.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The campaign&#8217;s greatest strength was its emotional granularity. Rather than offering a single visual payoff, each episode promised a different family, a different room, and a different set of negotiations \u2014 creating a content engine with built-in variety and repeatability. The Creativity Evaluation Agent awarded it the cohort&#8217;s highest Usefulness score, for its practical alignment with the brief&#8217;s tone requirements, but noted that reality-renovation formats carry inherent category familiarity, reflected in its lower Uniqueness score. Every piece of content closed with the family in their transformed space and the line:\u00a0<em>&#8220;Make Home Happen.&#8221;<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:separator --><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>4. Gemini 3: the diplomatic provocateur<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Gemini 3 rounded out the cohort with a score of\u00a0<strong>32.48<\/strong>. Its\u00a0<strong>&#8220;The Domestic Peace Accords&#8221;<\/strong>\u00a0took the boldest\u00a0<em>tonal<\/em>\u00a0swing of the group, making a singular creative bet: position the furniture company\u2019s products as essential diplomatic tools to resolve the &#8220;cold war&#8221; between generations, executed entirely through the visual grammar of geopolitical thrillers. 
Where other entries built broad ecosystems, this idea invested everything in the power of one perfectly realised metaphor.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":true} --><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th><strong>Pillar<\/strong><\/th>\n<th><strong>Initiative<\/strong><\/th>\n<th><strong>Description &amp; Impact<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>1. Core Metaphor<\/strong><\/td>\n<td><strong>&#8220;Furniture as Diplomacy&#8221;<\/strong><\/td>\n<td>The parent-preteen conflict is reframed as a genuine\u00a0<strong>geopolitical standoff<\/strong> \u2014 a &#8220;cold war&#8221; between factions with irreconcilable demands (minimalist calm vs. messy independence) over limited territory (square footage). The products are repositioned as\u00a0<strong>&#8220;diplomatic tools&#8221;<\/strong>\u00a0that broker peace.\u00a0<em>Elevates a mundane domestic problem to dramatic, absurd, memorable heights.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>2. Launch Film Series<\/strong><\/td>\n<td><strong>&#8220;The Negotiations&#8221;<\/strong><\/td>\n<td>Spots filmed in the style of high-stakes political thrillers, <em>Tinker Tailor Soldier Spy<\/em>\u00a0meets a bookcase. Parent and preteen sit at opposite ends of a long table in a\u00a0<strong>dim, dramatically lit room<\/strong>, sliding &#8220;terms&#8221; across: a pegboard for gaming gear\u00a0<strong>in exchange for<\/strong>\u00a0a clean floor; a sound-absorbing curtain for privacy\u00a0<strong>in exchange for<\/strong>\u00a0family dinner attendance.\u00a0<em>Every product is a bargaining chip with a story.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>3. 
Visual Payoff<\/strong><\/td>\n<td><strong>&#8220;Treaty Signed&#8221;<\/strong><\/td>\n<td>Each spot resolves in a single, sharp cut: the dim negotiation room gives way to a\u00a0<strong>bright, airy, reorganised living space<\/strong>\u00a0where both parties co-exist happily. The tonal whiplash from spy-thriller gravity to domestic warmth\u00a0<em>is<\/em>\u00a0the joke.\u00a0<em>Designed to be the defining shareable moment audiences remember and recount.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>4. Product as Plot<\/strong><\/td>\n<td><strong>Narrative Integration<\/strong><\/td>\n<td>Unlike campaigns where products are set dressing, every item functions as a\u00a0<strong>narrative object,<\/strong> a concession, a peace offering, a treaty clause. The pegboard isn&#8217;t &#8220;organised storage&#8221;; it&#8217;s the term that bought a clean floor. The bin is the clause that secured family movie night.\u00a0<em>Gives each product a story and a reason for being that transcends traditional placement.<\/em><\/td>\n<\/tr>\n<tr>\n<td><strong>5. Tagline Reframe<\/strong><\/td>\n<td><strong>&#8220;Make Home Happen&#8221; as Treaty<\/strong><\/td>\n<td>The company\u2019s existing tagline is repositioned not as an aspiration but as the\u00a0<strong>terms of a negotiated truce,<\/strong> smart organisation that lets parents reclaim visual calm while granting preteens the &#8220;cool functional territory&#8221; they demand.\u00a0<em>Breathes new strategic life into existing brand language.<\/em><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 6: Campaign idea produced by Gemini 3.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The Domestic Peace Accords had the strongest semiotic coherence &#8211; every element, from language (&#8220;cold war,&#8221; &#8220;treaty,&#8221; &#8220;terms&#8221;) to visual style (dim thriller lighting vs. 
bright domestic reveal) to product role (bargaining chips), reinforced one unified meaning system without contradiction. It also scored the highest Unexpected rating. However, its singular focus proved to be a double-edged sword: by investing entirely in one metaphor executed through one format (film spots), it presented no secondary executions, platforms, or conceptual categories, pulling its total FFE score down.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:separator --><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>The top contenders<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":true} --><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th><strong>Model<\/strong><\/th>\n<th><strong>WPP score<\/strong><\/th>\n<th><strong>FFE score<\/strong><\/th>\n<th><strong>UOS score<\/strong><\/th>\n<th><strong>UUU score<\/strong><\/th>\n<th><strong>Total score<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>GPT-5<\/strong><\/td>\n<td><strong>9.50<\/strong><\/td>\n<td><strong>10.00<\/strong><\/td>\n<td><strong>9.60<\/strong><\/td>\n<td><strong>8.34<\/strong><\/td>\n<td><strong>37.44<\/strong><\/td>\n<\/tr>\n<tr>\n<td>Creative Brain (Gemini 3)<\/td>\n<td>9.21<\/td>\n<td>10.00<\/td>\n<td>9.40<\/td>\n<td>8.34<\/td>\n<td>36.95<\/td>\n<\/tr>\n<tr>\n<td>Claude Sonnet 4.5<\/td>\n<td>8.92<\/td>\n<td>10.00<\/td>\n<td>8.60<\/td>\n<td>7.60<\/td>\n<td>35.12<\/td>\n<\/tr>\n<tr>\n<td>Gemini 3<\/td>\n<td>8.75<\/td>\n<td>6.67<\/td>\n<td>9.20<\/td>\n<td>7.86<\/td>\n<td>32.48<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 7: Normalised pillar scores (0\u201310 each) and total scores (out of 40) of the four contenders for the furniture company case study.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph 
--><\/p>\n<p>Three of four models scored a perfect FFE (10.00), meaning raw creative thinking was comparable across the board. The separation came from WPP Score and UOS.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>GPT-5<\/strong>\u00a0posted the highest WPP (9.50). The WPP Agent cited &#8220;multiple direct pathways to purchase and engagement&#8221; and noted &#8220;exceptional focus on the stated business challenge.&#8221; The UOS Agent awarded 9.60: &#8220;meticulously crafted, creatively addressing every aspect of the client&#8217;s brief with seamless logical flow.&#8221;<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Creative Brain<\/strong>\u00a0matched GPT-5 on FFE (10.00) and UUU (8.34). The UUU Agent noted &#8220;the core visual of the room &#8216;glitching&#8217; into an RPG world creates an incredibly strong and distinct defining moment.&#8221; The WPP gap (9.21 vs. 9.50) traced to a coherence flag: &#8220;the heavy reliance on gaming metaphors might alienate or confuse parents who are not immersed in digital culture.&#8221;<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Claude Sonnet 4.5<\/strong>\u00a0earned the cohort&#8217;s highest Usefulness score \u2014 the UOS Agent praised its &#8220;practical alignment with the client&#8217;s brief, particularly in embracing realistic family conflict rather than a &#8216;too warm or safe&#8217; tone.&#8221; The UUU Agent observed it &#8220;takes a common format (reality renovation show) and infuses it with a fresh twist&#8221; \u2014 but the format itself limited differentiation.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Gemini 3<\/strong>\u00a0earned the highest Unexpected rating \u2014 the UUU Agent called the &#8220;juxtaposition of the mundane struggle for space with the gravitas of political thriller negotiations a brilliant flip.&#8221; But the FFE Agent recorded 
zero Flexibility: &#8220;a single, unified marketing campaign concept&#8221; with &#8220;no multiple distinct conceptual categories,&#8221; and the WPP Agent noted it &#8220;doesn&#8217;t create a new utility or platform.&#8221;<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:separator --><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator --><\/p>\n<p><!-- wp:heading {\"level\":1} --><\/p>\n<h1 class=\"wp-block-heading\"><strong>The creative Elo tournament<\/strong><\/h1>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>A single brief can&#8217;t tell us which model is\u00a0<em>consistently<\/em>\u00a0creative. To answer that, we expanded the experiment: each model was given\u00a0<strong>multiple diverse briefs<\/strong>\u00a0spanning different brands, categories, and creative challenges, and every output was scored by the same evaluation pipeline.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>We needed a ranking system that captured consistency against competition \u2014 not just average scores. An Elo rating is a numerical score that reflects a competitor&#8217;s relative skill based purely on head-to-head outcomes. The higher the rating, the stronger the performer. We turned to\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Glicko_rating_system\"><strong>Glicko-2<\/strong><\/a>, the rating algorithm used in competitive chess, CS:GO, and Dota 2. Every head-to-head match is a data point: if idea A beats idea B, A gains rating and B loses it. 
Glicko-2 also tracks\u00a0<strong>rating deviation (RD)<\/strong>, a measure of the uncertainty in each rating that shrinks as more matches are played.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>The players<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\">\n<!-- wp:list-item --><\/p>\n<li><strong>GPT-5<\/strong><\/li>\n<p><!-- \/wp:list-item --><br \/>\n<!-- wp:list-item --><\/p>\n<li><strong>Gemini 3<\/strong><\/li>\n<p><!-- \/wp:list-item --><br \/>\n<!-- wp:list-item --><\/p>\n<li><strong>Gemini 2.5<\/strong><\/li>\n<p><!-- \/wp:list-item --><br \/>\n<!-- wp:list-item --><\/p>\n<li><strong>Claude Sonnet 4.5<\/strong><\/li>\n<p><!-- \/wp:list-item --><br \/>\n<!-- wp:list-item --><\/p>\n<li><strong>Creative Brain<\/strong>\u00a0\u2014 built on Gemini 3 with an optimised prompting architecture<\/li>\n<p><!-- \/wp:list-item -->\n<\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The four standalone models received a standardised prompt. The\u00a0<strong>Creative Brain<\/strong>\u00a0received the same brief but processed it through its own multi-agent ideation pipeline, testing whether orchestration outperforms raw model capability.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The\u00a0<strong>Creativity Evaluation Agent<\/strong>\u00a0judged every idea independently. An orchestration engine simulated head-to-head matches by comparing normalised scores for the same brief. The winner of each match is the idea with the higher normalised composite score on the metrics common to both. 
The result:\u00a0<strong>210 unique creative matches<\/strong>\u00a0across 5 models and 14 global brands, run over 3 iterations.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>The results<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Ranked across WPP Score<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":true} --><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th><strong>Rank<\/strong><\/th>\n<th><strong>Player<\/strong><\/th>\n<th><strong>Rating<\/strong><\/th>\n<th><strong>RD<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\ud83e\udd47<\/td>\n<td><strong>Creative Brain<\/strong>\u00a0(Gemini 3)<\/td>\n<td>1889<\/td>\n<td>84.8<\/td>\n<\/tr>\n<tr>\n<td>\ud83e\udd48<\/td>\n<td>GPT-5<\/td>\n<td>1858<\/td>\n<td>92.3<\/td>\n<\/tr>\n<tr>\n<td>\ud83e\udd49<\/td>\n<td>Claude Sonnet 4.5<\/td>\n<td>1529<\/td>\n<td>83.9<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>Gemini 3<\/td>\n<td>1169<\/td>\n<td>90.7<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td>Gemini 2.5<\/td>\n<td>962<\/td>\n<td>104.2<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 8: Elo ratings on the WPP score.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Ranked across all frameworks<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":true} --><\/p>\n<figure class=\"wp-block-table\">\n<table class=\"has-fixed-layout\">\n<thead>\n<tr>\n<th><strong>Rank<\/strong><\/th>\n<th><strong>Player<\/strong><\/th>\n<th><strong>Rating<\/strong><\/th>\n<th><strong>RD<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\ud83e\udd47<\/td>\n<td>GPT-5<\/td>\n<td>1940<\/td>\n<td>84.3<\/td>\n<\/tr>\n<tr>\n<td>\ud83e\udd48<\/td>\n<td><strong>Creative Brain<\/strong>\u00a0(Gemini 
3)<\/td>\n<td>1927<\/td>\n<td>79.2<\/td>\n<\/tr>\n<tr>\n<td>\ud83e\udd49<\/td>\n<td>Claude Sonnet 4.5<\/td>\n<td>1378<\/td>\n<td>77.4<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>Gemini 3<\/td>\n<td>1216<\/td>\n<td>86.6<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td>Gemini 2.5<\/td>\n<td>861<\/td>\n<td>106.3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 9: Elo ratings across all frameworks.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Creative Brain leads on WPP criteria<\/strong>, delivering a measurable advantage when judged against industry-specific creative standards. The gap between Creative Brain and standalone Gemini 3 is\u00a0<strong>720 rating points<\/strong> on the WPP leaderboard (1889 vs. 1169) and 711 points across all frameworks (1927 vs. 1216).\u00a0<strong>GPT-5 is the strongest all-rounder<\/strong>, topping the all-frameworks leaderboard with ideas that score well across the broadest range of creative dimensions, although its 13-point lead over Creative Brain (1940 vs. 1927) is well within both players&#8217; rating deviations.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:separator --><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator --><\/p>\n<p><!-- wp:heading {\"level\":1} --><\/p>\n<h1 class=\"wp-block-heading\"><strong>Key insights &amp; findings<\/strong><\/h1>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\">\n<!-- wp:list-item --><\/p>\n<li><strong>Creative Brain dramatically outperforms standalone Gemini 3.<\/strong>\u00a0Same underlying model, yet a gap of roughly 710\u2013720 rating points depending on the leaderboard. The structured ideation process consistently elevated creative output beyond what Gemini 3 could produce alone. 
Notably, Creative Brain was not optimised against the WPP scoring criteria \u2014 its strong performance emerged naturally from a better creative process.<\/li>\n<p><!-- \/wp:list-item --><br \/>\n<!-- wp:list-item --><\/p>\n<li><strong>The right judge model is foundational.<\/strong>\u00a0As Table 2 and Figure 4 show, Gemini 2.5 averaged an error of 2.2 against the human ground truth; Claude Sonnet 4.5 narrowed it to 1.0; Gemini 3 achieved 0.7. The choice of judge model determines whether the entire system tracks human judgment or drifts from it.<\/li>\n<p><!-- \/wp:list-item --><br \/>\n<!-- wp:list-item --><\/p>\n<li><strong>Human judgment remains essential at the margins.<\/strong>\u00a0When total scores differ by less than a point, as with GPT-5 (37.44) vs. Creative Brain (36.95) in the case study, the agent surfaces meaningfully different trade-offs that scores alone cannot resolve. The system&#8217;s value lies in ensuring the right ideas and evidence reach the table, not in replacing human judgment.<\/li>\n<p><!-- \/wp:list-item --><br \/>\n<!-- wp:list-item --><\/p>\n<li><strong>Evaluator reliability is a measured property, not an assumption.<\/strong>\u00a0Across repeated independent runs on the same ideas, scores held stable with low standard deviation. 
This repeatability is what allows every other finding to be treated as signal rather than noise.<\/li>\n<p><!-- \/wp:list-item -->\n<\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:separator --><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator --><\/p>\n<p><!-- wp:heading {\"level\":1} --><\/p>\n<h1 class=\"wp-block-heading\"><strong>Conclusion &amp; impact<\/strong><\/h1>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>This work demonstrates that\u00a0<strong>scalable, repeatable creative evaluation using LLMs is practical today<\/strong>, provided the system is built with the right scaffolding: calibrated judge models, few-shot anchoring, multi-dimensional scoring, and cross-model validation.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>In practice, the Creativity Evaluation Agent enables:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\">\n<!-- wp:list-item --><\/p>\n<li><strong>Faster iteration<\/strong>\u00a0\u2014 stress-test dozens of creative directions in minutes rather than weeks, before committing production budgets.<\/li>\n<p><!-- \/wp:list-item --><br \/>\n<!-- wp:list-item --><\/p>\n<li><strong>Comparable benchmarking<\/strong>\u00a0\u2014 evaluate models, prompting strategies, and agentic architectures on common ground with a shared, reproducible rubric.<\/li>\n<p><!-- \/wp:list-item --><br \/>\n<!-- wp:list-item --><\/p>\n<li><strong>Diagnosable feedback<\/strong>\u00a0\u2014 learn not just\u00a0<em>that<\/em>\u00a0an idea underperformed, but\u00a0<em>where<\/em>\u00a0and\u00a0<em>why<\/em>, with dimension-level scores and qualitative commentary that teams can act on immediately.<\/li>\n<p><!-- \/wp:list-item --><br \/>\n<!-- wp:list-item --><\/p>\n<li><strong>Creative governance<\/strong>\u00a0\u2014 an auditable, explainable evaluation process that scales alongside the growing volume of AI-generated creative, giving 
organisations confidence and consistency as they adopt generative tools.<\/li>\n<p><!-- \/wp:list-item -->\n<\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:separator --><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Ready to explore the specifics? Read our full technical deep dive into the\u00a0<a href=\"https:\/\/research.wpp.com\/pods\/brand-perception-atlas-pod\">Creativity Evaluation Agent Pod<\/a> for a closer look at our methodology.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><em>Disclaimer: This content was created with AI assistance. All research and conclusions are the work of WPP Research<\/em>.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n","related_pods":[1454],"content_quarter":""},"research_categories":[],"raw_acf":{"content":"<!-- wp:paragraph -->\n<p>\"<em>Measuring the unmeasurable. Ranking the unrankable.<\/em>\"<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In the rapidly evolving landscape of Artificial Intelligence (AI), two questions remain particularly elusive and particularly consequential:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:quote -->\n<blockquote class=\"wp-block-quote\"><!-- wp:paragraph -->\n<p><strong>1. Can AI be truly creative?<\/strong>\u00a0<strong>2. Can we rank AI agents based on their creativity?<\/strong><\/p>\n<!-- \/wp:paragraph --><\/blockquote>\n<!-- \/wp:quote -->\n\n<!-- wp:paragraph -->\n<p>These are not rhetorical questions. They are the founding hypotheses of this project.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The first touches on something that has historically felt beyond measurement: the quality of an original idea. The second demands a fair, repeatable, and objective method for comparison across different types of players, at scale. This challenge is especially sharp in\u00a0<strong>advertising<\/strong>, where creative ideas are the currency of impact. 
Traditional evaluation relies on subjective human judgment, which can be slow, expensive, and inconsistent across reviewers. At the same time, as Large Language Models (LLMs) transform tasks from coding to translation, a critical gap has persisted:\u00a0<strong>how do we benchmark creativity rigorously?<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>At\u00a0<strong>WPP Research<\/strong>, we ran a set of experiments demonstrating that modern LLMs can act as reliable scoring agents that grade creative ideas across multiple established dimensions with measurable consistency. That finding unlocked something important: if an LLM can judge creativity, we can build a system that does so systematically and then use that system to rank which AI creates the best ideas. The\u00a0<strong>Creativity Evaluation Agent<\/strong>\u00a0is a modular, multi-agent system built at\u00a0<strong>WPP Research<\/strong>\u00a0that pursues two distinct but deeply intertwined goals:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>Goal 1: Build a scalable creative evaluation engine<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Design and deploy an AI agent capable of evaluating any marketing campaign idea automatically and consistently, across\u00a0<strong>six industry-grounded creativity frameworks<\/strong>\u00a0simultaneously. A user submits a campaign idea (text or PDF) via the web UI or API. 
Then, our multi-agent system evaluates it across all six frameworks in parallel and returns a structured report with dimension-level scores and qualitative commentary in\u00a0<strong>15\u201325 seconds<\/strong>.<br><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"id\":1737,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} -->\n<figure class=\"wp-block-image size-large\"><img src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/idea_output-1024x704.png\" alt=\"\" class=\"wp-image-1737\"\/><figcaption class=\"wp-element-caption\"><em>Figure 1: Example output for a creative idea<\/em><\/figcaption><\/figure>\n<!-- \/wp:image -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>Goal 2: Benchmark who creates the best ideas: LLMs, humans, and AI agents<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>The evaluation engine is also the foundation of a\u00a0<strong>creative benchmarking tournament<\/strong>. The second goal is to use the agent as an objective judge to measure and rank the creative output of different\u00a0<em>players<\/em>. In this exercise, the players were state-of-the-art LLMs.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"id\":1738,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} -->\n<figure class=\"wp-block-image size-large\"><img src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/creativity_eval-1024x364.png\" alt=\"\" class=\"wp-image-1738\"\/><figcaption class=\"wp-element-caption\">Figure 2: Creative benchmarking tournament<\/figcaption><\/figure>\n<!-- \/wp:image -->\n\n<!-- wp:paragraph -->\n<p>To do this rigorously, we adopted the\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Glicko_rating_system\"><strong>Glicko-2 rating system<\/strong><\/a>\u00a0(a successor to the <a href=\"https:\/\/en.wikipedia.org\/wiki\/Elo_rating_system\">Elo system<\/a> used in chess, also used in games such as <a href=\"https:\/\/www.counter-strike.net\/\">Counter Strike<\/a> and <a 
href=\"https:\/\/www.dota2.com\/home\">Dota 2<\/a>), running a round-robin tournament where each player's ideas compete head-to-head. The result is a continuously updatable\u00a0<strong>creative leaderboard<\/strong>\u00a0which ranks AI creativity in advertising.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator -->\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator -->\n\n<!-- wp:heading {\"level\":1} -->\n<h1 class=\"wp-block-heading\">From foundations to architecture<\/h1>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>To move from subjective opinion to objective evidence, the system stands on the shoulders of giants, synthesising established psychometrics like the\u00a0<a href=\"https:\/\/psycnet.apa.org\/doiLanding?doi=10.1037%2Ft05532-000\"><strong>Torrance Tests of Creative Thinking<\/strong><\/a>\u00a0<strong>(TTCT)<\/strong>\u00a0with industry-proven frameworks. The challenge lies in translation: turning these theoretical foundations into an autonomous agent capable of\u00a0<a href=\"https:\/\/arxiv.org\/pdf\/2504.15784\">automated creativity evaluation<\/a>\u00a0with human-like nuance and explainable logic supported by a\u00a0<a href=\"https:\/\/arxiv.org\/abs\/2310.08433\">confederacy of specialised models<\/a>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>The multi-agent ecosystem<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Rather than relying on a single monolithic judge, the system orchestrates a specialised \"squad\" of sub-agents, <strong>each one encoding a distinct evaluation technique from creativity science or brand strategy.<\/strong>\u00a0Some measure the\u00a0<strong>quality of the output<\/strong>\u00a0\u2014 how effective, original, and strategically durable the idea is. 
Others measure the\u00a0<strong>quality of the thinking<\/strong>\u00a0\u2014 how expansive, surprising, and culturally grounded the generative process behind it is. Together, they cover the full spectrum from practical effectiveness to creative cognition.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":true} -->\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Agent<\/strong><\/th><th><strong>What It Measures<\/strong><\/th><th><strong>Grounded In<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Effectiveness (WPP)<\/td><td>Does the idea work as a campaign? How sharply framed, how boldly inspired, how relevant, how impactful?<\/td><td>Proprietary WPP creativity framework built around four dimensions of creative excellence, calibrated against real campaign performance across multiple brands and markets. Inspired by <a href=\"https:\/\/hbr.org\/2024\/04\/what-makes-some-ads-so-powerful\">research on inspiration published in\u00a0Harvard Business Review<\/a>, the framework evaluates the \"DNA of what makes ideas inspiring.\"<\/td><\/tr><tr><td>Generative Flow (FFE)<\/td><td>How broad and varied is the thinking? 
Does the idea explore multiple formats, categories, and angles?<\/td><td>FFE\u00a0(Fluency, Flexibility, Elaboration) dimensions from creativity research, benchmarked against\u00a0marketing creativity datasets.<\/td><\/tr><tr><td>Divergent Creativity (UOS)<\/td><td>Is the idea useful, original, and surprising?<\/td><td>UOS (Usefulness, Originality, Surprise) framework from divergent thinking literature.<\/td><\/tr><tr><td>Creative Strategist (UUU)<\/td><td>Is the idea unique, unexpected, and unforgettable enough to endure?<\/td><td>UUU (Unique, Unexpected, Unforgettable) brand longevity assessment, utilising\u00a0multi-domain evaluation techniques.<\/td><\/tr><tr><td>Conceptual Distance (OSCAI)<\/td><td>How far apart are the connected concepts? Distinguishes mundane links from highly original leaps.<\/td><td><a href=\"https:\/\/openscoring.du.edu\/ocsai\">OSCAI - LLM Scoring | Open Creativity Scoring<\/a><\/td><\/tr><tr><td>Semiotics<\/td><td>How is meaning constructed through cultural symbols? Is the creative execution aligned with the intended brand message?<\/td><td><a href=\"https:\/\/en.wikipedia.org\/wiki\/Semiotics#Saussure\">Saussurean sign systems<\/a>\u00a0and\u00a0<a href=\"https:\/\/journals.sagepub.com\/doi\/pdf\/10.1177\/092137400001200102\">cultural logic<\/a>.<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 1: Scores description<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator -->\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>How the agents work: scoring architecture and self-correction<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Every sub-agent is governed by the same rigorous four-part prompt architecture. 
Each is anchored by a specialised\u00a0<strong>Role<\/strong>\u00a0that encodes its evaluation lens, followed by a granular\u00a0<strong>Definition of Score<\/strong>\u00a0that translates abstract dimensions into measurable benchmarks. The core logic is driven by precise\u00a0<strong>Instructions<\/strong>\u00a0calibrated against\u00a0<strong>few-shot examples<\/strong>\u00a0drawn from a ground truth library of ideas and historical scores. This architecture feeds into a\u00a0<strong>Critic-Refiner<\/strong>\u00a0cycle, where specialised Refiner agents challenge initial assessments and resolve contradictions. The refinement doesn't trigger on every run, but its presence is deliberate: we observed during evaluation that LLMs had a tendency towards optimism in scoring, and this self-correction layer ensures the final output remains robust and consistent.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In other words, the agent descriptions above define\u00a0<strong>what<\/strong>\u00a0each agent evaluates; the shared architecture is\u00a0<strong>how<\/strong>\u00a0they all do it.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"id\":1739,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} -->\n<figure class=\"wp-block-image size-large\"><img src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/architecture-1024x502.png\" alt=\"\" class=\"wp-image-1739\"\/><figcaption class=\"wp-element-caption\">Figure 3: The architecture of the Creativity Evaluation Agent.<\/figcaption><\/figure>\n<!-- \/wp:image -->\n\n<!-- wp:heading {\"level\":1} -->\n<h1 class=\"wp-block-heading\"><strong>Aligning sub-agents with human intuition<\/strong><\/h1>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Before trusting the system, we needed to prove it thinks like a human Creative Director. 
We assembled a\u00a0<strong>Ground Truth dataset<\/strong>\u00a0in collaboration with WPP creative professionals: 20 campaign ideas, each captured as a title and description, scored by consensus using the WPP effectiveness score. We then ran each idea through our evaluation pipeline across three frontier models and measured the\u00a0<strong>prediction error<\/strong>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":true} -->\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Model<\/strong><\/th><th><strong>Avg. Error (Agent vs. Human Scores)<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Gemini 2.5<\/td><td>2.2<\/td><\/tr><tr><td>Claude Sonnet 4.5<\/td><td>1.0<\/td><\/tr><tr><td><strong>Gemini 3<\/strong><\/td><td><strong>0.7<\/strong><\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 2: Model comparison<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><a href=\"https:\/\/deepmind.google\/models\/gemini\/\">Gemini 3<\/a> emerged as the definitive choice and was deployed across all sub-agents. We also validated\u00a0<strong>repeatability<\/strong>: the same idea, scored across independent runs, held stable with low standard deviation across all frameworks. 
The WPP score was the most precise signal, but UOS, FFE, and UUU all showed the same core reliability.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"id\":1740,\"sizeSlug\":\"full\",\"linkDestination\":\"none\"} -->\n<figure class=\"wp-block-image size-full\"><img src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/errors.png\" alt=\"\" class=\"wp-image-1740\"\/><figcaption class=\"wp-element-caption\">Figure 4: error distributions of the creative evaluation agent when powered by Claude Sonnet 4.5, Gemini 2.5 and Gemini 3.<\/figcaption><\/figure>\n<!-- \/wp:image -->\n\n<!-- wp:quote -->\n<blockquote class=\"wp-block-quote\"><!-- wp:paragraph -->\n<p><strong>The takeaway:<\/strong>\u00a0our benchmarks are grounded in data, not AI randomness. Repeatability is a measured property of this system and not an assumption.<\/p>\n<!-- \/wp:paragraph --><\/blockquote>\n<!-- \/wp:quote -->\n\n<!-- wp:heading {\"level\":1} -->\n<h1 class=\"wp-block-heading\"><strong>Case study<\/strong><\/h1>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>With the agent LLM validated, we moved to a real-world stress test for a furniture company. This challenge asked different LLMs to tackle a nuanced brief rooted in a specific cultural tension: North Asian families with millennial parents and preteen children are living under the same roof but feeling worlds apart \u2014 clutter, gaming, and conflicting needs for \"me time\" vs. \"we time\" are eroding family connection. The brief demanded ideas that were cross-market, minimal-dialogue, and crucially,\u00a0<strong>not too warm or safe<\/strong>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Each LLM's output was run through the full evaluation ecosystem. 
Every sub-agent independently scored the idea along its respective dimension, and the resulting sub-scores were\u00a0<strong>summed into a single composite total<\/strong>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>To generate the ideas, we used both\u00a0<strong>standalone frontier LLMs<\/strong>\u00a0(<a href=\"https:\/\/openai.com\/index\/introducing-gpt-5\/\">GPT-5<\/a>, <a href=\"https:\/\/deepmind.google\/models\/gemini\/\">Gemini 3<\/a>, <a href=\"https:\/\/www.anthropic.com\/news\/claude-sonnet-4-5\">Claude Sonnet 4.5<\/a>) and\u00a0<a href=\"https:\/\/www.wpp.com\/en\/news\/2026\/01\/wpp-launches-agent-hub-on-wpp-open-providing-clients-with-access-to-advanced-agentic-ai\"><strong>WPP's Creative Brain<\/strong><\/a>\u00a0\u2014 a multi-agent ideation system available through WPP Open's Agent Hub that wraps an LLM in a structured creative process, guiding it through strategic reframing, lateral thinking, and iterative refinement before producing a final concept. By testing Creative Brain alongside standalone models, the challenge reveals how much of creative quality comes from the model itself versus the orchestration around it.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Because the raw evaluation pillars operate on different scales \u2014 WPP uses a 0\u201312 sum, FFE averages to 0\u20133, while UOS and UUU average to 0\u20135 \u2014 comparing or summing the raw scores directly would have been misleading. All scores were therefore normalised to a common\u00a0<strong>0\u201310 range<\/strong>\u00a0before summing, so that each pillar contributes equally to a\u00a0<strong>maximum total of 40<\/strong>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Two scoring sub-agents, OSCAI and Semiotics, were excluded from the evaluations because of their high <strong>scoring variance<\/strong> (see Section 3.6 of the technical report).<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>1. 
GPT-5: the gamification grandmaster<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>GPT-5 led the pack with a total score of\u00a0<strong>37.44<\/strong>. Its winning strategy,\u00a0<strong>\"Co\u2011op Mode: Make Home Happen,\"<\/strong>\u00a0reframed the entire experience of home life as a collaborative video game the whole family plays together. The central insight: small home changes don't stick because they don't feel\u00a0<em>rewarding<\/em> but games do. By turning clutter into the \"boss\" and the furniture company\u2019s solutions into unlockable \"side quests,\" GPT-5 built an end-to-end ecosystem that made organisation feel joyful rather than obligatory.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":true} -->\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Pillar<\/strong><\/th><th><strong>Initiative<\/strong><\/th><th><strong>Description &amp; Impact<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>1. Launch Film<\/strong><\/td><td><strong>\"Clutter is the Boss\"<\/strong><\/td><td>A 60s hero film where a family's living room transforms into a co\u2011op game interface\u2014prompts like\u00a0<strong>\"Inventory Full\"<\/strong>\u00a0and\u00a0<strong>\"Find Calm\"<\/strong>\u00a0appear as they \"equip\" products to defeat the clutter boss. No spoken lines, region-agnostic SFX.\u00a0<em>Reframes organisation as play; makes the campaign cross-market viable without localisation.<\/em><\/td><\/tr><tr><td><strong>2. 
Social\/UGC<\/strong><\/td><td><strong>\"Home Side Quests\"<\/strong><\/td><td>Weekly micro-challenges (under 10 min) like \"Create a charging dock\" or \"Flip sofa to study.\" Augmented Reality (AR) filters add level meters and badges; families post before\/afters to the\u00a0<strong>#CoOpHome Challenge<\/strong>\u00a0with creator duets from gaming and parent influencers.\u00a0<em>Drives sustained engagement and organic reach through participatory content.<\/em><\/td><\/tr><tr><td><strong>3. Retail\/Shoppable<\/strong><\/td><td><strong>\"Co\u2011op Kits\"<\/strong><\/td><td>Curated in-store and online bundles organised by\u00a0<em>mission<\/em>\u2014Study Calm, Party Fast Reset, Balcony Green Break\u2014each bundling storage + lighting + organisers. Every kit includes a QR\u00a0<strong>\"Quest Card\"<\/strong>\u00a0guiding micro-steps with estimated time saved.\u00a0<em>Turns the purchase into the start of a new quest; bridges content to commerce.<\/em><\/td><\/tr><tr><td><strong>4. In-Store Experience<\/strong><\/td><td><strong>\"Demo Levels\"<\/strong><\/td><td>Stores host timed tidy-up challenges where kids and parents compete together to reorganise a mock room against the clock. Winners earn collectible stickers and discount codes.\u00a0<em>Transforms the retail visit into an extension of the campaign's game logic; drives foot traffic through experiential play.<\/em><\/td><\/tr><tr><td><strong>5. 
Digital Ads<\/strong><\/td><td><strong>\"Skip the Clutter\"<\/strong><\/td><td>YouTube bumpers using skip-button logic\u2014\"Skip the clutter in 3\u20262\u20261\"\u2014to land a single, punchy product solve within seconds.\u00a0<em>Leverages ad format mechanics as creative device; delivers product utility in pre-roll.<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 3: Campaign idea produced by GPT-5.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The campaign's minimal-dialogue, SFX-driven creative approach ensured cross-market viability across North Asian markets without localisation, while the consistent game language (\"side quests,\" \"boss,\" \"level up,\" \"equip\") created a unified system that scored perfectly on coherence. Every asset closed with the same recontextualised tagline,\u00a0<em>\"Furniture company. Make Home Happen.\"<\/em>, framed not as an aspiration, but as a mission objective.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>2. Creative Brain powered by Gemini: the multiverse architect<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>The Creative Brain powered by Gemini followed closely with a score of\u00a0<strong>36.95<\/strong>. Its\u00a0<strong>\"Room-Sync Chronicles\"<\/strong>\u00a0campaign took a fundamentally different approach: rather than gamifying the\u00a0<em>solution<\/em>, it gamified the\u00a0<em>problem<\/em>. The central reframe, that every piece of \"clutter\" in a preteen's bedroom is actually a physical anchor for their digital and imagined worlds, gave parents a radically empathetic lens through which to see their child's space. 
The furniture company storage wasn't positioned as a way to \"hide the mess\" but as a way to \"power the multiverse.\"<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":true} -->\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Pillar<\/strong><\/th><th><strong>Initiative<\/strong><\/th><th><strong>Description &amp; Impact<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>1. Launch Film<\/strong><\/td><td><strong>\"Reality Reboot\"<\/strong><\/td><td>A cinematic 60s spot where a mother opens her son's door and sees a mess\u2014but the room\u00a0<strong>\"glitches\"<\/strong>\u00a0into a high-fidelity game landscape. A box unit becomes a\u00a0<strong>loot chest<\/strong>; a gaming chair becomes a\u00a0<strong>pilot's cockpit<\/strong>. VFX borrows from game trailers.\u00a0<em>Subverts the \"messy room\" trope by revealing the child's imaginative reality; makes furniture feel epic.<\/em><\/td><\/tr><tr><td><strong>2. Narrative Arc<\/strong><\/td><td><strong>\"Co-Op Reorganisation\"<\/strong><\/td><td>The film follows mother and son \"co-oping\" a reorganisation \u2014 but the goal isn't to\u00a0<em>clean<\/em>. It's to\u00a0<strong>\"optimise the map\"<\/strong>\u00a0for his next quest. Storage becomes power-ups that expand the room's modes: gaming arena, creative studio, family bonding zone.\u00a0<em>Reframes tidying as collaborative strategic upgrade, not parental demand.<\/em><\/td><\/tr><tr><td><strong>3. Social\/AR<\/strong><\/td><td><strong>\"Skin Your Room\"<\/strong><\/td><td>A social platform where kids apply AR filters to their real rooms, overlaying fantasy skins that reveal the\u00a0<strong>\"epic reality\"<\/strong>\u00a0hidden behind the furniture. Kids share their \"skinned\" rooms with parents to bridge the perception gap.<\/td><\/tr><tr><td><strong>4. 
Brand Positioning<\/strong><\/td><td><strong>\"Powering the Multiverse\"<\/strong><\/td><td>Storage is repositioned as an\u00a0<strong>identity enabler<\/strong>\u00a0for developing preteens \u2014 a tool that supports creativity, gaming, and emerging selfhood. Storage doesn't organise a room; it powers the multiverse being built inside it.\u00a0<em>Elevates product value proposition from functional to emotional and developmental.<\/em><\/td><\/tr><tr><td><strong>5. Strategic Inversion<\/strong><\/td><td><strong>Child-First Perspective<\/strong><\/td><td>The campaign validates the preteen's perspective\u00a0<em>first<\/em>, then invites the parent in, inverting the typical home-brand default of centering the adult buyer's desire for order.\u00a0<em>Boldest strategic choice in the cohort; highest risk, highest differentiation.<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 4: Campaign idea produced by Creative Brain.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Underlying the entire campaign was a provocative strategic choice that the evaluator flagged as both its greatest strength and its primary risk: centering the child's worldview over the parent's. Where most home brands default to the adult buyer's desire for order, Room-Sync starts from the preteen's experience and invites the parent to see through their eyes. This inversion earned it the cohort's highest Unforgettable score, but the Creative Evaluation Agent noted the heavy RPG metaphor might alienate parents unfamiliar with gaming culture.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator -->\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>3. 
Claude Sonnet 4.5: the reality show provocateur<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Claude Sonnet 4.5 followed with a score of\u00a0<strong>35.12<\/strong>. Its\u00a0<strong>\"The Remodel Squad\"<\/strong>\u00a0strategy took the most grounded, human-first approach of the cohort. Where GPT-5 and Gemini leaned into fantasy and gamification, Claude leaned into\u00a0<em>documentary authenticity,<\/em> positioning real family friction not as a problem to be solved but as the raw material for genuine connection. The core creative bet: audiences are tired of aspirational perfection and will respond to the messy, funny, emotional truth of families actually trying to share space.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":true} -->\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Pillar<\/strong><\/th><th><strong>Initiative<\/strong><\/th><th><strong>Description &amp; Impact<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>1. Launch Film<\/strong><\/td><td><strong>\"The Beautiful Mess\"<\/strong><\/td><td>A 60s hero spot showing a parent-child duo hilariously debating the furniture company\u2019s storage solutions, arguing over shelf heights, drawer labels, and who gets the corner nook. The punchline: they both want the same thing.\u00a0<em>Designed to feel like a documentary moment, not an ad; builds instant relatability.<\/em><\/td><\/tr><tr><td><strong>2. Content Series<\/strong><\/td><td><strong>\"The Remodel Squad\"<\/strong><\/td><td>6\u20138 episodes (5\u20138 min each) where real families nominate the room causing the most conflict. 
One preteen + one parent form a \"Remodel Squad\" to transform it using the company\u2019s solutions but\u00a0<strong>they must agree on every single decision.<\/strong>\u00a0Captures hilarious negotiations, compromises, and breakthroughs.\u00a0<em>Friction becomes the creative fuel; the format is inherently dramatic and bingeable.<\/em><\/td><\/tr><tr><td><strong>3. Episode Structure<\/strong><\/td><td><strong>Three-Act Arc<\/strong><\/td><td>Each episode follows:\u00a0<strong>(1)<\/strong>\u00a0the conflict audit (what's wrong, who's to blame),\u00a0<strong>(2)<\/strong>\u00a0the design negotiation (friction-fueled creative process),\u00a0<strong>(3)<\/strong>\u00a0the heartwarming reveal \u2014 not a \"ta-da\" moment, but the family's first\u00a0<em>natural<\/em>\u00a0interaction in the new space.\u00a0<em>Emotional payoff is relational, not just spatial.<\/em><\/td><\/tr><tr><td><strong>4. Social\/Viral<\/strong><\/td><td><strong>\"15-Second Cutdowns\"<\/strong><\/td><td>Bite-sized content isolating the series' most relatable moments: funniest negotiation standoffs, most dramatic before\/afters, quiet breakthrough moments. Designed as standalone viral units that drive viewership back to full episodes.\u00a0<em>Engineered for shareability across short-form platforms.<\/em><\/td><\/tr><tr><td><strong>5. Interactive Tool<\/strong><\/td><td><strong>\"Conflict Zone\" AR App<\/strong><\/td><td>Families scan their own problem rooms and collaboratively visualise the company\u2019s solutions in situ. Families can tag their room's \"conflict level\" and share proposed redesigns. 
Emphasis on\u00a0<strong>joint decision-making<\/strong>\u00a0over individual play.\u00a0<em>Extends the show's premise into every family's home; sparks \"productive arguments.\"<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 5: Campaign idea produced by Claude Sonnet 4.5.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The campaign's greatest strength was its emotional granularity. Rather than offering a single visual payoff, each episode promised a different family, a different room, and a different set of negotiations \u2014 creating a content engine with built-in variety and repeatability. The Creativity Evaluation Agent awarded it the cohort's highest Usefulness score, for its practical alignment with the brief's tone requirements, but noted that reality-renovation formats carry inherent category familiarity, reflected in its lower Uniqueness score. Every piece of content closed with the family in their transformed space and the line:\u00a0<em>\"Make Home Happen.\"<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator -->\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>4. Gemini 3: the diplomatic provocateur<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Gemini 3 rounded out the cohort with a score of\u00a0<strong>32.48<\/strong>. Its\u00a0<strong>\"The Domestic Peace Accords\"<\/strong>\u00a0took the boldest\u00a0<em>tonal<\/em>\u00a0swing of the group, making a singular creative bet: position the furniture company\u2019s products as essential diplomatic tools to resolve the \"cold war\" between generations, executed entirely through the visual grammar of geopolitical thrillers. 
Where other entries built broad ecosystems, this idea invested everything in the power of one perfectly realised metaphor.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":true} -->\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Pillar<\/strong><\/th><th><strong>Initiative<\/strong><\/th><th><strong>Description &amp; Impact<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>1. Core Metaphor<\/strong><\/td><td><strong>\"Furniture as Diplomacy\"<\/strong><\/td><td>The parent-preteen conflict is reframed as a genuine\u00a0<strong>geopolitical standoff<\/strong> \u2014 a \"cold war\" between factions with irreconcilable demands (minimalist calm vs. messy independence) over limited territory (square footage). The products are repositioned as\u00a0<strong>\"diplomatic tools\"<\/strong>\u00a0that broker peace.\u00a0<em>Elevates a mundane domestic problem to dramatic, absurd, memorable heights.<\/em><\/td><\/tr><tr><td><strong>2. Launch Film Series<\/strong><\/td><td><strong>\"The Negotiations\"<\/strong><\/td><td>Spots filmed in the style of high-stakes political thrillers, <em>Tinker Tailor Soldier Spy<\/em>\u00a0meets a bookcase. Parent and preteen sit at opposite ends of a long table in a\u00a0<strong>dim, dramatically lit room<\/strong>, sliding \"terms\" across: a pegboard for gaming gear\u00a0<strong>in exchange for<\/strong>\u00a0a clean floor; a sound-absorbing curtain for privacy\u00a0<strong>in exchange for<\/strong>\u00a0family dinner attendance.\u00a0<em>Every product is a bargaining chip with a story.<\/em><\/td><\/tr><tr><td><strong>3. Visual Payoff<\/strong><\/td><td><strong>\"Treaty Signed\"<\/strong><\/td><td>Each spot resolves in a single, sharp cut: the dim negotiation room gives way to a\u00a0<strong>bright, airy, reorganised living space<\/strong>\u00a0where both parties co-exist happily. 
The tonal whiplash from spy-thriller gravity to domestic warmth\u00a0<em>is<\/em>\u00a0the joke.\u00a0<em>Designed to be the defining shareable moment audiences remember and recount.<\/em><\/td><\/tr><tr><td><strong>4. Product as Plot<\/strong><\/td><td><strong>Narrative Integration<\/strong><\/td><td>Unlike campaigns where products are set dressing, every item functions as a\u00a0<strong>narrative object,<\/strong> a concession, a peace offering, a treaty clause. The pegboard isn't \"organised storage\"; it's the term that bought a clean floor. The bin is the clause that secured family movie night.\u00a0<em>Gives each product a story and a reason for being that transcends traditional placement.<\/em><\/td><\/tr><tr><td><strong>5. Tagline Reframe<\/strong><\/td><td><strong>\"Make Home Happen\" as Treaty<\/strong><\/td><td>The company\u2019s existing tagline is repositioned not as an aspiration but as the\u00a0<strong>terms of a negotiated truce,<\/strong> smart organisation that lets parents reclaim visual calm while granting preteens the \"cool functional territory\" they demand.\u00a0<em>Breathes new strategic life into existing brand language.<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 6: Campaign idea produced by Gemini 3.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The Domestic Peace Accords had the strongest semiotic coherence: every element, from language (\"cold war,\" \"treaty,\" \"terms\") to visual style (dim thriller lighting vs. bright domestic reveal) to product role (bargaining chips), reinforced one unified meaning system without contradiction. It also scored the highest Unexpected rating. 
However, its singular focus proved to be a double-edged sword: by investing entirely in one metaphor executed through one format (film spots), it presented no secondary executions, platforms, or conceptual categories, pulling its Total FFE down.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator -->\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>The top contenders<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:table {\"hasFixedLayout\":true} -->\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Model<\/strong><\/th><th><strong>WPP Score<\/strong><\/th><th><strong>FFE Score<\/strong><\/th><th><strong>UOS Score<\/strong><\/th><th><strong>UUU Score<\/strong><\/th><th><strong>Total Score<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>GPT-5<\/strong><\/td><td><strong>9.50<\/strong><\/td><td><strong>10.00<\/strong><\/td><td><strong>9.60<\/strong><\/td><td><strong>8.34<\/strong><\/td><td><strong>37.44<\/strong><\/td><\/tr><tr><td>Gemini 3 (Creative Brain)<\/td><td>9.21<\/td><td>10.00<\/td><td>9.40<\/td><td>8.34<\/td><td>36.95<\/td><\/tr><tr><td>Claude Sonnet 4.5<\/td><td>8.92<\/td><td>10.00<\/td><td>8.60<\/td><td>7.60<\/td><td>35.12<\/td><\/tr><tr><td>Gemini 3<\/td><td>8.75<\/td><td>6.67<\/td><td>9.20<\/td><td>7.86<\/td><td>32.48<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 7: Final individual and total scores of the 4 different LLMs for the furniture company case study.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Three of four models scored a perfect FFE (10.00), meaning raw creative thinking was comparable across the board. The separation came from WPP Score and UOS.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>GPT-5<\/strong>\u00a0posted the highest WPP (9.50). 
The WPP Agent cited \"multiple direct pathways to purchase and engagement\" and noted \"exceptional focus on the stated business challenge.\" The UOS Agent awarded 9.60: \"meticulously crafted, creatively addressing every aspect of the client's brief with seamless logical flow.\"<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Creative Brain<\/strong>\u00a0matched GPT-5 on FFE (10.00) and UUU (8.34). The UUU Agent noted \"the core visual of the room 'glitching' into an RPG world creates an incredibly strong and distinct defining moment.\" The WPP gap (9.21 vs. 9.50) traced to a coherence flag: \"the heavy reliance on gaming metaphors might alienate or confuse parents who are not immersed in digital culture.\"<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Claude Sonnet 4.5<\/strong>\u00a0earned the cohort's highest Usefulness score \u2014 the UOS Agent praised its \"practical alignment with the client's brief, particularly in embracing realistic family conflict rather than a 'too warm or safe' tone.\" The UUU Agent observed it \"takes a common format (reality renovation show) and infuses it with a fresh twist\" \u2014 but the format itself limited differentiation.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Gemini 3<\/strong>\u00a0earned the highest Unexpected rating \u2014 the UUU Agent called the \"juxtaposition of the mundane struggle for space with the gravitas of political thriller negotiations a brilliant flip.\" But the FFE Agent recorded zero Flexibility: \"a single, unified marketing campaign concept\" with \"no multiple distinct conceptual categories,\" and the WPP Agent noted it \"doesn't create a new utility or platform.\"<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator -->\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator -->\n\n<!-- wp:heading {\"level\":1} -->\n<h1 class=\"wp-block-heading\"><strong>The creative Elo tournament<\/strong><\/h1>\n<!-- 
\/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>A single brief can't tell us which model is\u00a0<em>consistently<\/em>\u00a0creative. To answer that, we expanded the experiment: each model was given\u00a0<strong>multiple diverse briefs<\/strong>\u00a0spanning different brands, categories, and creative challenges, and every output was scored by the same evaluation pipeline.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>We needed a ranking system that captured consistency against competition \u2014 not just average scores. An Elo rating is a numerical score that reflects a competitor's relative skill based purely on head-to-head outcomes. The higher the rating, the stronger the performer. We turned to\u00a0<a href=\"https:\/\/en.wikipedia.org\/wiki\/Glicko_rating_system\"><strong>Glicko-2<\/strong><\/a>, a refinement of the Elo approach used in competitive chess, CS:GO, and Dota 2. Every head-to-head match is a data point: if idea A beats idea B, A gains rating and B loses it. Glicko-2 also tracks a\u00a0<strong>rating deviation (RD)<\/strong>\u00a0for each player, a measure of how uncertain the rating is, which shrinks as more matches are played.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>The players<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\">\n<!-- wp:list-item -->\n<li><strong>GPT-5<\/strong><\/li>\n<!-- \/wp:list-item -->\n<!-- wp:list-item -->\n<li><strong>Gemini 3<\/strong><\/li>\n<!-- \/wp:list-item -->\n<!-- wp:list-item -->\n<li><strong>Gemini 2.5<\/strong><\/li>\n<!-- \/wp:list-item -->\n<!-- wp:list-item -->\n<li><strong>Claude Sonnet 4.5<\/strong><\/li>\n<!-- \/wp:list-item -->\n<!-- wp:list-item -->\n<li><strong>Creative Brain<\/strong>\u00a0\u2014 built on Gemini 3 with an optimised prompting architecture<\/li>\n<!-- \/wp:list-item -->\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p>Four models received a standardised prompt. 
The\u00a0<strong>Creative Brain<\/strong>\u00a0received the same brief but processed it through its own multi-agent ideation pipeline \u2014 testing whether orchestration outperforms raw model capability.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The\u00a0<strong>Creativity Evaluation Agent<\/strong>\u00a0judged every idea independently. An orchestration engine simulated head-to-head matches by comparing normalised scores for the same brief. The winner of each match was the idea with the higher normalised composite score on the common metrics. The result:\u00a0<strong>210 unique creative matches<\/strong>\u00a0across 5 models and 14 global brands, run over 3 iterations.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>The results<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><strong>Ranked across WPP Score<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":true} -->\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Rank<\/strong><\/th><th><strong>Player<\/strong><\/th><th><strong>Rating<\/strong><\/th><th><strong>RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td>\ud83e\udd47<\/td><td><strong>Creative Brain<\/strong>\u00a0(Gemini 3)<\/td><td>1889<\/td><td>84.8<\/td><\/tr><tr><td>\ud83e\udd48<\/td><td>GPT-5<\/td><td>1858<\/td><td>92.3<\/td><\/tr><tr><td>\ud83e\udd49<\/td><td>Claude Sonnet 4.5<\/td><td>1529<\/td><td>83.9<\/td><\/tr><tr><td>4<\/td><td>Gemini 3<\/td><td>1169<\/td><td>90.7<\/td><\/tr><tr><td>5<\/td><td>Gemini 2.5<\/td><td>962<\/td><td>104.2<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 8: Elo ratings on the WPP score.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Ranked across all frameworks<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":true} -->\n<figure class=\"wp-block-table\"><table 
class=\"has-fixed-layout\"><thead><tr><th><strong>Rank<\/strong><\/th><th><strong>Player<\/strong><\/th><th><strong>Rating<\/strong><\/th><th><strong>RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td>\ud83e\udd47<\/td><td>GPT-5<\/td><td>1940<\/td><td>84.3<\/td><\/tr><tr><td>\ud83e\udd48<\/td><td><strong>Creative Brain<\/strong>\u00a0(Gemini 3)<\/td><td>1927<\/td><td>79.2<\/td><\/tr><tr><td>\ud83e\udd49<\/td><td>Claude Sonnet 4.5<\/td><td>1378<\/td><td>77.4<\/td><\/tr><tr><td>4<\/td><td>Gemini 3<\/td><td>1216<\/td><td>86.6<\/td><\/tr><tr><td>5<\/td><td>Gemini 2.5<\/td><td>861<\/td><td>106.3<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 9: Elo ratings across all frameworks.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Creative Brain leads on WPP criteria<\/strong>\u00a0\u2014 delivering a measurable advantage when judged against industry-specific creative standards. On the WPP leaderboard, the gap between Creative Brain and standalone Gemini 3 is\u00a0<strong>720 rating points<\/strong> (1889 vs. 1169).\u00a0<strong>GPT-5 is the strongest all-rounder<\/strong>\u00a0\u2014 topping the all-frameworks leaderboard with ideas that score well across the broadest range of creative dimensions.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator -->\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator -->\n\n<!-- wp:heading {\"level\":1} -->\n<h1 class=\"wp-block-heading\"><strong>Key insights &amp; findings<\/strong><\/h1>\n<!-- \/wp:heading -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\">\n<!-- wp:list-item -->\n<li><strong>Creative Brain dramatically outperforms standalone Gemini 3.<\/strong>\u00a0Same underlying model, yet a gap of 720 rating points on WPP criteria and 711 across all frameworks. The structured ideation process consistently elevated creative output beyond what Gemini 3 could produce alone. 
Notably, Creative Brain was not optimised against the WPP scoring criteria \u2014 its strong performance emerged naturally from a better creative process.<\/li>\n<!-- \/wp:list-item -->\n<!-- wp:list-item -->\n<li><strong>The right judge model is foundational.<\/strong>\u00a0Figure 4 shows that Gemini 2.5 averaged an error of 2.2 against human ground truth, Claude Sonnet narrowed it to 1.0, and Gemini 3 achieved 0.7. Selecting the judge model determines whether the entire system tracks human judgment or drifts from it.<\/li>\n<!-- \/wp:list-item -->\n<!-- wp:list-item -->\n<li><strong>Human judgment remains essential at the margins.<\/strong>\u00a0When total scores differ by less than a point \u2014 as with GPT-5 (23.37) vs. Creative Brain (22.92) \u2014 the agent surfaces meaningfully different trade-offs that scores alone cannot resolve. The system's value lies in ensuring the right ideas and evidence reach the table, not in replacing human judgment.<\/li>\n<!-- \/wp:list-item -->\n<!-- wp:list-item -->\n<li><strong>Evaluator reliability is a measured property, not an assumption.<\/strong>\u00a0Across repeated independent runs on the same ideas, scores held stable with low standard deviation. 
This repeatability is what allows every other finding to be treated as signal rather than noise.<\/li>\n<!-- \/wp:list-item -->\n<\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:separator -->\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator -->\n\n<!-- wp:heading {\"level\":1} -->\n<h1 class=\"wp-block-heading\"><strong>Conclusion &amp; impact<\/strong><\/h1>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>This work demonstrates that\u00a0<strong>scalable, repeatable creative evaluation using LLMs is practical today<\/strong>, provided the system is built with the right scaffolding: calibrated judge models, few-shot anchoring, multi-dimensional scoring, and cross-model validation.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>In practice, the Creativity Evaluation Agent enables:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\">\n<!-- wp:list-item -->\n<li><strong>Faster iteration<\/strong>\u00a0\u2014 stress-test dozens of creative directions in minutes rather than weeks, before committing production budgets.<\/li>\n<!-- \/wp:list-item -->\n<!-- wp:list-item -->\n<li><strong>Comparable benchmarking<\/strong>\u00a0\u2014 evaluate models, prompting strategies, and agentic architectures on common ground with a shared, reproducible rubric.<\/li>\n<!-- \/wp:list-item -->\n<!-- wp:list-item -->\n<li><strong>Diagnosable feedback<\/strong>\u00a0\u2014 learn not just\u00a0<em>that<\/em>\u00a0an idea underperformed, but\u00a0<em>where<\/em>\u00a0and\u00a0<em>why<\/em>, with dimension-level scores and qualitative commentary that teams can act on immediately.<\/li>\n<!-- \/wp:list-item -->\n<!-- wp:list-item -->\n<li><strong>Creative governance<\/strong>\u00a0\u2014 an auditable, explainable evaluation process that scales alongside the growing volume of AI-generated creative, giving organisations confidence and consistency as they adopt generative tools.<\/li>\n<!-- \/wp:list-item -->\n<\/ul>\n<!-- 
\/wp:list -->\n\n<!-- wp:separator -->\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator -->\n\n<!-- wp:paragraph -->\n<p>Ready to explore the specifics? Read our full technical deep dive into the\u00a0<a href=\"https:\/\/research.wpp.com\/pods\/brand-perception-atlas-pod\">Creativity Evaluation Agent Pod<\/a> for a closer look at our methodology.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><em>Disclaimer: This content was created with AI assistance. All research and conclusions are the work of WPP Research<\/em>.<\/p>\n<!-- \/wp:paragraph -->","content_quarter":"","related_pods":["1454"],"featured":"0","legacy_perspective_source_id":"1456"},"_links":{"self":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/research_feed\/1706","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/research_feed"}],"about":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/types\/research_feed"}],"author":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/users\/18"}],"acf:post":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/research_pods\/1454"}],"wp:attachment":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1706"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1706"},{"taxonomy":"content_type","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcontent_types&post=1706"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=1706"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}