{"id":1654,"date":"2026-05-11T11:09:11","date_gmt":"2026-05-11T11:09:11","guid":{"rendered":"https:\/\/cms.research.wpp.com\/?post_type=research_feed&#038;p=1654"},"modified":"2026-06-26T07:53:00","modified_gmt":"2026-06-26T07:53:00","slug":"creativity-evaluation-pod-technical-walkthrough","status":"publish","type":"research_feed","link":"https:\/\/cms.research.wpp.com\/?research_feed=creativity-evaluation-pod-technical-walkthrough","title":{"rendered":"Creativity evaluation pod: Technical walkthrough"},"content":{"rendered":"\n<p class=\"is-style-text-annotation is-style-text-annotation--1 wp-block-paragraph\"><em>Creative ideas are the primary driver of advertising impact, yet evaluating them at scale remains stubbornly subjective \u2014 human panels are expensive, slow, inconsistent across evaluators, and impossible to run repeatedly as the volume of AI-generated concepts grows. The core problem is that creativity is multidimensional: a single aggregate score fails to capture whether an idea is original, strategically aligned, culturally resonant, or memorable, and without a shared, repeatable rubric, teams cannot meaningfully compare outputs across models, prompts, or campaigns. To address this, we built the <strong>Creativity Evaluation Agent<\/strong>, which scores marketing ideas in parallel across six established creativity frameworks \u2014 an Internal WPP, UOS, FFE, UUU, OSCAI, and Semiotics scores \u2014 using specialised Large Language Model (LLM) sub-agents with critic-refiner loops to ensure consistency, returning dimension-level scores alongside qualitative commentary in a single structured report. 
Calibrated against human expert ground truth, the system achieved a scoring error as low as 0.7 points (Gemini 3) with high repeatability on the internal WPP score framework (\u03c3 \u2248 0.21), and in a 210-match tournament across 14 global brands, it reliably differentiated creative quality between five frontier models \u2014 revealing that a specialised agentic creative system consistently outperformed vanilla LLMs given the same brief, giving marketing teams a fast, interpretable, and auditable way to benchmark and iterate on creative output before committing production resources.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>This document details the technical architecture, calibration methodology, and experimental design underlying the system built to address these gaps. For results and strategic findings, read&nbsp;our blog post instead.<\/strong><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">Problem statement and motivation<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Everyone agrees creativity matters in marketing. Nobody agrees on how to measure it.\n\n\n\n\n\nPut the same campaign idea in front of five reviewers and you&#8217;ll get five different scores. One loves the visual metaphor, another thinks the tagline falls flat, a third is just tired after reviewing thirty concepts before lunch. The scores reflect taste and circumstance as much as they reflect the work. This is fine when you&#8217;re picking between two finalist campaigns in a boardroom \u2014 it falls apart the moment you need to evaluate at scale.\n\n\n\n\n\nAnd scale is exactly what modern marketing demands. Teams are generating more ideas than ever, increasingly with the help of generative AI. 
They need to&nbsp;<strong>screen hundreds of concepts quickly<\/strong>,&nbsp;<strong>understand what specifically makes one idea stronger than another<\/strong>, and&nbsp;<strong>benchmark creative output<\/strong>&nbsp;across different models, prompts, teams, and time periods. A human review panel can do the first job slowly, the second job inconsistently, and the third job barely at all\u2014while being expensive to convene every time.\n\n\n\n\n\nThe core issues are straightforward:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Subjectivity<\/strong>&nbsp;\u2014 without a shared rubric, two reviewers scoring the same idea can land in completely different places.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Scalability<\/strong>&nbsp;\u2014 manual evaluation doesn&#8217;t survive contact with hundreds of ideas per sprint.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Feedback quality<\/strong>&nbsp;\u2014 a score without explanation is useless for iteration; explanations vary wildly across evaluators.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cost and repeatability<\/strong>&nbsp;\u2014 assembling expert panels is slow and expensive, and running the same panel twice doesn&#8217;t guarantee the same results.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">What&#8217;s missing is a system that can apply&nbsp;<strong>structured, reproducible, explainable<\/strong>&nbsp;creativity assessment across large volumes of work \u2014 fast enough to be useful and consistent enough to be trusted.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">1. Introduction and solution overview<\/h1>\n\n\n\n<p class=\"wp-block-paragraph\">Evaluating marketing creativity at scale demands more than a single score from a single judge. 
The&nbsp;<strong>Creativity Evaluation Agent<\/strong>, built on&nbsp;<a href=\"https:\/\/google.github.io\/adk-docs\/\">Google&#8217;s Agent Development Kit (ADK)<\/a>, extends the established&nbsp;<em>LLM-as-a-Judge<\/em>&nbsp;paradigm by introducing a multi-agent system in which specialised sub-agents score marketing ideas across&nbsp;<strong>six complementary creativity frameworks<\/strong>, each covering a distinct slice of what practitioners consider &#8220;good creativity.&#8221;\n\n\n\n\n\nEvery scoring sub-agent is grounded through&nbsp;<strong>few-shot examples<\/strong>&nbsp;that teach the underlying LLM how the creative dimension it owns should be measured, narrowing the gap between automated and human judgement. The system accepts&nbsp;<strong>text, image, video, and PDF<\/strong>&nbsp;inputs, runs framework evaluations in parallel, and returns dimension-level scores together with qualitative commentary. It is accessible through both an&nbsp;<strong>API<\/strong>&nbsp;and a&nbsp;<strong>web UI<\/strong>.<\/p>\n\n\n\n<h1 class=\"wp-block-heading\">2. Technical approach<\/h1>\n\n\n\n<h3 class=\"wp-block-heading\">2.1. Architecture overview<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">At a high level, a user&#8217;s idea is received by a&nbsp;<strong>Root Agent<\/strong>, which parses the input and routes it to a&nbsp;<strong>Dynamic Parallel Orchestrator<\/strong>. 
The orchestrator spins up only the scoring pipelines the user has requested, runs them concurrently, and hands their outputs to a&nbsp;<strong>Report Agent<\/strong>&nbsp;that merges everything into a single structured JSON response.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"502\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-1024x502.jpg\" alt=\"\" class=\"wp-image-1781\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-1024x502.jpg 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-300x147.jpg 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-768x377.jpg 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-1536x754.jpg 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-2048x1005.jpg 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 1: Architecture of the Creativity Evaluation Agent.<\/figcaption><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">2.2. Custom orchestration engine<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The core implementation is the&nbsp;<strong><code>Creativity Evaluation Agent<\/code><\/strong>, a custom ADK&nbsp;<code>BaseAgent<\/code>. Its behaviour breaks down as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Root Agent.<\/strong>&nbsp;The user-facing entry point. It can answer questions, hold a conversation, and \u2014 when the user supplies a creative idea \u2014 forward it for evaluation. It decides&nbsp;<em>which<\/em>&nbsp;frameworks to invoke based on the user&#8217;s request.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Dynamic pipeline construction.<\/strong>&nbsp;Pipelines are&nbsp;<em>not<\/em>&nbsp;built ahead of time. 
Based on what the user asks for, the orchestrator assembles only the relevant evaluation chains, then executes them&nbsp;<strong>in parallel<\/strong>.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Critic\u2013refiner loop.<\/strong>&nbsp;After initial scoring, each pipeline runs a bounded critic\u2013refiner cycle (up to&nbsp;<strong>two iterations<\/strong>) in which a critic agent reviews the scores for obvious errors or inconsistencies. If the critic flags an issue, the refiner adjusts before the result is finalised.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Report Agent.<\/strong>&nbsp;Once all pipelines complete, this agent compiles dimension-level scores and qualitative commentary into a single, consistently formatted output. When the user has submitted multiple ideas, the report includes a&nbsp;<strong>comparative analysis<\/strong>.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Each scoring pipeline is built around an&nbsp;<code>LlmAgent<\/code>&nbsp;instance initialised with a detailed system message encoding its evaluation lens, together with&nbsp;<strong>few-shot examples<\/strong>&nbsp;that anchor outputs close to human scoring behaviour. Scores are emitted as&nbsp;<strong>continuous values<\/strong>&nbsp;(e.g. 1.2, 3.7) rather than discrete integers, matching the granularity of the ground-truth datasets. 
The six pipelines are:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>WPP Score Pipeline.<\/strong>&nbsp;Scores ideas against a proprietary WPP creativity framework built around four dimensions:\n<ul class=\"wp-block-list\">\n<li>how sharply the idea frames the business challenge, not just the marketing opportunity<\/li>\n\n\n\n<li>how boldly it challenges category convention and subverts clich\u00e9s<\/li>\n\n\n\n<li>how authentically the proposed solution fits the brand and resonates with the audience<\/li>\n\n\n\n<li>the scale of measurable growth and emotional response it is designed to deliver<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Each dimension is scored 0\u20133 points, and the four scores are summed into an index ranging 0\u201312. The framework was calibrated against real-world campaign performance across multiple brands and markets. Few-shot examples are drawn from WPP&#8217;s internal archive of historically scored campaigns.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/arxiv.org\/abs\/2510.04009\">Usefulness, Originality and Surprise<\/a> (UOS) Pipeline.<\/strong> Evaluates the classic definition of divergent creative value through three dimensions:&nbsp;<strong>Usefulness<\/strong>&nbsp;(does it solve a real problem and align with the brief&#8217;s constraints?),&nbsp;<strong>Originality<\/strong>&nbsp;(does it approach the problem in a novel way?), and&nbsp;<strong>Surprise<\/strong>&nbsp;(does it deliver an unexpected twist that captures attention?). 
The three dimension scores are aggregated into an overall UOS score.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/arxiv.org\/pdf\/2401.12491\">Fluency, Flexibility and Elaboration (FFE) Pipeline<\/a>.<\/strong> Quantifies the &#8220;mental engine&#8221; behind the idea through three dimensions:&nbsp;<strong>Fluency<\/strong>&nbsp;(how many distinct, relevant ideas are presented),&nbsp;<strong>Flexibility<\/strong>&nbsp;(how many different conceptual categories are explored), and&nbsp;<strong>Elaboration<\/strong>&nbsp;(how richly detailed and refined the idea is). The pipeline is grounded in creativity research literature and benchmarked against marketing creativity datasets. Few-shot examples are generated using <a href=\"https:\/\/en.wikipedia.org\/wiki\/Torrance_Tests_of_Creative_Thinking\">Torrance-style<\/a> divergent-thinking tasks (see Section 3.3).<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/arxiv.org\/pdf\/2509.09702\">Unique, Unexpected and Unforgettable (UUU) Pipeline<\/a>.<\/strong> Assesses brand longevity through the lens of a Creative Strategist:&nbsp;<strong>Unique<\/strong>&nbsp;(could only this idea deliver this message in this way?),&nbsp;<strong>Unexpected<\/strong>&nbsp;(does it subvert expectations and force re-evaluation?), and&nbsp;<strong>Unforgettable<\/strong>&nbsp;(does it create a defining moment that lives rent-free in the audience&#8217;s mind?). Each dimension is scored as a continuous value and averaged into an overall UUU score.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong><a href=\"https:\/\/www.sciencedirect.com\/science\/article\/abs\/pii\/S1871187123001256?via%3Dihub\">OSCAI<\/a> Pipeline.<\/strong>&nbsp;A two-stage pipeline for measuring conceptual distance. First, a sub-agent extracts&nbsp;<strong>semantic relations<\/strong>&nbsp;from the idea (e.g.,&nbsp;<em>man \u2192 eats \u2192 apple<\/em>). 
Those relations are sent to the&nbsp;<strong>OSCAI API<\/strong>, maintained by the framework&#8217;s original authors, which scores each relation&#8217;s originality \u2014 distinguishing mundane links (<em>a chef cooks dinner<\/em>) from highly original relationships (<em>a clown teaches mathematics<\/em>). The returned scores quantify the creative leap at the heart of the idea.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Semiotics Pipeline.<\/strong>&nbsp;Applies the <a href=\"https:\/\/link.springer.com\/chapter\/10.1007\/978-1-4757-9700-8_3\">Saussurean principles<\/a> of sign systems to decode how meaning is constructed through cultural symbols. The sub-agent analyses&nbsp;<strong>Denotation<\/strong>&nbsp;(literal content),&nbsp;<strong>Connotation<\/strong>&nbsp;(implied meaning),&nbsp;<strong>Myth<\/strong>&nbsp;(cultural narratives reinforced or challenged), the&nbsp;<strong>Semiotic Relation<\/strong>&nbsp;(additive, contradictory, etc.),&nbsp;<strong>Risks or tensions<\/strong>, and produces a&nbsp;<strong>Semiotic Coherence Score<\/strong>&nbsp;(0\u20133). Unlike the other pipelines, no few-shot examples are used \u2014 evaluation relies on the model&#8217;s inherent understanding of semiotic theory.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>2.3. 
Score normalisation<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The six frameworks operate on different native scales:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Framework<\/strong><\/th><th><strong>Native scoring<\/strong><\/th><th><strong>Range<\/strong><\/th><\/tr><\/thead><tbody><tr><td>WPP Score<\/td><td>Sum of 4 dimensions, each 0\u20133<\/td><td>0\u201312<\/td><\/tr><tr><td>FFE<\/td><td>Average of 3 dimensions, each 0\u20133<\/td><td>0\u20133<\/td><\/tr><tr><td>UOS<\/td><td>Average of 3 dimensions, each 0\u20135<\/td><td>0\u20135<\/td><\/tr><tr><td>UUU<\/td><td>Average of 3 dimensions, each 0\u20135<\/td><td>0\u20135<\/td><\/tr><tr><td>OSCAI<\/td><td>Single score<\/td><td>0\u20135<\/td><\/tr><tr><td>Semiotics<\/td><td>Coherence score<\/td><td>0\u20133<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 1: Creativity frameworks, their native scoring, and their ranges.\n\n\n\n\n\nDirect comparison or summation across frameworks is misleading without normalisation. 
The following procedure is applied:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Convert sums to averages.<\/strong>&nbsp;The WPP Score (a sum of four 0\u20133 dimensions) is divided by 4 to produce a 0\u20133 average, making it structurally comparable to other averaged scores.<\/li>\n<\/ol>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Rescale to a common 0\u201310 range.<\/strong>&nbsp;Each framework&#8217;s score is divided by its native maximum and multiplied by 10: <code>normalised_score = (raw_score \/ max_score) \u00d7 10<\/code>. For example, a UOS score of 3.5 on its 0\u20135 scale normalises to 7.0.<\/li>\n<\/ol>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Composite total.<\/strong>&nbsp;The 4 normalised pillar scores (WPP, FFE, UOS, UUU) are summed into a composite total with a&nbsp;<strong>maximum of 40<\/strong>, each pillar contributing equally.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>OSCAI and Semiotics<\/strong>&nbsp;are reported as standalone scores and are&nbsp;<strong>not<\/strong>&nbsp;included in the composite total. This decision was made because OSCAI depends on an external API with different reliability characteristics, and both OSCAI and Semiotics showed higher inter-run variability (\u03c3 \u2248 0.80, see Section 3.6), which would add noise to the composite.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>2.4. 
Infrastructure and deployment<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Concern<\/strong><\/th><th><strong>Technology<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Multi-agent orchestration<\/td><td><a href=\"https:\/\/google.github.io\/adk-docs\/\">ADK<\/a><\/td><\/tr><tr><td>Compute<\/td><td><a href=\"https:\/\/cloud.google.com\/run\">Google Cloud Run<\/a><\/td><\/tr><tr><td>LLM inference<\/td><td><a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex AI<\/a>&nbsp;\u2014&nbsp;<strong>Gemini 3 Pro<\/strong>&nbsp;as the primary judge model<\/td><\/tr><tr><td>Observability<\/td><td><a href=\"https:\/\/cloud.google.com\/logging\">Cloud Logging<\/a>&nbsp;+&nbsp;<a href=\"https:\/\/docs.cloud.google.com\/trace\/docs\">Cloud Trace<\/a><\/td><\/tr><tr><td>Container registry<\/td><td><a href=\"https:\/\/docs.cloud.google.com\/artifact-registry\/docs\/overview\">Artifact Registry<\/a><\/td><\/tr><tr><td>Agent-to-agent protocol<\/td><td><a href=\"https:\/\/a2a-protocol.org\/latest\/\">A2A<\/a><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 2: The Google tech stack employed.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. Ground truth data and system evaluation<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>3.1. Dataset overview<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Reliable automated scoring requires credible ground truth. Because no single public dataset covers all six frameworks, a combination of&nbsp;<strong>historical data<\/strong>&nbsp;and&nbsp;<strong>synthetically generated ground truth<\/strong>&nbsp;was used.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>3.2. WPP score \u2014 historical human judgements<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The ground-truth dataset comes from&nbsp;<strong>WPP&#8217;s internal archive<\/strong>&nbsp;of marketing campaign ideas submitted between 2020 and 2023. 
Creative professionals scored each idea across the WPP dimensions. From this corpus,&nbsp;<strong>6 scored ideas<\/strong>&nbsp;were selected as few-shot examples for the scoring sub-agent and&nbsp;<strong>10 additional ideas<\/strong>&nbsp;were reserved for its critic agent.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>3.3. FFE \u2014 synthetic data via Torrance-style tasks<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Few-shot examples for the FFE framework were generated using&nbsp;<strong>Gemini<\/strong>&nbsp;prompted with tasks modelled on the&nbsp;<a href=\"https:\/\/psycnet.apa.org\/doiLanding?doi=10.1037%2Ft05532-000\">Torrance Tests of Creative Thinking<\/a>. Each example pairs a divergent-thinking task, a response, a score, and a justification. For instance:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>Example 1 (Score 0):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Task:<\/strong>&nbsp;Please list unusual uses of a plastic bottle.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Response:<\/strong>&nbsp;1. Plant a seed in it. 2. Use it to water plants by poking holes. 3. Cut it in half to make a small planter. 4. Use it to store extra fertiliser.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Justification:<\/strong>&nbsp;All ideas fall under a single, narrow category (Gardening \/ Horticulture). No conceptual shift is demonstrated.<\/li>\n<\/ul>\n<\/blockquote>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>3.4. 
UOS &amp; UUU \u2014 community-sourced creative writing<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">No pre-existing ground truth was available for these two frameworks, so it was constructed in three steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Source corpus.<\/strong>&nbsp;The&nbsp;<a href=\"https:\/\/huggingface.co\/datasets\/euclaise\/WritingPrompts_preferences\">Creative Storytelling dataset<\/a>&nbsp;(stories from r\/WritingPrompts on Hugging Face) was used as raw material.<\/li>\n<\/ol>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Quality stratification.<\/strong>&nbsp;Stories were sorted by upvotes;&nbsp;<strong>11 highly upvoted<\/strong>&nbsp;and&nbsp;<strong>11 low-voted<\/strong>&nbsp;examples were selected to represent the ends of the quality spectrum.<\/li>\n<\/ol>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Automated annotation.<\/strong>&nbsp;These 22 stories, together with the formal definitions of the UOS and UUU dimensions, were fed to Gemini, which produced scored examples that serve as the few-shot ground truth for both frameworks.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>3.5. Scoring format<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">All scoring sub-agents output&nbsp;<strong>continuous values<\/strong>&nbsp;(e.g. 1.2, 2.8) rather than rounding to integers. This decision was made to stay consistent with the WPP ground-truth scores, which are themselves continuous, and to preserve finer-grained distinctions between ideas.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.6. Variability analysis<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Reliability was assessed on 5 campaign ideas (one per AI model) for a popular beverage brand. The benchmark agent rated each idea 3 times, and the standard deviation (std) across those 3 runs was computed per score. 
Each bar shows the average std across all 5 models, so taller bars mean the agent scores that dimension less consistently.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>The FFE (Fluency, Flexibility and Elaboration) score shows moderate variability, with an average std of ~0.3. Examining each creativity aspect that makes up the FFE score showed that the deviation is driven mainly by Fluency\u2019s variance. The likely cause is how Fluency is defined: \u201cEvaluate how many distinct, relevant ideas or solutions are presented. Count only meaningful and contextually appropriate ones (avoid repetition or vague statements).\u201d This is an inherently count-based metric, and the boundary between &#8220;distinct&#8221; and &#8220;overlapping&#8221; ideas requires a subjective judgement from the LLM.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li>OSCAI &amp; Semiotics show high variability, with an average std of ~0.8, which directly motivated their exclusion from the composite tournament score.<\/li>\n<\/ul>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"614\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/2_small-1024x614.png\" alt=\"\" class=\"wp-image-1783\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/2_small-1024x614.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/2_small-300x180.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/2_small-768x461.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/2_small.png 1200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Average score variability of each scoring sub-agent.<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"639\" 
src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/3_small-1024x639.png\" alt=\"\" class=\"wp-image-1784\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/3_small-1024x639.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/3_small-300x187.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/3_small-768x479.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/3_small.png 1200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 3: Score variability for the Fluency, Flexibility and Elaboration aspects that the FFE score measures.<\/figcaption><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>4. LLM evaluation tournament<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Full tournament results and key findings are covered in the blog post. This section documents the experimental design, technical implementation, per-framework results, and supplementary analysis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>4.1 Experimental design<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Players.<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Player<\/strong><\/th><th><strong>Description<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>GPT-5<\/strong><\/td><td>Standalone, standardised prompt<\/td><\/tr><tr><td><strong>Gemini 3<\/strong><\/td><td>Standalone, standardised prompt<\/td><\/tr><tr><td><strong>Gemini 2.5<\/strong><\/td><td>Standalone, standardised prompt<\/td><\/tr><tr><td><strong>Claude Sonnet 4.5<\/strong><\/td><td>Standalone, standardised prompt<\/td><\/tr><tr><td><strong>Creative Brain<\/strong><\/td><td>WPP&#8217;s multi-agent ideation system, built on Gemini 3<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 3: The LLMs that took part in the evaluation tournament.\n\n\n\n\n\n<strong>Prompt 
design.<\/strong>&nbsp;Four standalone models received a standardised, neutral prompt to ensure a level playing field:<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><em>&#8220;Give me a creative marketing idea\/campaign based on the brief. Your output must have a title and three sections: Challenge, Core Idea, and Execution.&#8221;<\/em><\/p>\n<\/blockquote>\n\n\n\n<p class=\"wp-block-paragraph\">The&nbsp;<strong>Creative Brain<\/strong>&nbsp;received the same brief but processed it through its own multi-agent ideation pipeline \u2014 testing whether agentic orchestration outperforms raw model capability given identical inputs.\n\n\n\n\n\n<strong>Briefs.<\/strong>&nbsp;Each model generated ideas for&nbsp;<strong>14 global brands<\/strong>&nbsp;spanning different categories and creative challenges.\n\n\n\n\n\n<strong>Iterations.<\/strong>&nbsp;Each model\u2013brand combination was run&nbsp;<strong>3 times<\/strong>, producing independent idea generations to account for output variance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>4.2. Implementation details<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Scoring.<\/strong>&nbsp;Gemini 3 was selected as the LLM that powered our scoring sub-agents. Rather than relying on side-by-side LLM comparisons (which can be inconsistent), the Creativity Evaluation Agent judged every idea&nbsp;<strong>independently<\/strong>, generating a structured creativity report with raw scores across all frameworks. This independent-scoring approach means each idea has a self-contained evaluation record that can be compared post hoc, eliminating ordering effects that plague pairwise LLM judging.\n\n\n\n\n\n<strong>Match simulation.<\/strong>&nbsp;An orchestration engine simulated head-to-head matches by computing the&nbsp;<strong>normalised score average<\/strong>&nbsp;across all evaluated frameworks for each idea on the same brief. 
Normalisation was applied per-framework to prevent any single framework from dominating (e.g., WPP scores range 0\u201312 while UUU averages range 0\u20135). For each brief, every pair of models was matched: the model with the higher normalised average won the match and the other lost. An exact tie on normalised average was recorded as a draw in <a href=\"https:\/\/www.notion.so\/Creativity-evaluation-agent-technical-summary-3302494e8bd9808e9260ec014c474de2?pvs=21\">Glicko-2<\/a> (outcome = 0.5).\n\n\n\n\n\n<strong>Glicko-2 parameters.<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Parameter<\/strong><\/th><th><strong>Value<\/strong><\/th><th><strong>Rationale<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Initial rating (\u03bc\u2080)<\/td><td>1500<\/td><td>Standard Glicko-2 default<\/td><\/tr><tr><td>Initial rating deviation (RD\u2080)<\/td><td>350<\/td><td>Standard Glicko-2 default; reflects maximum uncertainty<\/td><\/tr><tr><td>System volatility (\u03c3)<\/td><td>0.06<\/td><td>Standard default; controls expected rating fluctuation per period<\/td><\/tr><tr><td>Convergence tolerance (\u03b5)<\/td><td>0.000001<\/td><td>For the iterative volatility update step<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 4: Glicko-2 parameter initialisation.\n\n\n\n\n\nRatings were updated after each complete round-robin cycle across all 14 briefs before proceeding to the next iteration. This means each &#8220;rating period&#8221; contained C(5,2) \u00d7 14 =&nbsp;<strong>140 matches<\/strong>&nbsp;(every pair of 5 models on every brief), and three rating periods were processed in sequence for the three iterations.\n\n\n\n\n\n<strong>Scale.<\/strong>&nbsp;Total matches: 3 iterations \u00d7 10 pairs \u00d7 14 briefs \u00d7 (1 match per pair-brief) =&nbsp;<strong>210 unique creative matches<\/strong>&nbsp;across the tournament per ranking method. 
For per-framework rankings, the same 210-match structure was applied but using the single-framework score (normalised) rather than the cross-framework composite.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>4.3 Per-framework leaderboards<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To understand&nbsp;<em>where<\/em>&nbsp;each model&#8217;s strengths and weaknesses lie, the same Glicko-2 tournament was run using each individual evaluation framework&#8217;s scores as the match-outcome criterion. The results reveal meaningfully different competitive profiles across creative dimensions. We omit the WPP and aggregate Glicko-2 ratings since they are available in the executive summary. We also omit the Semiotics and OSCAI ratings because of their high scoring variance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>4.3.1 FFE (Fluency, Flexibility, Elaboration)<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Rank<\/strong><\/th><th><strong>Player<\/strong><\/th><th><strong>Rating<\/strong><\/th><th><strong>RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td>\ud83e\udd47<\/td><td>GPT-5<\/td><td>1895<\/td><td>88.4<\/td><\/tr><tr><td>\ud83e\udd48<\/td><td><strong>Creative Brain<\/strong>&nbsp;(Gemini 3)<\/td><td>1651<\/td><td>71.8<\/td><\/tr><tr><td>\ud83e\udd49<\/td><td>Claude Sonnet 4.5<\/td><td>1583<\/td><td>72.2<\/td><\/tr><tr><td>4<\/td><td>Gemini 3<\/td><td>1358<\/td><td>77.8<\/td><\/tr><tr><td>5<\/td><td>Gemini 2.5<\/td><td>1285<\/td><td>74.1<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 5: Glicko-2 ratings on the FFE score.\n\n\n\n\n\nGPT-5 leads comfortably on FFE metrics, 244 points ahead of Creative Brain. The gap between Creative Brain and Claude Sonnet 4.5 is narrow (~68 points), suggesting comparable idea elaboration depth. 
All rating deviation (RD) values are below 89, indicating stable ratings.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>4.3.2 UOS (Usefulness, Originality, Surprise)<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Rank<\/strong><\/th><th><strong>Player<\/strong><\/th><th><strong>Rating<\/strong><\/th><th><strong>RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td>\ud83e\udd47<\/td><td><strong>Creative Brain<\/strong>&nbsp;(Gemini 3)<\/td><td>2006<\/td><td>95.5<\/td><\/tr><tr><td>\ud83e\udd48<\/td><td>GPT-5<\/td><td>1657<\/td><td>83.6<\/td><\/tr><tr><td>\ud83e\udd49<\/td><td>Gemini 3<\/td><td>1567<\/td><td>81.7<\/td><\/tr><tr><td>4<\/td><td>Claude Sonnet 4.5<\/td><td>1283<\/td><td>84.7<\/td><\/tr><tr><td>5<\/td><td>Gemini 2.5<\/td><td>1009<\/td><td>129.0<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 6: Elo ratings on UOS score.\n\n\n\n\n\nCreative Brain dominates originality, with a&nbsp;<strong>349-point lead<\/strong>&nbsp;over GPT-5 \u2014 the second-widest gap between the top two players in any framework, after UUU (375 points). Notably, standalone Gemini 3 ranks 3rd here (above Claude Sonnet 4.5), suggesting the base model has latent originality that Creative Brain&#8217;s orchestration amplifies dramatically. 
Gemini 2.5&#8217;s elevated RD (129.0) indicates volatile originality performance.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>4.3.3 UUU (Unexpected, Useful, Ultra-specific)<\/strong><\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Rank<\/strong><\/th><th><strong>Player<\/strong><\/th><th><strong>Rating<\/strong><\/th><th><strong>RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td>\ud83e\udd47<\/td><td><strong>Creative Brain<\/strong>&nbsp;(Gemini 3)<\/td><td>2028<\/td><td>107.5<\/td><\/tr><tr><td>\ud83e\udd48<\/td><td>GPT-5<\/td><td>1653<\/td><td>82.8<\/td><\/tr><tr><td>\ud83e\udd49<\/td><td>Gemini 3<\/td><td>1441<\/td><td>79.0<\/td><\/tr><tr><td>4<\/td><td>Claude Sonnet 4.5<\/td><td>1377<\/td><td>78.2<\/td><\/tr><tr><td>5<\/td><td>Gemini 2.5<\/td><td>958<\/td><td>100.2<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 7: Elo ratings on UUU score.\n\n\n\n\n\nCreative Brain achieves its highest absolute rating (2028) on UUU \u2014 a&nbsp;<strong>375-point lead<\/strong>&nbsp;over GPT-5. This framework rewards ideas that are simultaneously surprising&nbsp;<em>and<\/em>&nbsp;actionable, which aligns with the multi-agent pipeline&#8217;s design goal: push for unexpected angles while grounding them in executable detail. 
Creative Brain&#8217;s slightly elevated RD (107.5) suggests occasional variance, but the margin is decisive.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>4.4 Cross-framework analysis<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The per-framework breakdowns reveal distinct competitive profiles:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Player<\/strong><\/th><th><strong>FFE<\/strong><\/th><th><strong>UOS<\/strong><\/th><th><strong>UUU<\/strong><\/th><th><strong>WPP score<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Creative Brain<\/strong><\/td><td>1651 (2nd)<\/td><td><strong>2006 (1st)<\/strong><\/td><td><strong>2028 (1st)<\/strong><\/td><td><strong>1889 (1st)<\/strong><\/td><\/tr><tr><td><strong>GPT-5<\/strong><\/td><td><strong>1895 (1st)<\/strong><\/td><td>1657 (2nd)<\/td><td>1653 (2nd)<\/td><td>1858 (2nd)<\/td><\/tr><tr><td><strong>Claude Sonnet 4.5<\/strong><\/td><td>1583 (3rd)<\/td><td>1283 (4th)<\/td><td>1377 (4th)<\/td><td>1529 (3rd)<\/td><\/tr><tr><td><strong>Gemini 3<\/strong><\/td><td>1358 (4th)<\/td><td>1567 (3rd)<\/td><td>1441 (3rd)<\/td><td>1169 (4th)<\/td><\/tr><tr><td><strong>Gemini 2.5<\/strong><\/td><td>1285 (5th)<\/td><td>1009 (5th)<\/td><td>958 (5th)<\/td><td>962 (5th)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 8: Aggregate Elo ratings.\n\n\n\n\n\n<strong>Key patterns:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Creative Brain leads on most frameworks.<\/strong>&nbsp;It ranks 1st on UOS, UUU, and WPP score, and drops to 2nd only on FFE.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>GPT-5 is the second-strongest player overall.<\/strong>&nbsp;It ranks 1st or 2nd on every single framework. 
Its weakest showing is 2nd place on UOS, UUU and WPP score, behind Creative Brain.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Claude Sonnet 4.5 has a spiky profile.<\/strong>&nbsp;Competitive on FFE (3rd, close to Creative Brain), but drops to 4th on UOS and UUU. This suggests its outputs are well-elaborated but less likely to produce unexpected or surprising creative leaps.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Gemini 3 benefits substantially from the complex orchestration that Creative Brain introduces.<\/strong>&nbsp;Across every framework, Creative Brain outperforms standalone Gemini 3 \u2014 the smallest gap is ~293 points (FFE) and the largest is ~720 points (WPP score).<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>4.5 Rating deviation &amp; confidence<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Rating deviation (RD) indicates how confident the system is in each player&#8217;s rating \u2014 lower RD means more predictable performance and a more stable estimate.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Player<\/strong><\/th><th><strong>Avg RD<\/strong><\/th><th><strong>Min RD<\/strong><\/th><th><strong>Max RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Creative Brain<\/strong><\/td><td>89.9<\/td><td>71.8 (FFE)<\/td><td>107.5 (UUU)<\/td><\/tr><tr><td><strong>GPT-5<\/strong><\/td><td>86.8<\/td><td>82.8 (UUU)<\/td><td>92.3 (WPP)<\/td><\/tr><tr><td><strong>Claude Sonnet 4.5<\/strong><\/td><td>79.8<\/td><td>72.2 (FFE)<\/td><td>84.7 (UOS)<\/td><\/tr><tr><td><strong>Gemini 3<\/strong><\/td><td>82.3<\/td><td>77.8 (FFE)<\/td><td>90.7 (WPP)<\/td><\/tr><tr><td><strong>Gemini 2.5<\/strong><\/td><td>101.9<\/td><td>74.1 (FFE)<\/td><td>129.0 (UOS)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Table 9: Mean, minimum and maximum RD of the LLM players across FFE, UOS, UUU and WPP score ratings.\n\n\n\n\n\nAll top-three players converged to RD values 
below 96 on every framework (with the exception of Creative Brain&#8217;s 107.5 on UUU). Gemini 2.5&#8217;s RD reaches 129.0 on UOS, indicating that its originality performance is especially unpredictable \u2014 consistent with its higher error rate observed during calibration (Section 4.1.1).\n\n\n\n\n\nClaude Sonnet 4.5 has the lowest average RD (79.8), meaning its performance is the&nbsp;<em>most predictable<\/em>&nbsp;of all players \u2014 it reliably delivers a certain quality level even if that ceiling is lower than GPT-5 or Creative Brain on some dimensions.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>4.6 Limitations<\/strong><\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Judge model bias.<\/strong>&nbsp;All evaluations were performed using a single LLM as the judge in each scoring sub-agent. While calibrated against human ground truth, any systematic blind spots in the underlying LLM could advantage or disadvantage specific players. Future work should include multi-judge ensembles.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Prompt parity vs. system parity.<\/strong>&nbsp;Creative Brain receives the same brief as other players but processes it through a multi-agent pipeline \u2014 it does more inference work per idea. The tournament tests&nbsp;<em>system-level<\/em>&nbsp;creative output, not cost-normalised or latency-normalised performance.<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Framework coverage.<\/strong>&nbsp;Semiotics and OSCAI were excluded from per-framework Elo analysis due to high scoring variance (Section 4.3).<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Brief diversity.<\/strong>&nbsp;14 briefs span a meaningful range of categories but may not cover all creative challenge types (e.g. 
non-English markets).<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Three iterations.<\/strong>&nbsp;While sufficient for Glicko-2 convergence to low RD in most cases, additional iterations would further tighten confidence intervals.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">5. Conclusions<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The <strong>Creativity Evaluation Agent<\/strong> is deployed and usable via UI and API, and it produces reliable results. The multi-framework approach improves coverage and gives more actionable feedback than a single aggregate score. The path forward includes continued validation against broader and more diverse human panels, expansion of the tournament to track how model capabilities evolve across releases, and integration of the evaluation agent directly into creative workflows \u2014 not as a post-hoc judge, but as a real-time collaborator that scores, critiques, and refines ideas within the generation loop itself.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Creative ideas are the primary driver of advertising impact, yet evaluating them at scale remains stubbornly subjective \u2014 human panels are expensive, slow, inconsistent across evaluators, and impossible to run repeatedly as the volume of AI-generated concepts grows. The core problem is that creativity is multidimensional: a single aggregate score fails to capture whether an [&hellip;]<\/p>\n","protected":false},"author":18,"featured_media":0,"template":"","meta":{"_acf_changed":false},"tags":[],"content_types":[{"id":51,"name":"Technical Report","slug":"technical-walkthrough"}],"ppma_author":[{"id":18,"display_name":"Andreas Stergioulas","first_name":"Andreas","last_name":"Stergioulas","nickname":"andreas.stergioulas","user_nicename":"andreas-stergioulas","user_email":"andreas.stergioulas@satalia.com","biographical_info":"Andreas Stergioulas is a Senior Data Scientist. 
He specializes in LLM-based agentic architectures, generative models, and computer vision for large-scale enterprise applications. Currently working at Satalia (WPP Group), he designs and deploys production-grade AI solutions \u2014 including custom multi-agent LLM systems and diffusion-based image generation pipelines \u2014 for globally recognized clients. He holds an M.Sc. in Electrical and Computer Engineering and is the author of peer-reviewed publications in venues such as CVPR Workshops and IEEE Transactions on Multimedia.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/E05G03JGGFJ-U06MMTF43HT-f2cf3ee352bb-512.jpeg","job_title":"Data Scientist","is_lead":null,"display_as_researcher":null,"order_priority":null},{"id":20,"display_name":"Anastasios Stamoulakatos","first_name":"Anastasios","last_name":"Stamoulakatos","nickname":"anastasios.stamoulakatos","user_nicename":"anastasios-stamoulakatos","user_email":"anastasios.stamoulakatos@satalia.com","biographical_info":"Anastasios (Tasos) Stamoulakatos is a Data Scientist at Satalia (WPP), focusing on agentic AI solutions for marketing. His work spans multi-agent systems, RAG and GraphRAG, and image retrieval, developing scalable AI solutions from early-stage POCs to production. 
He holds a PhD in Applied AI and Computer Vision from the University of Strathclyde and has over four years of commercial experience across industries including marketing, agriculture, pharmaceuticals, oil and gas, and manufacturing, with a strong focus on applied research and turning complex AI into practical business value.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/headshot_small.jpg","job_title":"Data Scientist","is_lead":null,"display_as_researcher":null,"order_priority":null}],"class_list":["post-1654","research_feed","type-research_feed","status-publish","hentry","content_type-technical-walkthrough"],"acf":{"content":"<p><!-- wp:paragraph {\"className\":\"is-style-text-annotation\"} --><\/p>\n<p class=\"is-style-text-annotation\"><em>Creative ideas are the primary driver of advertising impact, yet evaluating them at scale remains stubbornly subjective \u2014 human panels are expensive, slow, inconsistent across evaluators, and impossible to run repeatedly as the volume of AI-generated concepts grows. The core problem is that creativity is multidimensional: a single aggregate score fails to capture whether an idea is original, strategically aligned, culturally resonant, or memorable, and without a shared, repeatable rubric, teams cannot meaningfully compare outputs across models, prompts, or campaigns. To address this, we built the <strong>Creativity Evaluation Agent<\/strong>, which scores marketing ideas in parallel across six established creativity frameworks \u2014 an internal WPP score, UOS, FFE, UUU, OSCAI, and Semiotics \u2014 using specialised Large Language Model (LLM) sub-agents with critic-refiner loops to ensure consistency, returning dimension-level scores alongside qualitative commentary in a single structured report. 
Calibrated against human expert ground truth, the system achieved a scoring error as low as 0.7 points (Gemini 3) with high repeatability on the internal WPP score framework (\u03c3 \u2248 0.21), and in a 210-match tournament across 14 global brands, it reliably differentiated creative quality between five frontier models \u2014 revealing that a specialised agentic creative system consistently outperformed vanilla LLMs given the same brief, giving marketing teams a fast, interpretable, and auditable way to benchmark and iterate on creative output before committing production resources.<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>This document details the technical architecture, calibration methodology, and experimental design underlying the system built to address these gaps. For results and strategic findings, read&nbsp;our blog post instead.<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:separator --><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator --><\/p>\n<p><!-- wp:heading {\"level\":1} --><\/p>\n<h1 class=\"wp-block-heading\">Problem statement and motivation<\/h1>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Everyone agrees creativity matters in marketing. Nobody agrees on how to measure it.<\/p>\n<p>Put the same campaign idea in front of five reviewers and you&#8217;ll get five different scores. One loves the visual metaphor, another thinks the tagline falls flat, a third is just tired after reviewing thirty concepts before lunch. The scores reflect taste and circumstance as much as they reflect the work. This is fine when you&#8217;re picking between two finalist campaigns in a boardroom \u2014 it falls apart the moment you need to evaluate at scale.<\/p>\n<p>And scale is exactly what modern marketing demands. Teams are generating more ideas than ever, increasingly with the help of generative AI. 
They need to&nbsp;<strong>screen hundreds of concepts quickly<\/strong>,&nbsp;<strong>understand what specifically makes one idea stronger than another<\/strong>, and&nbsp;<strong>benchmark creative output<\/strong>&nbsp;across different models, prompts, teams, and time periods. A human review panel can do the first job slowly, the second job inconsistently, and the third job barely at all\u2014while being expensive to convene every time.<\/p>\n<p>The core issues are straightforward:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Subjectivity<\/strong>&nbsp;\u2014 without a shared rubric, two reviewers scoring the same idea can land in completely different places.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Scalability<\/strong>&nbsp;\u2014 manual evaluation doesn&#8217;t survive contact with hundreds of ideas per sprint.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Feedback quality<\/strong>&nbsp;\u2014 a score without explanation is useless for iteration; explanations vary wildly across evaluators.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Cost and repeatability<\/strong>&nbsp;\u2014 assembling expert panels is slow and expensive, and running the same panel twice doesn&#8217;t guarantee the same results.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>What&#8217;s missing is a system that can apply&nbsp;<strong>structured, reproducible, explainable<\/strong>&nbsp;creativity assessment across large volumes of work \u2014 fast enough to be useful and consistent enough to be 
trusted.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":1} --><\/p>\n<h1 class=\"wp-block-heading\">1. Introduction and solution overview<\/h1>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Evaluating marketing creativity at scale demands more than a single score from a single judge. The&nbsp;<strong>Creativity Evaluation Agent<\/strong>, built on&nbsp;<a href=\"https:\/\/google.github.io\/adk-docs\/\">Google&#8217;s Agent Development Kit (ADK)<\/a>, extends the established&nbsp;<em>LLM-as-a-Judge<\/em>&nbsp;paradigm by introducing a multi-agent system in which specialised sub-agents score marketing ideas across&nbsp;<strong>six complementary creativity frameworks<\/strong>, each covering a distinct slice of what practitioners consider &#8220;good creativity.&#8221;<\/p>\n<p>Every scoring sub-agent is grounded through&nbsp;<strong>few-shot examples<\/strong>&nbsp;that teach the underlying LLM how the creative dimension it owns should be measured, narrowing the gap between automated and human judgement. The system accepts&nbsp;<strong>text, image, video, and PDF<\/strong>&nbsp;inputs, runs framework evaluations in parallel, and returns dimension-level scores together with qualitative commentary. It is accessible through both an&nbsp;<strong>API<\/strong>&nbsp;and a&nbsp;<strong>web UI<\/strong>.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":1} --><\/p>\n<h1 class=\"wp-block-heading\">2. Technical approach<\/h1>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\">2.1. Architecture overview<\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>At a high level, a user&#8217;s idea is received by a&nbsp;<strong>Root Agent<\/strong>, which parses the input and routes it to a&nbsp;<strong>Dynamic Parallel Orchestrator<\/strong>. 
The orchestrator spins up only the scoring pipelines the user has requested, runs them concurrently, and hands their outputs to a&nbsp;<strong>Report Agent<\/strong>&nbsp;that merges everything into a single structured JSON response.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:image {\"id\":1781,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} --><\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"502\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-1024x502.jpg\" alt=\"\" class=\"wp-image-1781\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-1024x502.jpg 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-300x147.jpg 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-768x377.jpg 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-1536x754.jpg 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-2048x1005.jpg 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 1: Architecture of the Creativity Evaluation Agent.<\/figcaption><\/figure>\n<p><!-- \/wp:image --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\">2.2. Custom orchestration engine<\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The core implementation is the&nbsp;<strong><code>Creativity Evaluation Agent<\/code><\/strong>, a custom ADK&nbsp;<code>BaseAgent<\/code>. Its behaviour breaks down as follows:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Root Agent.<\/strong>&nbsp;The user-facing entry point. It can answer questions, hold a conversation, and \u2014 when the user supplies a creative idea \u2014 forward it for evaluation. 
It decides&nbsp;<em>which<\/em>&nbsp;frameworks to invoke based on the user&#8217;s request.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Dynamic pipeline construction.<\/strong>&nbsp;Pipelines are&nbsp;<em>not<\/em>&nbsp;built ahead of time. Based on what the user asks for, the orchestrator assembles only the relevant evaluation chains, then executes them&nbsp;<strong>in parallel<\/strong>.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Critic\u2013refiner loop.<\/strong>&nbsp;After initial scoring, each pipeline runs a bounded critic\u2013refiner cycle (up to&nbsp;<strong>two iterations<\/strong>) in which a critic agent reviews the scores for obvious errors or inconsistencies. If the critic flags an issue, the refiner adjusts before the result is finalised.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Report Agent.<\/strong>&nbsp;Once all pipelines complete, this agent compiles dimension-level scores and qualitative commentary into a single, consistently formatted output. When the user has submitted multiple ideas, the report includes a&nbsp;<strong>comparative analysis<\/strong>.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Each scoring pipeline is built around an&nbsp;<code>LlmAgent<\/code>&nbsp;instance initialised with a detailed system message encoding its evaluation lens, together with&nbsp;<strong>few-shot examples<\/strong>&nbsp;that anchor outputs close to human scoring behaviour. Scores are emitted as&nbsp;<strong>continuous values<\/strong>&nbsp;(e.g. 1.2, 3.7) rather than discrete integers, matching the granularity of the ground-truth datasets. 
The six pipelines are:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>WPP Score Pipeline.<\/strong>&nbsp;Scores ideas against a proprietary WPP creativity framework built around four dimensions:<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li>how sharply the idea frames the business challenge, not just the marketing opportunity<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li>how boldly it challenges category convention and subverts clich\u00e9s<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li>how authentically the proposed solution fits the brand and resonates with the audience<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li>the scale of measurable growth and emotional response it is designed to deliver<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li>Each dimension is scored 0\u20133 points, and the four scores are summed into an index ranging 0\u201312. The framework was calibrated against real-world campaign performance across multiple brands and markets. 
Few-shot examples are drawn from WPP&#8217;s internal archive of historically scored campaigns.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong><a href=\"https:\/\/arxiv.org\/abs\/2510.04009\">Usefulness, Originality and Surprise<\/a> (UOS) Pipeline.<\/strong> Evaluates the classic definition of divergent creative value through three dimensions:&nbsp;<strong>Usefulness<\/strong>&nbsp;(does it solve a real problem and align with the brief&#8217;s constraints?),&nbsp;<strong>Originality<\/strong>&nbsp;(does it approach the problem in a novel way?), and&nbsp;<strong>Surprise<\/strong>&nbsp;(does it deliver an unexpected twist that captures attention?). The three dimension scores are aggregated into an overall UOS score.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong><a href=\"https:\/\/arxiv.org\/pdf\/2401.12491\">Fluency, Flexibility and Elaboration (FFE) Pipeline<\/a>.<\/strong> Quantifies the &#8220;mental engine&#8221; behind the idea through three dimensions:&nbsp;<strong>Fluency<\/strong>&nbsp;(how many distinct, relevant ideas are presented),&nbsp;<strong>Flexibility<\/strong>&nbsp;(how many different conceptual categories are explored), and&nbsp;<strong>Elaboration<\/strong>&nbsp;(how richly detailed and refined the idea is). Grounded in creativity research literature and benchmarked against marketing creativity datasets. 
Few-shot examples are generated using <a href=\"https:\/\/en.wikipedia.org\/wiki\/Torrance_Tests_of_Creative_Thinking\">Torrance-style<\/a> divergent thinking tasks (see Section 4.1).<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong><a href=\"https:\/\/arxiv.org\/pdf\/2509.09702\">Unique, Unexpected and Unforgettable (UUU) Pipeline<\/a>.<\/strong> Assesses brand longevity through the lens of a Creative Strategist:&nbsp;<strong>Unique<\/strong>&nbsp;(could only this idea deliver this message in this way?),&nbsp;<strong>Unexpected<\/strong>&nbsp;(does it subvert expectations and force re-evaluation?), and&nbsp;<strong>Unforgettable<\/strong>&nbsp;(does it create a defining moment that lives rent-free in the audience&#8217;s mind?). Each dimension is scored as a continuous value and averaged into an overall UUU score.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong><a href=\"https:\/\/www.sciencedirect.com\/science\/article\/abs\/pii\/S1871187123001256?via%3Dihub\">OSCAI<\/a> Pipeline.<\/strong>&nbsp;A two-stage pipeline for measuring conceptual distance. First, a sub-agent extracts&nbsp;<strong>semantic relations<\/strong>&nbsp;from the idea (e.g.,&nbsp;<em>man \u2192 eats \u2192 apple<\/em>). Those relations are sent to the&nbsp;<strong>OSCAI API<\/strong>, maintained by the framework&#8217;s original authors, which scores each relation&#8217;s originality \u2014 distinguishing mundane links (<em>a chef cooks dinner<\/em>) from highly original relationships (<em>a clown teaches mathematics<\/em>). 
The returned scores quantify the creative leap at the heart of the idea.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Semiotics Pipeline.<\/strong>&nbsp;Applies the <a href=\"https:\/\/link.springer.com\/chapter\/10.1007\/978-1-4757-9700-8_3\">Saussurean principles<\/a> of sign systems to decode how meaning is constructed through cultural symbols. The sub-agent analyses&nbsp;<strong>Denotation<\/strong>&nbsp;(literal content),&nbsp;<strong>Connotation<\/strong>&nbsp;(implied meaning),&nbsp;<strong>Myth<\/strong>&nbsp;(cultural narratives reinforced or challenged), the&nbsp;<strong>Semiotic Relation<\/strong>&nbsp;(additive, contradictory, etc.),&nbsp;<strong>Risks or tensions<\/strong>, and produces a&nbsp;<strong>Semiotic Coherence Score<\/strong>&nbsp;(0\u20133). Unlike the other pipelines, no few-shot examples are used \u2014 evaluation relies on the model&#8217;s inherent understanding of semiotic theory.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>2.3. 
Score normalisation<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The six frameworks operate on different native scales:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false} --><\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th><strong>Framework<\/strong><\/th>\n<th><strong>Native scoring<\/strong><\/th>\n<th><strong>Range<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>WPP Score<\/td>\n<td>Sum of 4 dimensions, each 0\u20133<\/td>\n<td>0\u201312<\/td>\n<\/tr>\n<tr>\n<td>FFE<\/td>\n<td>Average of 3 dimensions, each 0\u20133<\/td>\n<td>0\u20133<\/td>\n<\/tr>\n<tr>\n<td>UOS<\/td>\n<td>Average of 3 dimensions, each 0\u20135<\/td>\n<td>0\u20135<\/td>\n<\/tr>\n<tr>\n<td>UUU<\/td>\n<td>Average of 3 dimensions, each 0\u20135<\/td>\n<td>0\u20135<\/td>\n<\/tr>\n<tr>\n<td>OSCAI<\/td>\n<td>Single score<\/td>\n<td>0\u20135<\/td>\n<\/tr>\n<tr>\n<td>Semiotics<\/td>\n<td>Coherence score<\/td>\n<td>0\u20133<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 1: Creativity scores and their range.<\/p>\n<p>Direct comparison or summation across frameworks is misleading without normalisation. 
The following procedure is applied:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list {\"ordered\":true} --><\/p>\n<ol class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Convert sums to averages.<\/strong>&nbsp;The WPP Score (a sum of four 0\u20133 dimensions) is divided by 4 to produce a 0\u20133 average, making it structurally comparable to other averaged scores.<\/li>\n<p><!-- \/wp:list-item --><\/ol>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list {\"ordered\":true} --><\/p>\n<ol class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Rescale to a common 0\u201310 range.<\/strong>&nbsp;Each framework&#8217;s score is divided by its native maximum and multiplied by 10: <code>normalised_score = (raw_score \/ max_score) \u00d7 10<\/code><\/li>\n<p><!-- \/wp:list-item --><\/ol>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list {\"ordered\":true} --><\/p>\n<ol class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Composite total.<\/strong>&nbsp;The 4 normalised pillar scores (WPP, FFE, UOS, UUU) are summed into a composite total with a&nbsp;<strong>maximum of 40<\/strong>, each pillar contributing equally.<\/li>\n<p><!-- \/wp:list-item --><\/ol>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>OSCAI and Semiotics<\/strong>&nbsp;are reported as standalone scores and are&nbsp;<strong>not<\/strong>&nbsp;included in the composite total. This decision was made because OSCAI depends on an external API with different reliability characteristics, and both OSCAI and Semiotics showed higher inter-run variability (\u03c3 \u2248 0.80, see Section 4.3), which would add noise to the composite.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>2.4. 
Infrastructure and deployment<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false} --><\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th><strong>Concern<\/strong><\/th>\n<th><strong>Technology<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Multi-agent orchestration<\/td>\n<td><a href=\"https:\/\/google.github.io\/adk-docs\/\">ADK<\/a><\/td>\n<\/tr>\n<tr>\n<td>Compute<\/td>\n<td><a href=\"https:\/\/cloud.google.com\/run\">Google Cloud Run<\/a><\/td>\n<\/tr>\n<tr>\n<td>LLM inference<\/td>\n<td><a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex AI<\/a>&nbsp;\u2014&nbsp;<strong>Gemini 3 Pro<\/strong>&nbsp;as the primary judge model<\/td>\n<\/tr>\n<tr>\n<td>Observability<\/td>\n<td><a href=\"https:\/\/cloud.google.com\/logging\">Cloud Logging<\/a>&nbsp;+&nbsp;<a href=\"https:\/\/docs.cloud.google.com\/trace\/docs\">Cloud Trace<\/a><\/td>\n<\/tr>\n<tr>\n<td>Container registry<\/td>\n<td><a href=\"https:\/\/docs.cloud.google.com\/artifact-registry\/docs\/overview\">Artifact Registry<\/a><\/td>\n<\/tr>\n<tr>\n<td>Agent-to-agent protocol<\/td>\n<td><a href=\"https:\/\/a2a-protocol.org\/latest\/\">A2A<\/a><\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 2: Employed Google tech stack.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\">3. Ground truth data and system evaluation<\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>3.1 Dataset overview<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Reliable automated scoring requires credible ground truth. 
Because no single public dataset covers all six frameworks, a combination of&nbsp;<strong>historical data<\/strong>&nbsp;and&nbsp;<strong>synthetically generated ground truth<\/strong>&nbsp;was used.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>3.2. WPP score \u2014 historical human judgements<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The ground-truth dataset comes from&nbsp;<strong>WPP&#8217;s internal archive<\/strong>&nbsp;of marketing campaign ideas submitted between 2020 and 2023. Creative professionals scored each idea across the WPP dimensions. From this corpus,&nbsp;<strong>6 scored ideas<\/strong>&nbsp;were selected as few-shot examples for the scoring sub-agent and&nbsp;<strong>10 additional ideas<\/strong>&nbsp;were reserved for its critic agent.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>3.3. FFE \u2014 synthetic data via Torrance-style tasks<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Few-shot examples for the FFE framework were generated using&nbsp;<strong>Gemini<\/strong>&nbsp;prompted with tasks modelled on the&nbsp;<a href=\"https:\/\/psycnet.apa.org\/doiLanding?doi=10.1037%2Ft05532-000\">Torrance Tests of Creative Thinking<\/a>. Each example pairs a divergent-thinking task, a response, a score, and a justification. 
For instance:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:quote --><\/p>\n<blockquote class=\"wp-block-quote\"><p><!-- wp:paragraph --><\/p>\n<p><strong>Example 1 (Score 0):<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Task:<\/strong>&nbsp;Please list unusual uses of a plastic bottle.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Response:<\/strong>&nbsp;1. Plant a seed in it. 2. Use it to water plants by poking holes. 3. Cut it in half to make a small planter. 4. Use it to store extra fertiliser.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Justification:<\/strong>&nbsp;All ideas fall under a single, narrow category (Gardening \/ Horticulture). No conceptual shift is demonstrated.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p><\/blockquote>\n<p><!-- \/wp:quote --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>3.4. 
UOS &amp; UUU \u2014 community-sourced creative writing<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>No pre-existing ground truth was available for these two frameworks, so it was constructed in three steps:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list {\"ordered\":true} --><\/p>\n<ol class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Source corpus.<\/strong>&nbsp;The&nbsp;<a href=\"https:\/\/huggingface.co\/datasets\/euclaise\/WritingPrompts_preferences\">Creative Storytelling dataset<\/a>&nbsp;(stories from r\/WritingPrompts on Hugging Face) was used as raw material.<\/li>\n<p><!-- \/wp:list-item --><\/ol>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list {\"ordered\":true} --><\/p>\n<ol class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Quality stratification.<\/strong>&nbsp;Stories were sorted by upvotes;&nbsp;<strong>11 highly upvoted<\/strong>&nbsp;and&nbsp;<strong>11 low-voted<\/strong>&nbsp;examples were selected to represent the ends of the quality spectrum.<\/li>\n<p><!-- \/wp:list-item --><\/ol>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list {\"ordered\":true} --><\/p>\n<ol class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Automated annotation.<\/strong>&nbsp;These 22 stories, together with the formal definitions of the UOS and UUU dimensions, were fed to Gemini, which produced scored examples that serve as the few-shot ground truth for both frameworks.<\/li>\n<p><!-- \/wp:list-item --><\/ol>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>3.5. Scoring format<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>All scoring sub-agents output&nbsp;<strong>continuous values<\/strong>&nbsp;(e.g. 1.2, 2.8) rather than rounding to integers. 
This decision was made to stay consistent with the WPP ground-truth scores, which are themselves continuous, and to preserve finer-grained distinctions between ideas.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\">3.6. Variability analysis<\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Reliability was assessed on 5 campaign ideas (one per AI model) generated for a popular beverage brand. The benchmark agent rated each idea 3 times, and the standard deviation (std) across those 3 runs was computed per score. Each bar in Figure 2 shows the average std across all 5 models, so taller bars mean the agent scores that dimension less consistently.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li>The FFE (Fluency, Flexibility and Elaboration) score shows moderate variability, with an average std of ~0.3. Examining each constituent aspect of the FFE score showed that this deviation is driven mainly by Fluency\u2019s variance. This may be due to how Fluency is defined, which is \u201cEvaluate how many distinct, relevant ideas or solutions are presented. 
Count only meaningful and contextually appropriate ones (avoid repetition or vague statements).\u201d This makes it an inherently count-based metric, where the boundary between &#8220;distinct&#8221; and &#8220;overlapping&#8221; ideas introduces subjective judgment for an LLM.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li>OSCAI &amp; Semiotics show higher variability, with an average std of ~0.8, which directly motivated their exclusion from the composite tournament score.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:image {\"id\":1783,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} --><\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"614\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/2_small-1024x614.png\" alt=\"\" class=\"wp-image-1783\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/2_small-1024x614.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/2_small-300x180.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/2_small-768x461.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/2_small.png 1200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 2: Average score variability of each scoring sub-agent.<\/figcaption><\/figure>\n<p><!-- \/wp:image --><\/p>\n<p><!-- wp:image {\"id\":1784,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} --><\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"639\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/3_small-1024x639.png\" alt=\"\" class=\"wp-image-1784\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/3_small-1024x639.png 1024w, 
https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/3_small-300x187.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/3_small-768x479.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/3_small.png 1200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">Figure 3: Score variability for the Fluency, Flexibility and Elaboration aspects that the FFE score measures.<\/figcaption><\/figure>\n<p><!-- \/wp:image --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>4. LLM evaluation tournament<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Full tournament results and key findings are covered in the blog post. This section documents the experimental design, technical implementation, per-framework results, and supplementary analysis.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>4.1 Experimental design<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Players.<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false} --><\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th><strong>Player<\/strong><\/th>\n<th><strong>Description<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>GPT-5<\/strong><\/td>\n<td>Standalone, standardised prompt<\/td>\n<\/tr>\n<tr>\n<td><strong>Gemini 3<\/strong><\/td>\n<td>Standalone, standardised prompt<\/td>\n<\/tr>\n<tr>\n<td><strong>Gemini 2.5<\/strong><\/td>\n<td>Standalone, standardised prompt<\/td>\n<\/tr>\n<tr>\n<td><strong>Claude Sonnet 4.5<\/strong><\/td>\n<td>Standalone, standardised prompt<\/td>\n<\/tr>\n<tr>\n<td><strong>Creative Brain<\/strong><\/td>\n<td>WPP&#8217;s multi-agent ideation system, built on Gemini 3<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- 
wp:paragraph --><\/p>\n<p>Table 3: The LLMs that took part in the evaluation tournament.<\/p>\n<p><strong>Prompt design.<\/strong>&nbsp;Four standalone models received a standardised, neutral prompt to ensure a level playing field:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:quote --><\/p>\n<blockquote class=\"wp-block-quote\"><p><!-- wp:paragraph --><\/p>\n<p><em>&#8220;Give me a creative marketing idea\/campaign based on the brief. Your output must have a title and three sections: Challenge, Core Idea, and Execution.&#8221;<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p><\/blockquote>\n<p><!-- \/wp:quote --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The&nbsp;<strong>Creative Brain<\/strong>&nbsp;received the same brief but processed it through its own multi-agent ideation pipeline \u2014 testing whether agentic orchestration outperforms raw model capability given identical inputs.<\/p>\n<p><strong>Briefs.<\/strong>&nbsp;Each model generated ideas for&nbsp;<strong>14 global brands<\/strong>&nbsp;spanning different categories and creative challenges.<\/p>\n<p><strong>Iterations.<\/strong>&nbsp;Each model\u2013brand combination was run&nbsp;<strong>3 times<\/strong>, producing independent idea generations to account for output variance.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>4.2. Implementation details<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Scoring.<\/strong>&nbsp;Gemini 3 was selected as the LLM that powered our scoring sub-agents. Rather than relying on side-by-side LLM comparisons (which can be inconsistent), the Creativity Evaluation Agent judged every idea&nbsp;<strong>independently<\/strong>, generating a structured creativity report with raw scores across all frameworks. 
This independent-scoring approach means each idea has a self-contained evaluation record that can be compared post hoc, eliminating ordering effects that plague pairwise LLM judging.<\/p>\n<p><strong>Match simulation.<\/strong>&nbsp;An orchestration engine simulated head-to-head matches by computing the&nbsp;<strong>normalised score average<\/strong>&nbsp;across all evaluated frameworks for each idea on the same brief. Normalisation was applied per-framework to prevent any single framework from dominating (e.g., WPP scores range 0\u201312 while UUU averages range 1\u20135). For each brief, every pair of models was matched: the model with the higher normalised average won the match and the other lost. No tie-breaker was applied: in the event of an exact tie on normalised average, the match was recorded as a draw in <a href=\"https:\/\/www.notion.so\/Creativity-evaluation-agent-technical-summary-3302494e8bd9808e9260ec014c474de2?pvs=21\">Glicko-2<\/a> (outcome = 0.5).<\/p>\n<p><strong>Glicko-2 parameters.<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false} --><\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th><strong>Parameter<\/strong><\/th>\n<th><strong>Value<\/strong><\/th>\n<th><strong>Rationale<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>Initial rating (\u03bc\u2080)<\/td>\n<td>1500<\/td>\n<td>Standard Glicko-2 default<\/td>\n<\/tr>\n<tr>\n<td>Initial rating deviation (RD\u2080)<\/td>\n<td>350<\/td>\n<td>Standard Glicko-2 default; reflects maximum uncertainty<\/td>\n<\/tr>\n<tr>\n<td>Initial volatility (\u03c3)<\/td>\n<td>0.06<\/td>\n<td>Standard default; controls expected rating fluctuation per period<\/td>\n<\/tr>\n<tr>\n<td>Convergence tolerance (\u03b5)<\/td>\n<td>0.000001<\/td>\n<td>For the iterative volatility update step<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 4: Glicko-2 parameter initialisation.<\/p>\n<p>Ratings 
were updated after each complete round-robin cycle across all 14 briefs before proceeding to the next iteration. This means each &#8220;rating period&#8221; contained C(5,2) \u00d7 14 =&nbsp;<strong>140 matches<\/strong>&nbsp;(every pair of 5 models on every brief), and three rating periods were processed in sequence for the three iterations.<\/p>\n<p><strong>Scale.<\/strong>&nbsp;Total matches: 3 iterations \u00d7 10 pairs \u00d7 14 briefs \u00d7 (1 match per pair-brief) =&nbsp;<strong>210 unique creative matches<\/strong>&nbsp;across the tournament per ranking method. For per-framework rankings, the same 210-match structure was applied but using the single-framework score (normalised) rather than the cross-framework composite.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>4.3 Per-framework leaderboards<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>To understand&nbsp;<em>where<\/em>&nbsp;each model&#8217;s strengths and weaknesses lie, the same Glicko-2 tournament was run using each individual evaluation framework&#8217;s scores as the match-outcome criterion. The results reveal meaningfully different competitive profiles across creative dimensions. We omit the WPP and the aggregate Elo scores since they are available in the executive summary. 
Additionally we omit the Semiotics and OSCAI Elo scores due to their high scoring variance.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>4.3.1 FFE (Fluency, Flexibility, Elaboration)<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false} --><\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th><strong>Rank<\/strong><\/th>\n<th><strong>Player<\/strong><\/th>\n<th><strong>Rating<\/strong><\/th>\n<th><strong>RD<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\ud83e\udd47<\/td>\n<td>GPT-5<\/td>\n<td>1895<\/td>\n<td>88.4<\/td>\n<\/tr>\n<tr>\n<td>\ud83e\udd48<\/td>\n<td><strong>Creative Brain<\/strong>&nbsp;(Gemini 3)<\/td>\n<td>1651<\/td>\n<td>71.8<\/td>\n<\/tr>\n<tr>\n<td>\ud83e\udd49<\/td>\n<td>Claude Sonnet 4.5<\/td>\n<td>1583<\/td>\n<td>72.2<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>Gemini 3<\/td>\n<td>1358<\/td>\n<td>77.8<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td>Gemini 2.5<\/td>\n<td>1285<\/td>\n<td>74.1<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 5: Elo ratings on FFE score.<\/p>\n<p>GPT-5 leads comfortably on FFE metrics. The gap between Creative Brain and Claude Sonnet 4.5 is narrow (~68 points), suggesting comparable idea elaboration depth. 
All Rating deviation (RD) values are below 89, indicating stable ratings.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>4.3.2 UOS (Uniqueness, Originality, Surprise)<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false} --><\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th><strong>Rank<\/strong><\/th>\n<th><strong>Player<\/strong><\/th>\n<th><strong>Rating<\/strong><\/th>\n<th><strong>RD<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\ud83e\udd47<\/td>\n<td><strong>Creative Brain<\/strong>&nbsp;(Gemini 3)<\/td>\n<td>2006<\/td>\n<td>95.5<\/td>\n<\/tr>\n<tr>\n<td>\ud83e\udd48<\/td>\n<td>GPT-5<\/td>\n<td>1657<\/td>\n<td>83.6<\/td>\n<\/tr>\n<tr>\n<td>\ud83e\udd49<\/td>\n<td>Gemini 3<\/td>\n<td>1567<\/td>\n<td>81.7<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>Claude Sonnet 4.5<\/td>\n<td>1283<\/td>\n<td>84.7<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td>Gemini 2.5<\/td>\n<td>1009<\/td>\n<td>129.0<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 6: Elo ratings on UOS score.<\/p>\n<p>Creative Brain dominates originality, with a&nbsp;<strong>349-point lead<\/strong>&nbsp;over GPT-5 \u2014 the widest gap between the top two players in any framework. Notably, standalone Gemini 3 ranks 3rd here (above Claude Sonnet 4.5), suggesting the base model has latent originality that Creative Brain&#8217;s orchestration amplifies dramatically. 
Gemini 2.5&#8217;s elevated RD (129.0) indicates volatile originality performance.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>4.3.3 UUU (Unexpected, Useful, Ultra-specific)<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false} --><\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th><strong>Rank<\/strong><\/th>\n<th><strong>Player<\/strong><\/th>\n<th><strong>Rating<\/strong><\/th>\n<th><strong>RD<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>\ud83e\udd47<\/td>\n<td><strong>Creative Brain<\/strong>&nbsp;(Gemini 3)<\/td>\n<td>2028<\/td>\n<td>107.5<\/td>\n<\/tr>\n<tr>\n<td>\ud83e\udd48<\/td>\n<td>GPT-5<\/td>\n<td>1653<\/td>\n<td>82.8<\/td>\n<\/tr>\n<tr>\n<td>\ud83e\udd49<\/td>\n<td>Gemini 3<\/td>\n<td>1441<\/td>\n<td>79.0<\/td>\n<\/tr>\n<tr>\n<td>4<\/td>\n<td>Claude Sonnet 4.5<\/td>\n<td>1377<\/td>\n<td>78.2<\/td>\n<\/tr>\n<tr>\n<td>5<\/td>\n<td>Gemini 2.5<\/td>\n<td>958<\/td>\n<td>100.2<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 7: Elo ratings on UUU score.<\/p>\n<p>Creative Brain achieves its highest absolute rating (2028) on UUU \u2014 a&nbsp;<strong>375-point lead<\/strong>&nbsp;over GPT-5. This framework rewards ideas that are simultaneously surprising&nbsp;<em>and<\/em>&nbsp;actionable, which aligns with the multi-agent pipeline&#8217;s design goal: push for unexpected angles while grounding them in executable detail. 
Creative Brain&#8217;s slightly elevated RD (107.5) suggests occasional variance, but the margin is decisive.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>4.4 Cross-framework analysis<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The per-framework breakdowns reveal distinct competitive profiles:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false} --><\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th><strong>Player<\/strong><\/th>\n<th><strong>FFE<\/strong><\/th>\n<th><strong>UOS<\/strong><\/th>\n<th><strong>UUU<\/strong><\/th>\n<th><strong>WPP score<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Creative Brain<\/strong><\/td>\n<td>1651 (2nd)<\/td>\n<td><strong>2006 (1st)<\/strong><\/td>\n<td><strong>2028 (1st)<\/strong><\/td>\n<td><strong>1889 (1st)<\/strong><\/td>\n<\/tr>\n<tr>\n<td><strong>GPT-5<\/strong><\/td>\n<td><strong>1895 (1st)<\/strong><\/td>\n<td>1657 (2nd)<\/td>\n<td>1653 (2nd)<\/td>\n<td>1858 (2nd)<\/td>\n<\/tr>\n<tr>\n<td><strong>Claude Sonnet 4.5<\/strong><\/td>\n<td>1583 (3rd)<\/td>\n<td>1283 (4th)<\/td>\n<td>1377 (4th)<\/td>\n<td>1529 (3rd)<\/td>\n<\/tr>\n<tr>\n<td><strong>Gemini 3<\/strong><\/td>\n<td>1358 (4th)<\/td>\n<td>1567 (3rd)<\/td>\n<td>1441 (3rd)<\/td>\n<td>1169 (4th)<\/td>\n<\/tr>\n<tr>\n<td><strong>Gemini 2.5<\/strong><\/td>\n<td>1285 (5th)<\/td>\n<td>1009 (5th)<\/td>\n<td>958 (5th)<\/td>\n<td>962 (5th)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 8: Elo ratings and ranks per framework.<\/p>\n<p><strong>Key patterns:<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Creative Brain leads on most frameworks.<\/strong>&nbsp;It ranks 1st on UOS, UUU, and WPP score, dropping to 2nd only on FFE.<\/li>\n<p><!-- 
\/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>GPT-5 is the second best when it comes to creativity.<\/strong>&nbsp;It ranks 1st or 2nd on every single framework. Its weakest showing is 2nd place on UOS, UUU and WPP score, behind Creative Brain.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Claude Sonnet 4.5 has a spiked profile.<\/strong>&nbsp;Competitive on FFE (3rd, close to Creative Brain), but drops to 4th on UOS and UUU. This suggests its outputs are well-elaborated but less likely to produce unexpected or surprising creative leaps.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Gemini 3 benefits substantially from the complex orchestration that Creative Brain introduces.<\/strong>&nbsp;Across every framework, Creative Brain outperforms standalone Gemini 3 \u2014 the smallest gap is ~293 points (FFE) and the largest is ~720 points (WPP score).<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>4.5 Rating deviation &amp; confidence<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Rating deviation (RD) indicates how confident the system is in each player&#8217;s rating \u2014 lower RD means more predictable performance and a more stable estimate.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false} --><\/p>\n<figure class=\"wp-block-table\">\n<table>\n<thead>\n<tr>\n<th><strong>Player<\/strong><\/th>\n<th><strong>Avg RD<\/strong><\/th>\n<th><strong>Min RD<\/strong><\/th>\n<th><strong>Max RD<\/strong><\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Creative 
Brain<\/strong><\/td>\n<td>89.9<\/td>\n<td>71.8 (FFE)<\/td>\n<td>107.5 (UUU)<\/td>\n<\/tr>\n<tr>\n<td><strong>GPT-5<\/strong><\/td>\n<td>86.8<\/td>\n<td>82.8 (UUU)<\/td>\n<td>92.3 (WPP)<\/td>\n<\/tr>\n<tr>\n<td><strong>Claude Sonnet 4.5<\/strong><\/td>\n<td>79.8<\/td>\n<td>72.2 (FFE)<\/td>\n<td>84.7 (UOS)<\/td>\n<\/tr>\n<tr>\n<td><strong>Gemini 3<\/strong><\/td>\n<td>82.3<\/td>\n<td>77.8 (FFE)<\/td>\n<td>90.7 (WPP)<\/td>\n<\/tr>\n<tr>\n<td><strong>Gemini 2.5<\/strong><\/td>\n<td>101.9<\/td>\n<td>74.1 (FFE)<\/td>\n<td>129.0 (UOS)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Table 9: Mean RD of the LLM players across FFE, UOS, UUU and WPP score ratings.<\/p>\n<p>All top-three players converged to RD values below 96 on every framework (with the exception of Creative Brain&#8217;s 107.5 on UUU). Gemini 2.5&#8217;s RD reaches 129.0 on UOS, indicating that its originality performance is especially unpredictable \u2014 consistent with its higher error rate observed during calibration (Section 4.1.1).<\/p>\n<p>Claude Sonnet 4.5 has the lowest average RD (79.8), meaning its performance is the&nbsp;<em>most predictable<\/em>&nbsp;of all players \u2014 it reliably delivers a certain quality level even if that ceiling is lower than GPT-5 or Creative Brain on some dimensions.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\"><strong>4.6 Limitations<\/strong><\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Judge model bias.<\/strong>&nbsp;All evaluations were performed using a single LLM as the judge in each scoring sub-agent. While calibrated against human ground truth, any systematic blind spots in the underlying LLM could advantage or disadvantage specific players. 
Future work should include multi-judge ensembles.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Prompt parity vs. system parity.<\/strong>&nbsp;Creative Brain receives the same brief as other players but processes it through a multi-agent pipeline \u2014 it does more inference work per idea. The tournament tests&nbsp;<em>system-level<\/em>&nbsp;creative output, not cost-normalised or latency-normalised performance.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Framework coverage.<\/strong>&nbsp;Semiotics and OSCAI were excluded from per-framework Elo analysis due to high scoring variance (Section 4.3).<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Brief diversity.<\/strong>&nbsp;14 briefs span a meaningful range of categories but may not cover all creative challenge types (e.g. non-English markets).<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Three iterations.<\/strong>&nbsp;While sufficient for Glicko-2 convergence to low RD in most cases, additional iterations would further tighten confidence intervals.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\">5. Conclusions<\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The <strong>Creativity Evaluation Agent<\/strong> is deployed and usable via UI and API, and it produces reliable results. The multi-framework approach improves coverage and gives more actionable feedback than a single aggregate score. 
The path forward includes continued validation against broader and more diverse human panels, expansion of the tournament to track how model capabilities evolve across releases, and integration of the evaluation agent directly into creative workflows \u2014 not as a post-hoc judge, but as a real-time collaborator that scores, critiques, and refines ideas within the generation loop itself.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n","related_pods":[1454],"content_quarter":"Q1 2026"},"research_categories":[],"raw_acf":{"content":"<!-- wp:paragraph {\"className\":\"is-style-text-annotation\"} -->\n<p class=\"is-style-text-annotation\"><em>Creative ideas are the primary driver of advertising impact, yet evaluating them at scale remains stubbornly subjective \u2014 human panels are expensive, slow, inconsistent across evaluators, and impossible to run repeatedly as the volume of AI-generated concepts grows. The core problem is that creativity is multidimensional: a single aggregate score fails to capture whether an idea is original, strategically aligned, culturally resonant, or memorable, and without a shared, repeatable rubric, teams cannot meaningfully compare outputs across models, prompts, or campaigns. To address this, we built the <strong>Creativity Evaluation Agent<\/strong>, which scores marketing ideas in parallel across six established creativity frameworks \u2014 an Internal WPP, UOS, FFE, UUU, OSCAI, and Semiotics scores \u2014 using specialised Large Language Model (LLM) sub-agents with critic-refiner loops to ensure consistency, returning dimension-level scores alongside qualitative commentary in a single structured report. 
Calibrated against human expert ground truth, the system achieved a scoring error as low as 0.7 points (Gemini 3) with high repeatability on the internal WPP score framework (\u03c3 \u2248 0.21), and in a 210-match tournament across 14 global brands, it reliably differentiated creative quality between five frontier models \u2014 revealing that a specialised agentic creative system consistently outperformed vanilla LLMs given the same brief, giving marketing teams a fast, interpretable, and auditable way to benchmark and iterate on creative output before committing production resources.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>This document details the technical architecture, calibration methodology, and experimental design underlying the system built to address these gaps. For results and strategic findings, read&nbsp;our blog post instead.<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator -->\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator -->\n\n<!-- wp:heading {\"level\":1} -->\n<h1 class=\"wp-block-heading\">Problem statement and motivation<\/h1>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Everyone agrees creativity matters in marketing. Nobody agrees on how to measure it.\n\n\n\n\n\nPut the same campaign idea in front of five reviewers and you'll get five different scores. One loves the visual metaphor, another thinks the tagline falls flat, a third is just tired after reviewing thirty concepts before lunch. The scores reflect taste and circumstance as much as they reflect the work. This is fine when you're picking between two finalist campaigns in a boardroom \u2014 it falls apart the moment you need to evaluate at scale.\n\n\n\n\n\nAnd scale is exactly what modern marketing demands. Teams are generating more ideas than ever, increasingly with the help of generative AI. 
They need to&nbsp;<strong>screen hundreds of concepts quickly<\/strong>,&nbsp;<strong>understand what specifically makes one idea stronger than another<\/strong>, and&nbsp;<strong>benchmark creative output<\/strong>&nbsp;across different models, prompts, teams, and time periods. A human review panel can do the first job slowly, the second job inconsistently, and the third job barely at all\u2014while being expensive to convene every time.\n\n\n\n\n\nThe core issues are straightforward:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Subjectivity<\/strong>&nbsp;\u2014 without a shared rubric, two reviewers scoring the same idea can land in completely different places.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Scalability<\/strong>&nbsp;\u2014 manual evaluation doesn't survive contact with hundreds of ideas per sprint.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Feedback quality<\/strong>&nbsp;\u2014 a score without explanation is useless for iteration; explanations vary wildly across evaluators.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Cost and repeatability<\/strong>&nbsp;\u2014 assembling expert panels is slow and expensive, and running the same panel twice doesn't guarantee the same results.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p>What's missing is a system that can apply&nbsp;<strong>structured, reproducible, explainable<\/strong>&nbsp;creativity assessment across large volumes of work \u2014 fast enough to be useful and consistent enough to be trusted.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":1} -->\n<h1 class=\"wp-block-heading\">1. 
Introduction and solution overview<\/h1>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Evaluating marketing creativity at scale demands more than a single score from a single judge. The&nbsp;<strong>Creativity Evaluation Agent<\/strong>, built on&nbsp;<a href=\"https:\/\/google.github.io\/adk-docs\/\">Google's Agent Development Kit (ADK)<\/a>, extends the established&nbsp;<em>LLM-as-a-Judge<\/em>&nbsp;paradigm by introducing a multi-agent system in which specialised sub-agents score marketing ideas across&nbsp;<strong>six complementary creativity frameworks<\/strong>, each covering a distinct slice of what practitioners consider \"good creativity.\"\n\n\n\n\n\nEvery scoring sub-agent is grounded through&nbsp;<strong>few-shot examples<\/strong>&nbsp;that teach the underlying LLM how the creative dimension it owns should be measured, narrowing the gap between automated and human judgement. The system accepts&nbsp;<strong>text, image, video, and PDF<\/strong>&nbsp;inputs, runs framework evaluations in parallel, and returns dimension-level scores together with qualitative commentary. It is accessible through both an&nbsp;<strong>API<\/strong>&nbsp;and a&nbsp;<strong>web UI<\/strong>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":1} -->\n<h1 class=\"wp-block-heading\">2. Technical approach<\/h1>\n<!-- \/wp:heading -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\">2.1. Architecture overview<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>At a high level, a user's idea is received by a&nbsp;<strong>Root Agent<\/strong>, which parses the input and routes it to a&nbsp;<strong>Dynamic Parallel Orchestrator<\/strong>. 
The orchestrator spins up only the scoring pipelines the user has requested, runs them concurrently, and hands their outputs to a&nbsp;<strong>Report Agent<\/strong>&nbsp;that merges everything into a single structured JSON response.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"id\":1781,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} -->\n<figure class=\"wp-block-image size-large\"><img src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/1-1024x502.jpg\" alt=\"\" class=\"wp-image-1781\"\/><figcaption class=\"wp-element-caption\">Figure 1: Architecture of the Creativity Evaluation Agent.<\/figcaption><\/figure>\n<!-- \/wp:image -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\">2.2. Custom orchestration engine<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>The core implementation is the&nbsp;<strong><code>Creativity Evaluation Agent<\/code><\/strong>, a custom ADK&nbsp;<code>BaseAgent<\/code>. Its behaviour breaks down as follows:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Root Agent.<\/strong>&nbsp;The user-facing entry point. It can answer questions, hold a conversation, and \u2014 when the user supplies a creative idea \u2014 forward it for evaluation. It decides&nbsp;<em>which<\/em>&nbsp;frameworks to invoke based on the user's request.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Dynamic pipeline construction.<\/strong>&nbsp;Pipelines are&nbsp;<em>not<\/em>&nbsp;built ahead of time. 
Based on what the user asks for, the orchestrator assembles only the relevant evaluation chains, then executes them&nbsp;<strong>in parallel<\/strong>.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Critic\u2013refiner loop.<\/strong>&nbsp;After initial scoring, each pipeline runs a bounded critic\u2013refiner cycle (up to&nbsp;<strong>two iterations<\/strong>) in which a critic agent reviews the scores for obvious errors or inconsistencies. If the critic flags an issue, the refiner adjusts before the result is finalised.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Report Agent.<\/strong>&nbsp;Once all pipelines complete, this agent compiles dimension-level scores and qualitative commentary into a single, consistently formatted output. When the user has submitted multiple ideas, the report includes a&nbsp;<strong>comparative analysis<\/strong>.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p>Each scoring pipeline is built around an&nbsp;<code>LlmAgent<\/code>&nbsp;instance initialised with a detailed system message encoding its evaluation lens, together with&nbsp;<strong>few-shot examples<\/strong>&nbsp;that anchor outputs close to human scoring behaviour. Scores are emitted as&nbsp;<strong>continuous values<\/strong>&nbsp;(e.g. 1.2, 3.7) rather than discrete integers, matching the granularity of the ground-truth datasets. 
The six pipelines are:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>WPP Score Pipeline.<\/strong>&nbsp;Scores ideas against a proprietary WPP creativity framework built around four dimensions:<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li>how sharply the idea frames the business challenge, not just the marketing opportunity<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li>how boldly it challenges category convention and subverts clich\u00e9s<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li>how authentically the proposed solution fits the brand and resonates with the audience, and<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li>the scale of measurable growth and emotional response it is designed to deliver<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list --><\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li>Each dimension is scored 0\u20133 points, and the four scores are summed into an index ranging 0\u201312. The framework was calibrated against real-world campaign performance across multiple brands and markets. Few-shot examples are drawn from WPP's internal archive of historically scored campaigns.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong><a href=\"https:\/\/arxiv.org\/abs\/2510.04009\">Usefulness, Originality and Surprise<\/a> (UOS) Pipeline.<\/strong> Evaluates the classic definition of divergent creative value through three dimensions:&nbsp;<strong>Usefulness<\/strong>&nbsp;(does it solve a real problem and align with the brief's constraints?),&nbsp;<strong>Originality<\/strong>&nbsp;(does it approach the problem in a novel way?), and&nbsp;<strong>Surprise<\/strong>&nbsp;(does it deliver an unexpected twist that captures attention?). 
The three dimension scores are averaged into an overall UOS score.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong><a href=\"https:\/\/arxiv.org\/pdf\/2401.12491\">Fluency, Flexibility and Elaboration (FFE) Pipeline<\/a>.<\/strong> Quantifies the \"mental engine\" behind the idea through three dimensions:&nbsp;<strong>Fluency<\/strong>&nbsp;(how many distinct, relevant ideas are presented),&nbsp;<strong>Flexibility<\/strong>&nbsp;(how many different conceptual categories are explored), and&nbsp;<strong>Elaboration<\/strong>&nbsp;(how richly detailed and refined the idea is). Grounded in creativity research literature and benchmarked against marketing creativity datasets. Few-shot examples are generated using <a href=\"https:\/\/en.wikipedia.org\/wiki\/Torrance_Tests_of_Creative_Thinking\">Torrance-style<\/a> divergent thinking tasks (see Section 3.3).<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong><a href=\"https:\/\/arxiv.org\/pdf\/2509.09702\">Unique, Unexpected and Unforgettable (UUU) Pipeline<\/a>.<\/strong> Assesses brand longevity through the lens of a Creative Strategist:&nbsp;<strong>Unique<\/strong>&nbsp;(could only this idea deliver this message in this way?),&nbsp;<strong>Unexpected<\/strong>&nbsp;(does it subvert expectations and force re-evaluation?), and&nbsp;<strong>Unforgettable<\/strong>&nbsp;(does it create a defining moment that lives rent-free in the audience's mind?). 
Each dimension is scored as a continuous value and averaged into an overall UUU score.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong><a href=\"https:\/\/www.sciencedirect.com\/science\/article\/abs\/pii\/S1871187123001256?via%3Dihub\">OSCAI<\/a> Pipeline.<\/strong>&nbsp;A two-stage pipeline for measuring conceptual distance. First, a sub-agent extracts&nbsp;<strong>semantic relations<\/strong>&nbsp;from the idea (e.g.,&nbsp;<em>man \u2192 eats \u2192 apple<\/em>). Those relations are sent to the&nbsp;<strong>OSCAI API<\/strong>, maintained by the framework's original authors, which scores each relation's originality \u2014 distinguishing mundane links (<em>a chef cooks dinner<\/em>) from highly original relationships (<em>a clown teaches mathematics<\/em>). The returned scores quantify the creative leap at the heart of the idea.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Semiotics Pipeline.<\/strong>&nbsp;Applies the <a href=\"https:\/\/link.springer.com\/chapter\/10.1007\/978-1-4757-9700-8_3\">Saussurean principles<\/a> of sign systems to decode how meaning is constructed through cultural symbols. The sub-agent analyses&nbsp;<strong>Denotation<\/strong>&nbsp;(literal content),&nbsp;<strong>Connotation<\/strong>&nbsp;(implied meaning),&nbsp;<strong>Myth<\/strong>&nbsp;(cultural narratives reinforced or challenged), the&nbsp;<strong>Semiotic Relation<\/strong>&nbsp;(additive, contradictory, etc.),&nbsp;<strong>Risks or tensions<\/strong>, and produces a&nbsp;<strong>Semiotic Coherence Score<\/strong>&nbsp;(0\u20133). 
Unlike the other pipelines, no few-shot examples are used \u2014 evaluation relies on the model's inherent understanding of semiotic theory.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>2.3. Score normalization<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>The six frameworks operate on different native scales:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":false} -->\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Framework<\/strong><\/th><th><strong>Native scoring<\/strong><\/th><th><strong>Range<\/strong><\/th><\/tr><\/thead><tbody><tr><td>WPP Score<\/td><td>Sum of 4 dimensions, each 0\u20133<\/td><td>0\u201312<\/td><\/tr><tr><td>FFE<\/td><td>Average of 3 dimensions, each 0\u20133<\/td><td>0\u20133<\/td><\/tr><tr><td>UOS<\/td><td>Average of 3 dimensions, each 0\u20135<\/td><td>0\u20135<\/td><\/tr><tr><td>UUU<\/td><td>Average of 3 dimensions, each 0\u20135<\/td><td>0\u20135<\/td><\/tr><tr><td>OSCAI<\/td><td>Single score<\/td><td>0\u20135<\/td><\/tr><tr><td>Semiotics<\/td><td>Coherence score<\/td><td>0\u20133<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 1: Creativity scores and their range.\n\n\n\n\n\nDirect comparison or summation across frameworks is misleading without normalisation. 
The following procedure is applied:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list {\"ordered\":true} -->\n<ol class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Convert sums to averages.<\/strong>&nbsp;The WPP Score (a sum of four 0\u20133 dimensions) is divided by 4 to produce a 0\u20133 average, making it structurally comparable to other averaged scores.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Rescale to a common 0\u201310 range.<\/strong>&nbsp;Each framework's score is divided by its native maximum and multiplied by 10: <code>normalised_score = (raw_score \/ max_score) \u00d7 10<\/code><\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Composite total.<\/strong>&nbsp;The 4 normalised pillar scores (WPP, FFE, UOS, UUU) are summed into a composite total with a&nbsp;<strong>maximum of 40<\/strong>, each pillar contributing equally.<\/li>\n<!-- \/wp:list-item --><\/ol>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p><strong>OSCAI and Semiotics<\/strong>&nbsp;are reported as standalone scores and are&nbsp;<strong>not<\/strong>&nbsp;included in the composite total. This decision was made because OSCAI depends on an external API with different reliability characteristics, and both OSCAI and Semiotics showed higher inter-run variability (\u03c3 \u2248 0.80, see Section 3.6), which would add noise to the composite.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>2.4. 
Infrastructure and deployment<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:table {\"hasFixedLayout\":false} -->\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Concern<\/strong><\/th><th><strong>Technology<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Multi-agent orchestration<\/td><td><a href=\"https:\/\/google.github.io\/adk-docs\/\">ADK<\/a><\/td><\/tr><tr><td>Compute<\/td><td><a href=\"https:\/\/cloud.google.com\/run\">Google Cloud Run<\/a><\/td><\/tr><tr><td>LLM inference<\/td><td><a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex AI<\/a>&nbsp;\u2014&nbsp;<strong>Gemini 3 Pro<\/strong>&nbsp;as the primary judge model<\/td><\/tr><tr><td>Observability<\/td><td><a href=\"https:\/\/cloud.google.com\/logging\">Cloud Logging<\/a>&nbsp;+&nbsp;<a href=\"https:\/\/docs.cloud.google.com\/trace\/docs\">Cloud Trace<\/a><\/td><\/tr><tr><td>Container registry<\/td><td><a href=\"https:\/\/docs.cloud.google.com\/artifact-registry\/docs\/overview\">Artifact Registry<\/a><\/td><\/tr><tr><td>Agent-to-agent protocol<\/td><td><a href=\"https:\/\/a2a-protocol.org\/latest\/\">A2A<\/a><\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 2: The Google tech stack used.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\">3. Ground truth data and system evaluation<\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>3.1 Dataset overview<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Reliable automated scoring requires credible ground truth. Because no single public dataset covers all six frameworks, a combination of&nbsp;<strong>historical data<\/strong>&nbsp;and&nbsp;<strong>synthetically generated ground truth<\/strong>&nbsp;was used.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>3.2. 
WPP score \u2014 historical human judgements<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>The ground-truth dataset comes from&nbsp;<strong>WPP's internal archive<\/strong>&nbsp;of marketing campaign ideas submitted between 2020 and 2023. Creative professionals scored each idea across the WPP dimensions. From this corpus,&nbsp;<strong>6 scored ideas<\/strong>&nbsp;were selected as few-shot examples for the scoring sub-agent and&nbsp;<strong>10 additional ideas<\/strong>&nbsp;were reserved for its critic agent.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>3.3. FFE \u2014 synthetic data via Torrance-style tasks<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Few-shot examples for the FFE framework were generated using&nbsp;<strong>Gemini<\/strong>&nbsp;prompted with tasks modelled on the&nbsp;<a href=\"https:\/\/psycnet.apa.org\/doiLanding?doi=10.1037%2Ft05532-000\">Torrance Tests of Creative Thinking<\/a>. Each example pairs a divergent-thinking task, a response, a score, and a justification. For instance:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:quote -->\n<blockquote class=\"wp-block-quote\"><!-- wp:paragraph -->\n<p><strong>Example 1 (Score 0):<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Task:<\/strong>&nbsp;Please list unusual uses of a plastic bottle.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Response:<\/strong>&nbsp;1. Plant a seed in it. 2. Use it to water plants by poking holes. 3. Cut it in half to make a small planter. 4. 
Use it to store extra fertiliser.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Justification:<\/strong>&nbsp;All ideas fall under a single, narrow category (Gardening \/ Horticulture). No conceptual shift is demonstrated.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list --><\/blockquote>\n<!-- \/wp:quote -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>3.4. UOS &amp; UUU \u2014 community-sourced creative writing<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>No pre-existing ground truth was available for these two frameworks, so it was constructed in three steps:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list {\"ordered\":true} -->\n<ol class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Source corpus.<\/strong>&nbsp;The&nbsp;<a href=\"https:\/\/huggingface.co\/datasets\/euclaise\/WritingPrompts_preferences\">Creative Storytelling dataset<\/a>&nbsp;(stories from r\/WritingPrompts on Hugging Face) was used as raw material.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Quality stratification.<\/strong>&nbsp;Stories were sorted by upvotes;&nbsp;<strong>11 highly upvoted<\/strong>&nbsp;and&nbsp;<strong>11 low-voted<\/strong>&nbsp;examples were selected to represent the ends of the quality spectrum.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Automated annotation.<\/strong>&nbsp;These 22 stories, together with the formal definitions of the UOS and UUU dimensions, were fed to Gemini, which produced scored examples that serve as the few-shot ground truth for both frameworks.<\/li>\n<!-- \/wp:list-item --><\/ol>\n<!-- \/wp:list -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 
class=\"wp-block-heading\"><strong>3.5. Scoring format<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>All scoring sub-agents output&nbsp;<strong>continuous values<\/strong>&nbsp;(e.g. 1.2, 2.8) rather than rounding to integers. This decision was made to stay consistent with the WPP ground-truth scores, which are themselves continuous, and to preserve finer-grained distinctions between ideas.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\">3.6. Variability analysis<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Reliability was assessed by scoring 5 campaign ideas (one per AI model) for a popular beverage brand. Each idea was rated 3 times by the benchmark agent, and the standard deviation (std) across those 3 runs was computed per score. Each bar in Figure 2 shows the average std across all 5 models, so taller bars mean the agent scores that dimension less consistently.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li>The FFE (Fluency, Flexibility and Elaboration) score shows moderate variability, with an average std of ~0.3. Examining each constituent aspect of the FFE score (Figure 3) showed that this deviation is driven mainly by Fluency\u2019s variance. This may be due to how Fluency is defined: \u201cEvaluate how many distinct, relevant ideas or solutions are presented. 
Count only meaningful and contextually appropriate ones (avoid repetition or vague statements).\u201d \u2014 an inherently count-based metric where the boundary between \"distinct\" and \"overlapping\" ideas introduces subjective judgment for an LLM.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li>OSCAI &amp; Semiotics show markedly higher variability, with an average std of ~0.8, which directly motivated their exclusion from the composite tournament score.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:image {\"id\":1783,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} -->\n<figure class=\"wp-block-image size-large\"><img src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/2_small-1024x614.png\" alt=\"\" class=\"wp-image-1783\"\/><figcaption class=\"wp-element-caption\">Figure 2: Average score variability of each scoring sub-agent.<\/figcaption><\/figure>\n<!-- \/wp:image -->\n\n<!-- wp:image {\"id\":1784,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} -->\n<figure class=\"wp-block-image size-large\"><img src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/3_small-1024x639.png\" alt=\"\" class=\"wp-image-1784\"\/><figcaption class=\"wp-element-caption\">Figure 3: Score variability for the Fluency, Flexibility and Elaboration aspects that the FFE score measures.<\/figcaption><\/figure>\n<!-- \/wp:image -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>4. LLM evaluation tournament<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Full tournament results and key findings are covered in the blog post. 
This section documents the experimental design, technical implementation, per-framework results, and supplementary analysis.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>4.1 Experimental design<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><strong>Players.<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":false} -->\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Player<\/strong><\/th><th><strong>Description<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>GPT-5<\/strong><\/td><td>Standalone, standardised prompt<\/td><\/tr><tr><td><strong>Gemini 3<\/strong><\/td><td>Standalone, standardised prompt<\/td><\/tr><tr><td><strong>Gemini 2.5<\/strong><\/td><td>Standalone, standardised prompt<\/td><\/tr><tr><td><strong>Claude Sonnet 4.5<\/strong><\/td><td>Standalone, standardised prompt<\/td><\/tr><tr><td><strong>Creative Brain<\/strong><\/td><td>WPP's multi-agent ideation system, built on Gemini 3<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 3: The LLMs that took part in the evaluation tournament.\n\n\n\n\n\n<strong>Prompt design.<\/strong>&nbsp;Four standalone models received a standardised, neutral prompt to ensure a level playing field:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:quote -->\n<blockquote class=\"wp-block-quote\"><!-- wp:paragraph -->\n<p><em>\"Give me a creative marketing idea\/campaign based on the brief. 
Your output must have a title and three sections: Challenge, Core Idea, and Execution.\"<\/em><\/p>\n<!-- \/wp:paragraph --><\/blockquote>\n<!-- \/wp:quote -->\n\n<!-- wp:paragraph -->\n<p>The&nbsp;<strong>Creative Brain<\/strong>&nbsp;received the same brief but processed it through its own multi-agent ideation pipeline \u2014 testing whether agentic orchestration outperforms raw model capability given identical inputs.\n\n\n\n\n\n<strong>Briefs.<\/strong>&nbsp;Each model generated ideas for&nbsp;<strong>14 global brands<\/strong>&nbsp;spanning different categories and creative challenges.\n\n\n\n\n\n<strong>Iterations.<\/strong>&nbsp;Each model\u2013brand combination was run&nbsp;<strong>3 times<\/strong>, producing independent idea generations to account for output variance.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>4.2. Implementation details<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><strong>Scoring.<\/strong>&nbsp;Gemini 3 was selected as the LLM powering the scoring sub-agents. Rather than relying on side-by-side LLM comparisons (which can be inconsistent), the Creativity Evaluation Agent judged every idea&nbsp;<strong>independently<\/strong>, generating a structured creativity report with raw scores across all frameworks. This independent-scoring approach means each idea has a self-contained evaluation record that can be compared post hoc, eliminating ordering effects that plague pairwise LLM judging.\n\n\n\n\n\n<strong>Match simulation.<\/strong>&nbsp;An orchestration engine simulated head-to-head matches by computing the&nbsp;<strong>normalised score average<\/strong>&nbsp;across all evaluated frameworks for each idea on the same brief. Normalisation was applied per-framework to prevent any single framework from dominating (e.g., WPP scores range 0\u201312 while UUU averages range 0\u20135). 
For each brief, every pair of models was matched: the model with the higher normalised average won the match, the other lost. No tie-breaking rule was applied: in the event of an exact tie on normalised average, the match was recorded as a draw in <a href=\"https:\/\/www.notion.so\/Creativity-evaluation-agent-technical-summary-3302494e8bd9808e9260ec014c474de2?pvs=21\">Glicko-2<\/a> (outcome = 0.5).\n\n\n\n\n\n<strong>Glicko-2 parameters.<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":false} -->\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Parameter<\/strong><\/th><th><strong>Value<\/strong><\/th><th><strong>Rationale<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Initial rating (\u03bc\u2080)<\/td><td>1500<\/td><td>Standard Glicko-2 default<\/td><\/tr><tr><td>Initial rating deviation (RD\u2080)<\/td><td>350<\/td><td>Standard Glicko-2 default; reflects maximum uncertainty<\/td><\/tr><tr><td>Initial volatility (\u03c3)<\/td><td>0.06<\/td><td>Standard default; controls expected rating fluctuation per period<\/td><\/tr><tr><td>Convergence tolerance (\u03b5)<\/td><td>0.000001<\/td><td>For the iterative volatility update step<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 4: Glicko-2 parameter initialisation.\n\n\n\n\n\nRatings were updated after each complete round-robin cycle across all 14 briefs before proceeding to the next iteration. This means each \"rating period\" contained C(5,2) \u00d7 14 =&nbsp;<strong>140 matches<\/strong>&nbsp;(every pair of 5 models on every brief), and three rating periods were processed in sequence for the three iterations.\n\n\n\n\n\n<strong>Scale.<\/strong>&nbsp;Total matches: 3 iterations \u00d7 10 pairs \u00d7 14 briefs \u00d7 (1 match per pair-brief) =&nbsp;<strong>210 unique creative matches<\/strong>&nbsp;across the tournament per ranking method. 
For per-framework rankings, the same 210-match structure was applied but using the single-framework score (normalised) rather than the cross-framework composite.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>4.3 Per-framework leaderboards<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>To understand&nbsp;<em>where<\/em>&nbsp;each model's strengths and weaknesses lie, the same Glicko-2 tournament was run using each individual evaluation framework's scores as the match-outcome criterion. The results reveal meaningfully different competitive profiles across creative dimensions. We omit the WPP and the aggregate Glicko-2 ratings since they are available in the executive summary. Additionally, we omit the Semiotics and OSCAI ratings due to their high scoring variance.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>4.3.1 FFE (Fluency, Flexibility, Elaboration)<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:table {\"hasFixedLayout\":false} -->\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Rank<\/strong><\/th><th><strong>Player<\/strong><\/th><th><strong>Rating<\/strong><\/th><th><strong>RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td>\ud83e\udd47<\/td><td>GPT-5<\/td><td>1895<\/td><td>88.4<\/td><\/tr><tr><td>\ud83e\udd48<\/td><td><strong>Creative Brain<\/strong>&nbsp;(Gemini 3)<\/td><td>1651<\/td><td>71.8<\/td><\/tr><tr><td>\ud83e\udd49<\/td><td>Claude Sonnet 4.5<\/td><td>1583<\/td><td>72.2<\/td><\/tr><tr><td>4<\/td><td>Gemini 3<\/td><td>1358<\/td><td>77.8<\/td><\/tr><tr><td>5<\/td><td>Gemini 2.5<\/td><td>1285<\/td><td>74.1<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 5: Glicko-2 ratings on FFE score.\n\n\n\n\n\nGPT-5 leads comfortably on FFE metrics, 244 points ahead of Creative Brain. The gap between Creative Brain and Claude Sonnet 4.5 is narrow (~68 points), suggesting comparable idea elaboration depth. 
All rating deviation (RD) values are below 89, indicating stable ratings.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>4.3.2 UOS (Usefulness, Originality, Surprise)<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:table {\"hasFixedLayout\":false} -->\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Rank<\/strong><\/th><th><strong>Player<\/strong><\/th><th><strong>Rating<\/strong><\/th><th><strong>RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td>\ud83e\udd47<\/td><td><strong>Creative Brain<\/strong>&nbsp;(Gemini 3)<\/td><td>2006<\/td><td>95.5<\/td><\/tr><tr><td>\ud83e\udd48<\/td><td>GPT-5<\/td><td>1657<\/td><td>83.6<\/td><\/tr><tr><td>\ud83e\udd49<\/td><td>Gemini 3<\/td><td>1567<\/td><td>81.7<\/td><\/tr><tr><td>4<\/td><td>Claude Sonnet 4.5<\/td><td>1283<\/td><td>84.7<\/td><\/tr><tr><td>5<\/td><td>Gemini 2.5<\/td><td>1009<\/td><td>129.0<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 6: Glicko-2 ratings on UOS score.\n\n\n\n\n\nCreative Brain dominates UOS, with a&nbsp;<strong>349-point lead<\/strong>&nbsp;over GPT-5, the second-widest gap between the top two players in any framework (after UUU, at 375 points). Notably, standalone Gemini 3 ranks 3rd here (above Claude Sonnet 4.5), suggesting the base model has latent originality that Creative Brain's orchestration amplifies dramatically. 
Gemini 2.5's elevated RD (129.0) indicates greater uncertainty in its UOS rating.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>4.3.3 UUU (Unique, Unexpected, Unforgettable)<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:table {\"hasFixedLayout\":false} -->\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Rank<\/strong><\/th><th><strong>Player<\/strong><\/th><th><strong>Rating<\/strong><\/th><th><strong>RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td>\ud83e\udd47<\/td><td><strong>Creative Brain<\/strong>&nbsp;(Gemini 3)<\/td><td>2028<\/td><td>107.5<\/td><\/tr><tr><td>\ud83e\udd48<\/td><td>GPT-5<\/td><td>1653<\/td><td>82.8<\/td><\/tr><tr><td>\ud83e\udd49<\/td><td>Gemini 3<\/td><td>1441<\/td><td>79.0<\/td><\/tr><tr><td>4<\/td><td>Claude Sonnet 4.5<\/td><td>1377<\/td><td>78.2<\/td><\/tr><tr><td>5<\/td><td>Gemini 2.5<\/td><td>958<\/td><td>100.2<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 7: Glicko-2 ratings on UUU score.\n\n\n\n\n\nCreative Brain achieves its highest absolute rating (2028) on UUU \u2014 a&nbsp;<strong>375-point lead<\/strong>&nbsp;over GPT-5. This framework rewards ideas that are simultaneously unique, unexpected&nbsp;<em>and<\/em>&nbsp;unforgettable, which is consistent with the multi-agent pipeline's design goal of pushing for unexpected angles while grounding them in executable detail. 
Creative Brain's slightly elevated RD (107.5) suggests occasional variance, but the margin is decisive.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>4.4 Cross-framework analysis<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>The per-framework breakdowns reveal distinct competitive profiles:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":false} -->\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Player<\/strong><\/th><th><strong>FFE<\/strong><\/th><th><strong>UOS<\/strong><\/th><th><strong>UUU<\/strong><\/th><th><strong>WPP score<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Creative Brain<\/strong><\/td><td>1651 (2nd)<\/td><td><strong>2006 (1st)<\/strong><\/td><td><strong>2028 (1st)<\/strong><\/td><td><strong>1889 (1st)<\/strong><\/td><\/tr><tr><td><strong>GPT-5<\/strong><\/td><td><strong>1895 (1st)<\/strong><\/td><td>1657 (2nd)<\/td><td>1653 (2nd)<\/td><td>1858 (2nd)<\/td><\/tr><tr><td><strong>Claude Sonnet 4.5<\/strong><\/td><td>1583 (3rd)<\/td><td>1283 (4th)<\/td><td>1377 (4th)<\/td><td>1529 (3rd)<\/td><\/tr><tr><td><strong>Gemini 3<\/strong><\/td><td>1358 (4th)<\/td><td>1567 (3rd)<\/td><td>1441 (3rd)<\/td><td>1169 (4th)<\/td><\/tr><tr><td><strong>Gemini 2.5<\/strong><\/td><td>1285 (5th)<\/td><td>1009 (5th)<\/td><td>958 (5th)<\/td><td>962 (5th)<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 8: Elo ratings and ranks by framework.\n\n\n\n\n\n<strong>Key patterns:<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Creative Brain leads on most frameworks.<\/strong>&nbsp;It ranks 1st on UOS, UUU, and WPP score, dropping to 2nd only on FFE.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>GPT-5 is the second best when it comes to 
creativity.<\/strong>&nbsp;It ranks 1st on FFE and 2nd on UOS, UUU and WPP score, behind Creative Brain each time.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Claude Sonnet 4.5 has a spiky profile.<\/strong>&nbsp;It is competitive on FFE (3rd, 68 points behind Creative Brain) but drops to 4th on UOS and UUU. This suggests its outputs are well-elaborated but less likely to produce unexpected or surprising creative leaps.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Gemini 3 benefits substantially from the complex orchestration that Creative Brain introduces.<\/strong>&nbsp;Creative Brain outperforms standalone Gemini 3 on every framework: the smallest gap is 293 points (FFE) and the largest is 720 points (WPP score).<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>4.5 Rating deviation &amp; confidence<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Rating deviation (RD) indicates how confident the system is in each player's rating: lower RD means more predictable performance and a more stable estimate.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":false} -->\n<figure class=\"wp-block-table\"><table><thead><tr><th><strong>Player<\/strong><\/th><th><strong>Avg RD<\/strong><\/th><th><strong>Min RD<\/strong><\/th><th><strong>Max RD<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Creative Brain<\/strong><\/td><td>89.9<\/td><td>71.8 (FFE)<\/td><td>107.5 (UUU)<\/td><\/tr><tr><td><strong>GPT-5<\/strong><\/td><td>86.8<\/td><td>82.8 (UUU)<\/td><td>92.3 (WPP)<\/td><\/tr><tr><td><strong>Claude Sonnet 4.5<\/strong><\/td><td>79.8<\/td><td>72.2 (FFE)<\/td><td>84.7 (UOS)<\/td><\/tr><tr><td><strong>Gemini 
3<\/strong><\/td><td>82.3<\/td><td>77.8 (FFE)<\/td><td>90.7 (WPP)<\/td><\/tr><tr><td><strong>Gemini 2.5<\/strong><\/td><td>101.9<\/td><td>74.1 (FFE)<\/td><td>129.0 (UOS)<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p>Table 9: Mean, minimum and maximum RD of each player across the FFE, UOS, UUU and WPP score ratings.\n\n\n\n\n\nEvery player except Gemini 2.5 converged to RD values below 96 on every framework, with the single exception of Creative Brain's 107.5 on UUU. Gemini 2.5's RD reaches 129.0 on UOS, indicating that its originality performance is especially unpredictable, consistent with its higher error rate observed during calibration (Section 4.1.1).\n\n\n\n\n\nClaude Sonnet 4.5 has the lowest average RD (79.8), meaning its performance is the&nbsp;<em>most predictable<\/em>&nbsp;of all players; it reliably delivers a certain quality level, even though its rating sits below both GPT-5 and Creative Brain on every framework.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\"><strong>4.6 Limitations<\/strong><\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Judge model bias.<\/strong>&nbsp;All evaluations were performed using a single LLM as the judge in each scoring sub-agent. Although the judge is calibrated against human ground truth, any systematic blind spots in the underlying LLM could advantage or disadvantage specific players. Future work should include multi-judge ensembles.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Prompt parity vs. system parity.<\/strong>&nbsp;Creative Brain receives the same brief as other players but processes it through a multi-agent pipeline, so it does more inference work per idea. 
The tournament tests&nbsp;<em>system-level<\/em>&nbsp;creative output, not cost-normalised or latency-normalised performance.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Framework coverage.<\/strong>&nbsp;Semiotics and OSCAI were excluded from per-framework Elo analysis due to high scoring variance (Section 4.3).<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Brief diversity.<\/strong>&nbsp;The 14 briefs span a meaningful range of categories but may not cover all creative challenge types (e.g. non-English markets).<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Three iterations.<\/strong>&nbsp;Three iterations were sufficient for Glicko-2 convergence to low RD in most cases, but additional iterations would further tighten confidence intervals.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\">5. Conclusions<\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>The <strong>Creativity Evaluation Agent<\/strong> is deployed and usable via UI and API, and its scores are repeatable (\u03c3 \u2248 0.21 on the internal WPP score framework). The multi-framework approach improves coverage and gives more actionable feedback than a single aggregate score. 
The path forward includes continued validation against broader and more diverse human panels, expansion of the tournament to track how model capabilities evolve across releases, and integration of the evaluation agent directly into creative workflows \u2014 not as a post-hoc judge, but as a real-time collaborator that scores, critiques, and refines ideas within the generation loop itself.<\/p>\n<!-- \/wp:paragraph -->","content_quarter":"Q1 2026","related_pods":["1454"],"featured":"","legacy_perspective_source_id":""},"_links":{"self":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/research_feed\/1654","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/research_feed"}],"about":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/types\/research_feed"}],"author":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/users\/18"}],"acf:post":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/research_pods\/1454"}],"wp:attachment":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1654"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1654"},{"taxonomy":"content_type","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcontent_types&post=1654"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=1654"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}