{"id":357,"date":"2026-03-30T11:49:37","date_gmt":"2026-03-30T11:49:37","guid":{"rendered":"https:\/\/thelab.wppresolve.com\/?p=357"},"modified":"2026-04-21T22:54:18","modified_gmt":"2026-04-21T22:54:18","slug":"why-your-data-genome-may-need-a-check-up-and-how-a-data-discovery-agent-can-help","status":"publish","type":"post","link":"https:\/\/cms.research.wpp.com\/?p=357","title":{"rendered":"Why your data genome may need a check-up &#8211; and how a Data Discovery Agent can help"},"content":{"rendered":"\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>We collected every sample. We sequenced nothing.<\/strong><\/h2>\n\n\n\n<p>The story of big data is,&nbsp;at its core,&nbsp;a twenty-year experiment in accumulation.&nbsp;Google&#8217;s landmark papers on the Google File System&nbsp;(2003)&nbsp;and MapReduce&nbsp;(2004)&nbsp;showed the world how to store and process data at unprecedented scale&nbsp;[<a href=\"#ref1\" data-type=\"internal\" data-id=\"#ref1\">1<\/a>] [<a href=\"#ref2\" data-type=\"internal\" data-id=\"#ref2\">2<\/a>].&nbsp;The Hadoop ecosystem followed.&nbsp;Then the cloud era&nbsp;&#8211;&nbsp;BigQuery,&nbsp;Redshift,&nbsp;Snowflake&nbsp;&#8211;&nbsp;made collection cheaper,&nbsp;faster,&nbsp;and infinitely more elastic.&nbsp;The message was unambiguous:&nbsp;<em>store everything, figure it out later.<\/em><\/p>\n\n\n\n<p>And store they did.&nbsp;IDC projects that global data creation will grow at roughly 25%&nbsp;compound annual rate through 2028&nbsp;[<a href=\"#ref3\" data-type=\"internal\" data-id=\"#ref3\">3<\/a>].&nbsp;Companies rushed to stake claims on every click,&nbsp;every impression,&nbsp;every conversion,&nbsp;convinced the data itself was the gold.&nbsp;But gold in the ground is worthless without extraction,&nbsp;refining,&nbsp;and assay.&nbsp;Most enterprises skipped those steps.&nbsp;Industry research finds that up to 90%&nbsp;of generated data remains unused,&nbsp;a 
phenomenon analysts have dubbed&nbsp;&#8220;dark data&#8221; [<a href=\"#ref4\" data-type=\"internal\" data-id=\"#ref4\">4<\/a>].&nbsp;The storage invoice arrives monthly.&nbsp;The insight dividend has not been declared yet&nbsp;[<a href=\"#ref5\" data-type=\"internal\" data-id=\"#ref5\">5<\/a>] [<a href=\"#ref6\" data-type=\"internal\" data-id=\"#ref6\">6<\/a>].<\/p>\n\n\n\n<p>Enter the paradigm shift.&nbsp;Large Language Models&nbsp;(LLMs)&nbsp;&#8211;&nbsp;GPT-3&nbsp;(2020),&nbsp;ChatGPT&nbsp;(2022),&nbsp;Google&#8217;s Gemini&nbsp;(2023-2024)&nbsp;&#8211;&nbsp;represent the first technology that can&nbsp;<em>read<\/em>&nbsp;data at the semantic level&nbsp;[<a href=\"#ref7\" data-type=\"internal\" data-id=\"#ref7\">7<\/a>], understanding <em>meaning<\/em> rather than merely executing queries.&nbsp;A data warehouse full of unanalysed tables is,&nbsp;in many respects,&nbsp;like an unsequenced genome.&nbsp;The information is all there&nbsp;&#8211;&nbsp;every column a base pair,&nbsp;every table a chromosome&nbsp;&#8211;&nbsp;but without annotation,&nbsp;it&#8217;s just a very expensive string of letters.<\/p>\n\n\n\n<p>Enterprise data is at that same inflection point. The sequencer has arrived. The genome is finally being read.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">2,709 base pairs. No annotation. 
Good Luck, Have Fun, Don\u2019t Die.<\/h2>\n\n\n\n<p>Every ad-tech data team carries an invisible tax:&nbsp;the hours,&nbsp;the errors,&nbsp;and the opportunity cost of manually reconciling platform schemas that were never designed to talk to each other.&nbsp;Nor is this a one-time project:&nbsp;it is a recurring levy.&nbsp;Every time a platform updates its API,&nbsp;every time a new data source is onboarded,&nbsp;every time someone asks&nbsp;<em>&#8220;do we even have geo data from Pinterest?&#8221;<\/em>&nbsp;&#8211;&nbsp;the tax collector comes knocking.<\/p>\n\n\n\n<p>At the heart of our work sits a centralised advertising data warehouse that aggregates campaign performance,&nbsp;creative assets,&nbsp;audience signals,&nbsp;geographic breakdowns,&nbsp;and brand metadata from every major digital advertising platform our organisation operates on.&nbsp;It is,&nbsp;in effect,&nbsp;the single source of truth for understanding how creative content performs across the entire digital media landscape.&nbsp;It is also,&nbsp;as we were about to discover,&nbsp;a genome that had never been sequenced.<\/p>\n\n\n\n<p>We operate across fourteen advertising platforms&nbsp;&#8211;&nbsp;spanning major social,&nbsp;search,&nbsp;programmatic,&nbsp;and measurement partners.&nbsp;Each has its own schema.&nbsp;Facebook calls advertising expenditure <code>amount_spent<\/code>. Google calls it <code>cost_micros<\/code> (and means it, in millionths of a currency unit). 
TikTok simply says <code>spend<\/code>.&nbsp;They all mean roughly the same thing,&nbsp;but to a database and to a human analyst trying to build a cross-platform report,&nbsp;they might as well be different species encoding the same protein with entirely different codons&nbsp;(\/\u02c8k\u0259\u028ad\u0252n\/&nbsp;&#8211;&nbsp;three-letter DNA sequences that each specify the same amino acid,&nbsp;just spelled differently by each organism).<\/p>\n\n\n\n<p>Data integration platforms and transformation layers help&nbsp;&#8211;&nbsp;they abstract some of the raw API complexity and reshape the data before it lands in the warehouse.&nbsp;In theory,&nbsp;these layers converge towards clarity.&nbsp;In practice,&nbsp;they shift the problem rather than solve it.&nbsp;By the time data reaches the staging tables analysts actually query,&nbsp;the original platform semantics have been filtered through multiple translation layers&nbsp;&#8211;&nbsp;each managed by a different team,&nbsp;each with its own conventions.&nbsp;The result is not less complexity;&nbsp;it is&nbsp;<em>distributed<\/em>&nbsp;complexity.<\/p>\n\n\n\n<p>Now scale the problem.&nbsp;Our Google BigQuery data warehouse spans 179 tables and&nbsp;<strong>2,709 columns<\/strong>&nbsp;across fourteen platforms,&nbsp;holding over 15.8 billion records.&nbsp;And those columns need to be understood by <em>meaning<\/em>, not merely by name.<\/p>\n\n\n\n<p>One downstream use case requires them to be mapped to a canonical schema organised around five&nbsp;<strong>modalities<\/strong>&nbsp;&#8211;&nbsp;high-level categories that describe what a piece of advertising data&nbsp;<em>is about<\/em>:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Performance<\/strong>&nbsp;&#8211; the numbers that tell you how a campaign did: spend, impressions, clicks, video views, conversions, leads.<\/li>\n\n\n\n<li><strong>Creative<\/strong>&nbsp;&#8211; what the ad actually looked like: creative IDs, asset paths, ad names, format 
types.<\/li>\n\n\n\n<li><strong>Audience<\/strong>&nbsp;&#8211; who saw it: gender, age group, interests, custom audiences, behavioural segments.<\/li>\n\n\n\n<li><strong>Geo<\/strong>&nbsp;&#8211; where they saw it: countries, regions, cities, postal codes, designated market areas.<\/li>\n\n\n\n<li><strong>Brand<\/strong>&nbsp;&#8211; who paid for it: brand name, advertiser identity.<\/li>\n<\/ul>\n\n\n\n<p>Between them,&nbsp;the five modalities break down into&nbsp;<strong>24 sub-modalities<\/strong>&nbsp;&#8211;&nbsp;the individual genes in our metaphor.&nbsp;That&#8217;s 2,709 potential annotations to make,&nbsp;verify,&nbsp;and maintain.<\/p>\n\n\n\n<p>A skilled data analyst doing this manually needs to:&nbsp;open each table schema,&nbsp;read every column name and type,&nbsp;pull sample data to disambiguate,&nbsp;decide which canonical sub-modality it maps to,&nbsp;document the mapping,&nbsp;and then repeat thousands more times&nbsp;&#8211;&nbsp;praying that nothing changed since they started.<\/p>\n\n\n\n<p>To make that concrete:&nbsp;imagine staring at a table with columns&nbsp;<code>ad_id<\/code>,&nbsp;<code>date_start<\/code>, <code>spend<\/code>, <code>inline_link_clicks<\/code>, <code>cpc<\/code>, <code>frequency<\/code>, <code>reach<\/code>,&nbsp;<code>account_currency<\/code>.&nbsp;Is&nbsp;<code>spend<\/code>&nbsp;total or daily?&nbsp;Is&nbsp;<code>cpc<\/code>&nbsp;cost-per-click or cost-per-conversion?&nbsp;Does&nbsp;<code>reach<\/code>&nbsp;mean unique users or total impressions?&nbsp;You run a&nbsp;<code>SELECT * LIMIT 10<\/code>,&nbsp;squint at the numbers,&nbsp;cross-reference the API docs,&nbsp;and after twenty minutes you&#8217;ve mapped eight columns from one table.&nbsp;Only 2,701 to go&nbsp;&#8211;&nbsp;across thirteen more platforms.<\/p>\n\n\n\n<p>Conservatively,&nbsp;that&#8217;s&nbsp;<strong>two to four weeks of focused work<\/strong>&nbsp;for a single pass.&nbsp;And the result starts decaying immediately.&nbsp;Platforms update schemas 
quarterly,&nbsp;sometimes monthly.&nbsp;By the time the spreadsheet is&nbsp;&#8220;done,&#8221;&nbsp;it&#8217;s already wrong.<\/p>\n\n\n\n<p>If you&#8217;ve ever opened a file called&nbsp;<code>FINAL_mapping_v3_actually_final.xlsx<\/code>,&nbsp;this section is for you.<\/p>\n\n\n\n<p>In genomic terms:&nbsp;it&#8217;s annotating a genome by hand,&nbsp;one base pair at a time,&nbsp;with no automated sequencer and no reference genome.&nbsp;The Human Genome Project proved you&nbsp;<em>can<\/em>&nbsp;do it that way.&nbsp;It just took thirteen years and&nbsp;$2.7 billion.&nbsp;We wanted something faster.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">The sequencer has entered the lab<\/h2>\n\n\n\n<p>What used to take a team of data analysts two to four weeks of grinding manual work&nbsp;(opening schemas,&nbsp;pulling samples,&nbsp;cross-referencing documentation,&nbsp;writing mappings into a spreadsheet)&nbsp;now runs in&nbsp;<em>hours<\/em>.&nbsp;The Data Discovery Agent processes all fourteen platforms,&nbsp;all 179 tables,&nbsp;all 2,709 columns,&nbsp;and delivers a fully annotated mapping with confidence scores.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"572\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/manual-vs-automatic2-1024x572.jpg\" alt=\"\" class=\"wp-image-359\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/manual-vs-automatic2-1024x572.jpg 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/manual-vs-automatic2-300x167.jpg 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/manual-vs-automatic2-768x429.jpg 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/manual-vs-automatic2-1536x857.jpg 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/manual-vs-automatic2-2048x1143.jpg 2048w\" sizes=\"auto, 
(max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><em>Figure 1 &#8211; Manual mapping workflow vs. Agent pipeline: weeks of spreadsheet wrangling compressed into a single automated run.<\/em><\/figcaption><\/figure>\n\n\n\n<p>The Data Discovery Agent is not a chatbot, nor a single prompt you paste into ChatGPT and hope for the best.&nbsp;It is a multi-stage autonomous platform that orchestrates a complete end-to-end pipeline of discovery,&nbsp;sampling,&nbsp;reasoning,&nbsp;and reporting with minimal human intervention.&nbsp;You point it at a data warehouse,&nbsp;define your canonical schema,&nbsp;and it does the rest:&nbsp;connecting to BigQuery,&nbsp;fetching every table and column,&nbsp;extracting sample rows for grounding,&nbsp;crawling external documentation for cross-reference,&nbsp;building structured prompts,&nbsp;invoking LLM inference,&nbsp;parsing the results,&nbsp;scoring confidence,&nbsp;calculating completeness,&nbsp;and rendering the whole thing in an interactive dashboard.&nbsp;The human&#8217;s job shifts from&nbsp;<em>doing the mapping<\/em>&nbsp;to&nbsp;<em>reviewing the mapping<\/em>:&nbsp;approving high-confidence matches,&nbsp;investigating ambiguous ones,&nbsp;and making strategic decisions about data gaps.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">So, what <em>is<\/em> an agent?<\/h3>\n\n\n\n<p>In the AI and machine learning community,&nbsp;an&nbsp;<strong>agent<\/strong>&nbsp;is formally defined as a system that: (1)&nbsp;perceives its environment<\/p>\n\n\n\n<p>(2)&nbsp;reasons about what actions to take<\/p>\n\n\n\n<p>(3)&nbsp;executes those actions autonomously<\/p>\n\n\n\n<p>(4)&nbsp;iterates towards a goal&nbsp;&#8211;&nbsp;often using tools and external data sources along the way.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"1024\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/agentdef-1024x1024.jpg\" 
alt=\"\" class=\"wp-image-360\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/agentdef-1024x1024.jpg 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/agentdef-300x300.jpg 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/agentdef-150x150.jpg 150w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/agentdef-768x768.jpg 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/agentdef-1536x1536.jpg 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/agentdef.jpg 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><em>Figure 2 &#8211; A simple visual of how an AI agent operates: Perceive \u2192 Reason \u2192 Act \u2192 Iterate.<\/em><\/figcaption><\/figure>\n\n\n\n<p><\/p>\n\n\n\n<p>By that definition,&nbsp;our Data Discovery Agent is genuinely an agent:&nbsp;it perceives the warehouse&nbsp;(schema fetching),&nbsp;reasons about semantics&nbsp;(LLM inference with structured prompts),&nbsp;takes action&nbsp;(mapping columns,&nbsp;scoring confidence,&nbsp;crawling documentation),&nbsp;and iterates&nbsp;(batch processing across platforms and modalities,&nbsp;second-pass enrichment recommendations).&nbsp;This is far more than a single LLM call wrapped in a REST endpoint.&nbsp;It&#8217;s an orchestrated pipeline of perception-reasoning-action loops.<\/p>\n\n\n\n<p>Yes,&nbsp;the word&nbsp;&#8220;agent&#8221;&nbsp;has been stretched to meaninglessness in 2025-2026&nbsp;&#8211;&nbsp;every API wrapper calls itself one.&nbsp;We use it deliberately,&nbsp;with the formal definition above,&nbsp;because our system genuinely exhibits autonomous multi-step goal-directed behaviour.&nbsp;If your&nbsp;&#8220;agent&#8221;&nbsp;is a prompt template with a for-loop,&nbsp;&#8220;script&#8221;&nbsp;might be a more honest label.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 
class=\"wp-block-heading\">Inside the sequencer: how we taught an LLM to read a data warehouse<\/h2>\n\n\n\n<p>Teaching an LLM to read a data warehouse is not as simple as dumping a schema into a prompt.&nbsp;Column names are ambiguous.&nbsp;Data types are insufficient.&nbsp;Context is everything.&nbsp;Our pipeline gives the model maximum context at every step&nbsp;&#8211;&nbsp;schema structure,&nbsp;real sample values,&nbsp;external documentation,&nbsp;and a precisely defined target taxonomy&nbsp;&#8211;&nbsp;so that its annotations are grounded,&nbsp;not hallucinated.&nbsp;In genomic terms, this is the equivalent of handing a bioinformatician the raw sequence alongside the reference genome, the medical textbook, and a clear checklist of which genes to look for.<\/p>\n\n\n\n<p>The following five stages take us from raw warehouse to annotated genome.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Stage 1 &#8211; Discovery<\/strong><\/h3>\n\n\n\n<p>The agent connects to Google BigQuery and discovers all staging datasets&nbsp;&#8211;&nbsp;in our case,&nbsp;fourteen platforms matching a configurable naming pattern.&nbsp;For each dataset,&nbsp;it fetches the full table inventory:&nbsp;column names,&nbsp;data types,&nbsp;row counts.&nbsp;Intelligent filters exclude temporary artefacts&nbsp;(tables prefixed with&nbsp;<code>stg_<\/code>,&nbsp;suffixed with&nbsp;<code>_tmp<\/code>&nbsp;or&nbsp;<code>_dbt_tmp<\/code>)&nbsp;&#8211;&nbsp;the genomic equivalent of filtering out sequencing adapters before analysis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Stage 2 &#8211; Sampling<\/strong><\/h3>\n\n\n\n<p>Column names alone are often ambiguous.&nbsp;The agent resolves this by extracting actual sample data rows from each table,&nbsp;using background-threaded parallel queries.&nbsp;These real values become critical evidence for the LLM:&nbsp;seeing&nbsp;<code>14.50<\/code>,&nbsp;<code>0.83<\/code>,&nbsp;<code>127.99<\/code>&nbsp;in 
a&nbsp;<code>cost<\/code>&nbsp;column strongly suggests monetary spend,&nbsp;not an ID field.&nbsp;Column names are the labels on the test tubes;&nbsp;sample data is what&#8217;s actually inside them.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Stage 3 &#8211; Documentation crawling<\/strong><\/h3>\n\n\n\n<p>The agent crawls the official connector documentation for each platform&nbsp;&#8211;&nbsp;the authoritative&nbsp;&#8220;Most Used Fields&#8221;&nbsp;pages published by our data integration provider&nbsp;&#8211;&nbsp;and extracts a structured field list:&nbsp;name,&nbsp;description,&nbsp;dimension or metric.&nbsp;This gives the agent a reference genome to cross-reference against what&nbsp;<em>actually<\/em>&nbsp;exists in the warehouse:&nbsp;the platform&#8217;s own user manuals turned into a checklist of expected signals.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Stage 4 &#8211; LLM inference<\/strong><\/h3>\n\n\n\n<p>This is the core intellectual step.&nbsp;The agent builds versioned prompts containing the table schemas,&nbsp;sample data,&nbsp;and the target modality definition.&nbsp;These are sent to Google&#8217;s Batch Inference Service&nbsp;(Vertex AI).&nbsp;The model returns structured JSON:&nbsp;for each canonical sub-modality,&nbsp;it identifies matching columns,&nbsp;assigns a confidence score&nbsp;(0.0\u20131.0),&nbsp;and provides natural-language reasoning for each match.&nbsp;This is the sequencing run itself&nbsp;&#8211;&nbsp;the machine reading every base pair and producing an annotated readout with quality scores.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Stage 5 &#8211; Enrichment and dashboard (the genome report)<\/strong><\/h3>\n\n\n\n<p>A second-pass&nbsp;&#8220;Enrichment Recommender&#8221;&nbsp;inference cross-references the warehouse mappings from Stage 4 against the connector field lists from Stage 3.&nbsp;For every sub-modality where warehouse data is incomplete,&nbsp;it recommends which connector fields to 
enable&nbsp;&#8211;&nbsp;closing the loop from&nbsp;<em>&#8220;what do we have?&#8221;<\/em>&nbsp;to&nbsp;<em>&#8220;what should we turn on?&#8221;<\/em><\/p>\n\n\n\n<p>All results render in a FastAPI&nbsp;+&nbsp;Jinja2\/HTMX dashboard:&nbsp;the karyotype (\/\u02c8kari\u0259(\u028a)t\u028c\u026ap\/&nbsp;&#8211;&nbsp;the number and visual appearance of the chromosomes in the cell nuclei of an organism or species),&nbsp;the gene annotation browser,&nbsp;and the clinical recommendations in one interface.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">The genome report: what 2,709 annotations revealed<\/h2>\n\n\n\n<p>When the sequencing run completes,&nbsp;you don&#8217;t get a wall of text.&nbsp;You get a genome report in the shape of a visual,&nbsp;interactive,&nbsp;filterable dashboard that tells you,&nbsp;at a glance,&nbsp;which parts of your data genome are fully annotated and which have gaps.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"589\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/dashboard-screenshot-1024x589.png\" alt=\"\" class=\"wp-image-361\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/dashboard-screenshot-1024x589.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/dashboard-screenshot-300x173.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/dashboard-screenshot-768x442.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/dashboard-screenshot-1536x883.png 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/dashboard-screenshot.png 1565w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><em>Figure 3 &#8211; Platform completeness scorecards: the karyotype of our data warehouse at a glance.<\/em><\/figcaption><\/figure>\n\n\n\n<p>Each platform receives 
a&nbsp;&#8220;completeness&#8221;&nbsp;score:&nbsp;how much of the canonical schema is actually present.&nbsp;The approach is deliberately simple&nbsp;&#8211;&nbsp;a presence\/absence assay.&nbsp;For each platform,&nbsp;the denominator is 24&nbsp;(total sub-modalities).&nbsp;The numerator is how many have&nbsp;<em>at least one<\/em>&nbsp;column mapped above a configurable confidence threshold.&nbsp;A sub-modality is either present or it isn&#8217;t,&nbsp;like a gene that&#8217;s either expressed or silent.&nbsp;This also handles deduplication:&nbsp;if three tables each have a&nbsp;<code>spend<\/code>&nbsp;column mapped to&nbsp;<code>performance__spend_usd<\/code>,&nbsp;the sub-modality counts once,&nbsp;not three times.&nbsp;The formula cares about breadth of coverage,&nbsp;not depth of redundancy.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"437\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/submodalities-1024x437.png\" alt=\"\" class=\"wp-image-362\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/submodalities-1024x437.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/submodalities-300x128.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/submodalities-768x328.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/submodalities-1536x656.png 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/submodalities.png 1548w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\"><em>Figure 4 &#8211; The interactive dashboard: Submodality matching recommendation cards<\/em><\/figcaption><\/figure>\n\n\n\n<p>Across all fourteen platforms,&nbsp;average mapping completeness stands at&nbsp;<strong>58%<\/strong>.&nbsp;The top performer reached 75%&nbsp;(18\/24 sub-modalities covered),&nbsp;while the platform with the greatest room for 
improvement sat at 25%&nbsp;(6\/24).&nbsp;Four platforms cluster at 70%,&nbsp;several sit in the 50\u201366%&nbsp;range,&nbsp;and a handful fall below 50%.<\/p>\n\n\n\n<p>Some findings were pleasant surprises&nbsp;&#8211;&nbsp;platforms assumed to lack audience data turned out to carry age range and gender columns that mapped cleanly.&nbsp;The data was there all along;&nbsp;it just hadn&#8217;t been annotated.&nbsp;Other findings were equally valuable:&nbsp;Performance metrics&nbsp;(spend,&nbsp;impressions,&nbsp;clicks)&nbsp;are well-represented across the board,&nbsp;but Audience,&nbsp;Geo,&nbsp;and Brand modalities show significant gaps,&nbsp;particularly in measurement and programmatic categories.<\/p>\n\n\n\n<p>At 58%,&nbsp;the genome is more than half-sequenced&nbsp;&#8211;&nbsp;but meaningful blind spots remain.&nbsp;That&#8217;s a&nbsp;<em>useful<\/em>&nbsp;headline.&nbsp;It is unequivocally better to know your genome&#8217;s current state than to assume it&#8217;s healthy without running the test.&nbsp;The scorecard turns vague unease into actionable intelligence&nbsp;(&#8220;We&#8217;re missing geo data on most platforms,&nbsp;and here&#8217;s exactly which fields to enable&#8221;).&nbsp;It also delivers a reality check:&nbsp;the foundation is stronger than expected in some areas and weaker than assumed in others.&nbsp;But now we know exactly where to invest.<\/p>\n\n\n\n<p>The Enrichment Recommender closes the diagnostic loop. 
Rather than simply flagging missing geo data, it prescribes a fix: <em>&#8220;The connector for this platform offers a field called <code>country_code<\/code> (dimension); enable it, and your geo coverage improves.&#8221;<\/em> Diagnosis and prescription in one step, and a natural bridge to the next phase of our work.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">What we&#8217;re building now<\/h2>\n\n\n\n<p>The genome report is valuable on its own.&nbsp;But a sequencer that only reads and never acts is leaving half the value on the table.&nbsp;We are currently working on three initiatives that extend the agent from discovery into action;&nbsp;from reading the genome to editing it.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>1. Productionalisation &#8211; from lab bench to clinical practice<\/strong><\/h3>\n\n\n\n<p>Today the agent runs as an internal tool,&nbsp;triggered manually.&nbsp;We are moving it to production-grade:&nbsp;CI\/CD pipelines,&nbsp;scheduled re-scans&nbsp;(daily,&nbsp;weekly,&nbsp;on-schema-change),&nbsp;and automated alerting when a platform&#8217;s schema mutates.&nbsp;The genomic equivalent of detecting a mutation before it causes problems downstream.&nbsp;A sequencer that runs itself.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>2. Automatic warehouse verification &#8211; is the gene actually expressed?<\/strong><\/h3>\n\n\n\n<p>The Enrichment Recommender tells you which fields <em>exist<\/em> in the documentation. But existence does not mean expression. A field can be listed, enabled in the connector, and still arrive empty because the advertiser never populated it.<\/p>\n\n\n\n<p>We are building a verification layer that connects to the data integration platform&#8217;s API,&nbsp;pulls each connector&#8217;s field list,&nbsp;and checks the warehouse:&nbsp;<em>Is this column actually populated? 
What percentage of rows have non-null values?<\/em><\/p>\n\n\n\n<p>In DNA terms:&nbsp;it&#8217;s one thing to know a gene exists in the reference genome;&nbsp;it&#8217;s another to confirm it&#8217;s expressed in your organism.&nbsp;The practical outcome:&nbsp;instead of blind recommendations,&nbsp;the agent will say&nbsp;<em>&#8220;Enable&nbsp;<code>country_code<\/code>&nbsp;&#8211; 87% of advertisers populate it&#8221;<\/em>&nbsp;versus&nbsp;<em>&#8220;<code>dma_region<\/code>&nbsp;has only 3% non-null rows, probably not worth the effort.&#8221;<\/em><\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>3. Text-to-SQL &#8211; chatting with the genome<\/strong><\/h3>\n\n\n\n<p>Discovery answers&nbsp;<em>&#8220;what data do we have?&#8221;<\/em>&nbsp;The next question is always&nbsp;<em>&#8220;what does the data say?&#8221;<\/em>&nbsp;We are building a natural-language-to-SQL layer:&nbsp;ask&nbsp;<em>&#8220;What was our total spend across social platforms last quarter?&#8221;<\/em>&nbsp;and the agent translates it to a BigQuery query&nbsp;&#8211;&nbsp;using the column mappings it already knows&nbsp;&#8211;&nbsp;executes it,&nbsp;and returns a human-readable answer with the underlying SQL for transparency.<\/p>\n\n\n\n<p>This positions the platform as a full conversational data layer:&nbsp;discover the genome,&nbsp;annotate it,&nbsp;score completeness,&nbsp;recommend enrichment,&nbsp;and then&nbsp;<em>interrogate<\/em>&nbsp;it in natural language.&nbsp;From specimen collection to clinical consultation,&nbsp;without leaving the lab.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\">The genome is sequenced. The real work begins.<\/h2>\n\n\n\n<p>For two decades, the advertising industry collected data the way early geneticists collected specimens: enthusiastically, comprehensively, and with very little annotation. 
We built warehouses the size of genomes and staffed them with analysts armed with spreadsheets &#8211; the biological equivalent of a magnifying glass and a very long afternoon.<\/p>\n\n\n\n<p>The Data Discovery Agent is what happens when you finally bring a sequencer into the lab. It reads every base pair, annotates every gene, scores every alignment, flags every gap, and delivers a genome report that turns &#8220;we have data&#8221; into &#8220;we understand our data.&#8221; And unlike the Human Genome Project, it does not take thirteen years. It takes hours.<\/p>\n\n\n\n<p>The genome is sequenced. The annotations are in. The gaps are mapped. Now the real work begins: not the tedium of cataloguing, but the science of insight. And if the sequencer keeps getting smarter (spoiler: it will), the next chapter will be about more than reading the data. It will be about having a conversation with it.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p><strong>Ready to look under the hood?<\/strong>&nbsp;See&nbsp;<a href=\"https:\/\/research.wpp.com\/pods\/data-discovery-agent-pod\" data-type=\"link\" data-id=\"https:\/\/research.wpp.com\/pods\/data-discovery-agent-pod\">Data Discovery Agent &#8211; Technical Walkthrough<\/a>&nbsp;for the full engineering deep-dive&nbsp;&#8211;&nbsp;architecture,&nbsp;five-stage pipeline,&nbsp;Vertex AI inference,&nbsp;FastAPI deployment on Cloud Run,&nbsp;and code-level detail.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<ol class=\"wp-block-list\">\n<li id=\"ref1\">Dean, J., &amp; Ghemawat, S. (2004). <a href=\"https:\/\/research.google.com\/archive\/mapreduce-osdi04.pdf\">MapReduce: Simplified data processing on large clusters<\/a>. <em>Proceedings of OSDI &#8217;04<\/em>.<\/li>\n\n\n\n<li id=\"ref2\">Ghemawat, S., Gobioff, H., &amp; Leung, S.-T. (2003). 
<a href=\"https:\/\/research.google.com\/archive\/gfs-sosp2003.pdf\">The Google File System<\/a>. <em>Proceedings of the 19th ACM SOSP<\/em>.<\/li>\n\n\n\n<li id=\"ref3\">IDC, <a href=\"https:\/\/techcontentwave.com\/files\/1763389937_64b594fd4f7f5607256a.pdf\">Worldwide Global DataSphere Forecast 2023-2028<\/a>.<\/li>\n\n\n\n<li id=\"ref4\">Splunk dark data survey, summarised in <a href=\"https:\/\/www.cogentinfo.com\/resources\/dark-data-unlocking-the-90-you-dont-use\">&#8220;Dark Data: Unlocking the ~90% You Don&#8217;t Use&#8221;<\/a> (2025).<\/li>\n\n\n\n<li id=\"ref5\">McKinsey &amp; Company, <a href=\"https:\/\/www.mckinsey.com.br\/capabilities\/tech-and-ai\/our-insights\/in-search-of-cloud-value-can-generative-ai-transform-cloud-roi\">&#8220;In search of cloud value: Can generative AI transform cloud ROI?&#8221;<\/a> (2023).<\/li>\n\n\n\n<li id=\"ref6\">Forbes, <a href=\"https:\/\/datafoundation.org\/news\/blogs\/759\/759-Forbes-From-Data-Hoarding-To-Data-Strategy-Building-AI-That-Actually-Works\">&#8220;From Data Hoarding To Data Strategy, Building AI That Actually Works&#8221;<\/a> (2025).<\/li>\n\n\n\n<li id=\"ref7\">Forrester, <a href=\"https:\/\/techstrong.ai\/articles\/forrester-research-releases-its-ai-data-and-analytics-2024-predictions\/\">AI, data, and analytics predictions<\/a> (2024).<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p><em>Disclaimer: This content was created with AI assistance. 
All research and conclusions are the work of the WPP AI Lab team.<\/em><br><\/p>\n","protected":false},"excerpt":{"rendered":"<p>We built an AI agent that maps and annotates 2,709 columns across 14 ad platforms in a matter of hours.\u00a0Such work used to take analysts weeks with spreadsheets.\u00a0Think of it as a genome sequencer for your data warehouse:\u00a0it reads every column,\u00a0classifies it into a canonical schema,\u00a0scores confidence,\u00a0and flags gaps.\u00a0The result?\u00a0A 58%\u00a0completeness baseline,\u00a0actionable enrichment recommendations,\u00a0and the end of\u00a0FINAL_mapping_v3_actually_final.xlsx.<\/p>\n","protected":false},"author":13,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"tags":[],"ppma_author":[{"id":13,"display_name":"Tam\u00e1s Luk\u00e1cs","first_name":"Tam\u00e1s","last_name":"Luk\u00e1cs","nickname":"tamas.lukacs","user_nicename":"tamas-lukacs","user_email":"tamas.lukacs@satalia.com","biographical_info":"Tamas is a Senior Data and AI Engineer at Satalia with a background spanning data engineering, cloud architecture, and AI\/ML systems. He builds production-grade platforms that take AI from prototype to production - from LLM-powered discovery agents and RAG-based chatbots to high-throughput embedding services. 
His current focus is on scalable data and AI solutions for enterprise advertising clients on Google Cloud.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/2021_mod-1.jpg","job_title":"Senior Data & AI Engineer","is_lead":null,"display_as_researcher":null,"order_priority":null}],"class_list":["post-357","post","type-post","status-publish","format-standard","hentry"],"acf":{"related_pods":[593],"featured":false},"authors":[{"term_id":35,"user_id":13,"is_guest":0,"slug":"tamas-lukacs","display_name":"Tam\u00e1s Luk\u00e1cs","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/2021_mod-1.jpg","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":"","biographical_info":"Tamas is a Senior Data and AI Engineer at Satalia with a background spanning data engineering, cloud architecture, and AI\/ML systems. He builds production-grade platforms that take AI from prototype to production - from LLM-powered discovery agents and RAG-based chatbots to high-throughput embedding services. 
His current focus is on scalable data and AI solutions for enterprise advertising clients on Google Cloud."}],"_links":{"self":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/357","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/users\/13"}],"replies":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=357"}],"version-history":[{"count":18,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/357\/revisions"}],"predecessor-version":[{"id":1278,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/357\/revisions\/1278"}],"wp:attachment":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=357"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=357"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=357"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}