Why your data genome may need a check-up – and how a Data Discovery Agent can help


We collected every sample. We sequenced nothing.

The story of big data is, at its core, a twenty-year experiment in accumulation. Google’s landmark papers on the Google File System (2003) and MapReduce (2004) showed the world how to store and process data at unprecedented scale [1] [2]. The Hadoop ecosystem followed. Then the cloud era – BigQuery, Redshift, Snowflake – made collection cheaper, faster, and infinitely more elastic. The message was unambiguous: store everything, figure it out later.

And store they did. IDC projects that global data creation will grow at a compound annual rate of roughly 25% through 2028 [3]. Companies rushed to stake claims on every click, every impression, every conversion, convinced the data itself was the gold. But gold in the ground is worthless without extraction, refining, and assay. Most enterprises skipped those steps. Industry research finds that up to 90% of generated data remains unused – a phenomenon analysts have dubbed “dark data” [4]. The storage invoice arrives monthly. The insight dividend has not been declared yet [5] [6].

Enter the paradigm shift. Large Language Models (LLMs) – GPT-3 (2020), ChatGPT (2022), Google’s Gemini (2023-2024) – represent the first technology that can read data at the semantic level [7], understanding meaning rather than merely executing queries. A data warehouse full of unanalysed tables is, in many respects, like an unsequenced genome. The information is all there – every column a base pair, every table a chromosome – but without annotation, it’s just a very expensive string of letters.

Enterprise data is at that same inflection point. The sequencer has arrived. The genome is finally being read.


2,709 base pairs. No annotation. Good Luck, Have Fun, Don’t Die.

Every ad-tech data team carries an invisible tax: the hours, the errors, and the opportunity cost of manually reconciling platform schemas that were never designed to talk to each other. This is not a one-time project either but a recurring levy. Every time a platform updates its API, every time a new data source is onboarded, every time someone asks “do we even have geo data from Pinterest?” – the tax collector comes knocking.

At the heart of our work sits a centralised advertising data warehouse that aggregates campaign performance, creative assets, audience signals, geographic breakdowns, and brand metadata from every major digital advertising platform our organisation operates on. It is, in effect, the single source of truth for understanding how creative content performs across the entire digital media landscape. It is also, as we were about to discover, a genome that had never been sequenced.

We operate across fourteen advertising platforms – spanning major social, search, programmatic, and measurement partners. Each has its own schema. Facebook calls advertising expenditure amount_spent. Google calls it cost_micros (and means it, in millionths of a currency unit). TikTok simply says spend. They all mean roughly the same thing, but to a database and to a human analyst trying to build a cross-platform report, they might as well be different species encoding the same protein with entirely different codons (/ˈkəʊdɒn/ – three-letter DNA sequences that specify amino acids; thanks to the genetic code’s redundancy, differently spelled codons can encode the same amino acid).
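To make the synonym problem concrete in code, here is a minimal sketch of cross-platform spend normalisation once the synonyms are known. The three column names come from the text above; the `normalise_spend` helper and its divisor table are illustrative assumptions, not our production code.

```python
# Map each platform's native spend column to a divisor that converts the
# raw value into whole currency units. Google reports spend in micros
# (millionths of a currency unit); the others report it directly.
SPEND_COLUMNS = {
    "facebook": ("amount_spent", 1),
    "google": ("cost_micros", 1_000_000),
    "tiktok": ("spend", 1),
}

def normalise_spend(platform: str, raw_value: float) -> float:
    """Convert a platform-native spend value into whole currency units."""
    _column, divisor = SPEND_COLUMNS[platform]
    return raw_value / divisor

# e.g. Google's 12_340_000 micros -> 12.34 currency units
```

Three platforms is the toy version; the real warehouse needs this kind of reconciliation for fourteen.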

Data integration platforms and transformation layers help – they abstract some of the raw API complexity and reshape the data before it lands in the warehouse. In theory, these layers converge towards clarity. In practice, they shift the problem rather than solve it. By the time data reaches the staging tables analysts actually query, the original platform semantics have been filtered through multiple translation layers – each managed by a different team, each with its own conventions. The result is not less complexity; it is distributed complexity.

Now scale the problem. Our Google BigQuery data warehouse spans 179 tables and 2,709 columns across fourteen platforms, holding over 15.8 billion records. And those columns need to be understood by meaning, not merely by name.

One downstream use case requires them to be mapped to a canonical schema organised around five modalities – high-level categories that describe what a piece of advertising data is about:

  • Performance – the numbers that tell you how a campaign did: spend, impressions, clicks, video views, conversions, leads.
  • Creative – what the ad actually looked like: creative IDs, asset paths, ad names, format types.
  • Audience – who saw it: gender, age group, interests, custom audiences, behavioural segments.
  • Geo – where they saw it: countries, regions, cities, postal codes, designated market areas.
  • Brand – who paid for it: brand name, advertiser identity.

Each modality breaks down further into 24 sub-modalities – the individual genes in our metaphor. That’s 2,709 potential annotations to make, verify, and maintain.
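As a rough illustration, the canonical taxonomy can be thought of as a small configuration object. The sub-modality names below are a partial, assumed rendering drawn from the examples above – the real schema defines 24 sub-modalities in total.

```python
# Illustrative (partial) canonical schema: five modalities, each broken
# into sub-modalities. Names are assumptions based on the examples in
# the text; the production taxonomy has 24 sub-modalities overall.
CANONICAL_SCHEMA = {
    "performance": ["spend_usd", "impressions", "clicks", "video_views", "conversions", "leads"],
    "creative": ["creative_id", "asset_path", "ad_name", "format_type"],
    "audience": ["gender", "age_group", "interests", "custom_audience", "behavioural_segment"],
    "geo": ["country", "region", "city", "postal_code", "dma"],
    "brand": ["brand_name", "advertiser_id"],
}
```

Every one of the warehouse's 2,709 columns has to be assigned to exactly one of these buckets – or flagged as unmappable.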

A skilled data analyst doing this manually needs to: open each table schema, read every column name and type, pull sample data to disambiguate, decide which canonical sub-modality it maps to, document the mapping, and then repeat thousands more times – praying that nothing changed since they started.

To make that concrete: imagine staring at a table with columns ad_id, date_start, spend, inline_link_clicks, cpc, frequency, reach, and account_currency. Is spend total or daily? Is cpc cost-per-click or cost-per-conversion? Does reach mean unique users or total impressions? You run a SELECT * LIMIT 10, squint at the numbers, cross-reference the API docs, and after twenty minutes you’ve mapped eight columns from one table. Only 2,701 to go – across thirteen more platforms.

Conservatively, that’s two to four weeks of focused work for a single pass. And the result starts decaying immediately. Platforms update schemas quarterly, sometimes monthly. By the time the spreadsheet is “done,” it’s already wrong.

If you’ve ever opened a file called FINAL_mapping_v3_actually_final.xlsx, this section is for you.

In genomic terms: it’s annotating a genome by hand, one base pair at a time, with no automated sequencer and no reference genome. The Human Genome Project proved you can do it that way. It just took thirteen years and $2.7 billion. We wanted something faster.


The sequencer has entered the lab

What used to take a team of data analysts two to four weeks of grinding manual work (opening schemas, pulling samples, cross-referencing documentation, writing mappings into a spreadsheet) now runs in hours. The Data Discovery Agent processes all fourteen platforms, all 179 tables, all 2,709 columns, and delivers a fully annotated mapping with confidence scores.

Figure 1 – Manual mapping workflow vs. Agent pipeline: weeks of spreadsheet wrangling compressed into a single automated run.

The Data Discovery Agent is not a chatbot, nor a single prompt you paste into ChatGPT and hope for the best. It is a multi-stage autonomous platform that orchestrates a complete end-to-end pipeline of discovery, sampling, reasoning, and reporting with minimal human intervention. You point it at a data warehouse, define your canonical schema, and it does the rest: connecting to BigQuery, fetching every table and column, extracting sample rows for grounding, crawling external documentation for cross-reference, building structured prompts, invoking LLM inference, parsing the results, scoring confidence, calculating completeness, and rendering the whole thing in an interactive dashboard. The human’s job shifts from doing the mapping to reviewing the mapping: approving high-confidence matches, investigating ambiguous ones, and making strategic decisions about data gaps.

So, what is an agent?

In the AI and machine learning community, an agent is formally defined as a system that:

  (1) perceives its environment,
  (2) reasons about what actions to take,
  (3) executes those actions autonomously, and
  (4) iterates towards a goal – often using tools and external data sources along the way.

Figure 2 – A simple visual of how an AI agent operates: Perceive → Reason → Act → Iterate.

By that definition, our Data Discovery Agent is genuinely an agent: it perceives the warehouse (schema fetching), reasons about semantics (LLM inference with structured prompts), takes action (mapping columns, scoring confidence, crawling documentation), and iterates (batch processing across platforms and modalities, second-pass enrichment recommendations). This is far more than a single LLM call wrapped in a REST endpoint. It’s an orchestrated pipeline of perception-reasoning-action loops.

Yes, the word “agent” has been stretched to meaninglessness in 2025-2026 – every API wrapper calls itself one. We use it deliberately, with the formal definition above, because our system genuinely exhibits autonomous multi-step goal-directed behaviour. If your “agent” is a prompt template with a for-loop, “script” might be a more honest label.


Inside the sequencer: how we taught an LLM to read a data warehouse

Teaching an LLM to read a data warehouse is not as simple as dumping a schema into a prompt. Column names are ambiguous. Data types are insufficient. Context is everything. Our pipeline gives the model maximum context at every step – schema structure, real sample values, external documentation, and a precisely defined target taxonomy – so that its annotations are grounded, not hallucinated. In genomic terms, this is the equivalent of handing a bioinformatician the raw sequence alongside the reference genome, the medical textbook, and a clear checklist of which genes to look for.

The following five stages take us from raw warehouse to annotated genome.

Stage 1 – Discovery

The agent connects to Google BigQuery and discovers all staging datasets – in our case, fourteen platforms matching a configurable naming pattern. For each dataset, it fetches the full table inventory: column names, data types, row counts. Intelligent filters exclude temporary artefacts (tables prefixed with stg_, suffixed with _tmp or _dbt_tmp) – the genomic equivalent of filtering out sequencing adapters before analysis.
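The artefact filter from this stage can be sketched as a simple predicate. The naming patterns (stg_ prefix, _tmp and _dbt_tmp suffixes) are the ones described above; the helper name and the sample table list are illustrative.

```python
def is_artefact(table_name: str) -> bool:
    """True for temporary tables the discovery stage should skip."""
    return (
        table_name.startswith("stg_")
        or table_name.endswith("_tmp")
        or table_name.endswith("_dbt_tmp")
    )

# Illustrative inventory: only the two "real" tables survive the filter.
tables = ["facebook_ads_insights", "stg_facebook_ads", "tiktok_spend_tmp", "google_campaigns"]
kept = [t for t in tables if not is_artefact(t)]
# kept -> ["facebook_ads_insights", "google_campaigns"]
```

In production this predicate would run over the table list returned by the BigQuery client rather than a hard-coded list.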

Stage 2 – Sampling

Column names alone are often ambiguous. The agent resolves this by extracting actual sample data rows from each table, using background-threaded parallel queries. These real values become critical evidence for the LLM: seeing values like 14.50, 0.83, and 127.99 in a cost column strongly suggests monetary spend, not an ID field. Column names are the labels on the test tubes; sample data is what’s actually inside them.
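A minimal sketch of the parallel sampling step, with a stub standing in for the real BigQuery client – the function names and the stub's return shape are assumptions, not the production code.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_samples(table: str, limit: int = 10) -> list[dict]:
    # Stand-in for: client.query(f"SELECT * FROM `{table}` LIMIT {limit}")
    return [{"table": table, "row": i} for i in range(limit)]

def sample_all(tables: list[str], workers: int = 8) -> dict[str, list[dict]]:
    """Sample every table concurrently; one worker thread per in-flight query."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(tables, pool.map(fetch_samples, tables)))

samples = sample_all(["facebook_ads", "tiktok_spend"])
```

Because the queries are I/O-bound network calls, threads (rather than processes) are enough to keep dozens of sample fetches in flight at once.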

Stage 3 – Documentation crawling

The agent crawls the official connector documentation for each platform – the authoritative “Most Used Fields” pages published by our data integration provider – and extracts a structured field list: name, description, dimension or metric. This gives the agent a reference genome to cross-reference against what actually exists in the warehouse: the platform’s own user manuals turned into a checklist of expected signals.

Stage 4 – LLM inference

This is the core intellectual step. The agent builds versioned prompts containing the table schemas, sample data, and the target modality definition. These are sent to Google’s Batch Inference Service (VertexAI). The model returns structured JSON: for each canonical sub-modality, it identifies matching columns, assigns a confidence score (0.0–1.0), and provides natural-language reasoning for each match. This is the sequencing run itself – the machine reading every base pair and producing an annotated readout with quality scores.
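The shape of the model's structured output, and the thresholding applied to it, might look like the sketch below. The exact JSON field names are assumptions for illustration, not the production response schema.

```python
import json

# Illustrative model output: per sub-modality, the matched columns, a
# confidence score in [0.0, 1.0], and natural-language reasoning.
raw = """
{
  "performance__spend_usd": {
    "columns": ["spend", "amount_spent"],
    "confidence": 0.94,
    "reasoning": "Sample values are small decimals consistent with currency."
  }
}
"""

CONFIDENCE_THRESHOLD = 0.7  # configurable; 0.7 is an assumed default

mappings = json.loads(raw)
accepted = {k: v for k, v in mappings.items() if v["confidence"] >= CONFIDENCE_THRESHOLD}
```

Anything below the threshold is routed to a human reviewer instead of being accepted automatically.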

Stage 5 – Enrichment and dashboard (the genome report)

A second-pass “Enrichment Recommender” inference cross-references the warehouse mappings from Stage 4 against the connector field lists from Stage 3. For every sub-modality where warehouse data is incomplete, it recommends which connector fields to enable – closing the loop from “what do we have?” to “what should we turn on?”
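At its core, that cross-reference is a set difference: the fields the connector documentation lists, minus what the warehouse already carries, gives the enrichment candidates. All field names below are illustrative.

```python
# Fields the connector documentation says this platform can deliver
# (Stage 3), versus columns already mapped in the warehouse (Stage 4).
connector_fields = {"country_code", "region", "city", "amount_spent"}
mapped_warehouse_columns = {"amount_spent"}

# Fields the connector offers but the warehouse doesn't yet carry:
# these become the enrichment recommendations.
recommended = sorted(connector_fields - mapped_warehouse_columns)
```

The real recommender wraps this difference in LLM reasoning (which sub-modality each missing field would fill, and why), but the candidate set starts here.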

All results render in a FastAPI + Jinja2/HTMX dashboard: the karyotype (/ˈkariə(ʊ)tʌɪp/ – the number and visual appearance of the chromosomes in the cell nuclei of an organism or species), the gene annotation browser, and the clinical recommendations in one interface.


The genome report: what 2,709 annotations revealed

When the sequencing run completes, you don’t get a wall of text. You get a genome report in the shape of a visual, interactive, filterable dashboard that tells you, at a glance, which parts of your data genome are fully annotated and which have gaps.

Figure 3 – Platform completeness scorecards: the karyotype of our data warehouse at a glance.

Each platform receives a “completeness” score: how much of the canonical schema is actually present. The approach is deliberately simple – a presence/absence assay. For each platform, the denominator is 24 (total sub-modalities). The numerator is how many have at least one column mapped above a configurable confidence threshold. A sub-modality is either present or it isn’t, like a gene that’s either expressed or silent. This also handles deduplication: if three tables each have a spend column mapped to performance__spend_usd, the sub-modality counts once, not three times. The formula cares about breadth of coverage, not depth of redundancy.
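The scoring rule described above fits in a few lines. A sketch, assuming the mappings have already been reduced to the best confidence seen per sub-modality (the helper name and the 0.7 default threshold are ours):

```python
def completeness(best_confidence: dict[str, float],
                 threshold: float = 0.7,
                 total_submodalities: int = 24) -> float:
    """Presence/absence score: fraction of sub-modalities with at least one
    column mapped above the threshold. Because the input is keyed by
    sub-modality, three spend columns mapping to the same target count once."""
    present = sum(1 for conf in best_confidence.values() if conf >= threshold)
    return present / total_submodalities

# A platform with 18 confidently mapped sub-modalities scores 18/24 = 0.75.
```

Breadth over depth: a platform with one solid column per sub-modality outscores one with ten redundant spend columns and nothing else.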

Figure 4 – The interactive dashboard: Submodality matching recommendation cards

Across all fourteen platforms, average mapping completeness stands at 58%. The top performer reached 75% (18/24 sub-modalities covered), while the platform with the greatest room for improvement sat at 25% (6/24). Four platforms cluster at 70%, several sit in the 50–66% range, and a handful fall below 50%.

Some findings were pleasant surprises – platforms assumed to lack audience data turned out to carry age range and gender columns that mapped cleanly. The data was there all along; it just hadn’t been annotated. Other findings were equally valuable: Performance metrics (spend, impressions, clicks) are well-represented across the board, but Audience, Geo, and Brand modalities show significant gaps, particularly in measurement and programmatic categories.

At 58%, the genome is more than half-sequenced – but meaningful blind spots remain. That’s a useful headline. It is unequivocally better to know your genome’s current state than to assume it’s healthy without running the test. The scorecard turns vague unease into actionable intelligence (“We’re missing geo data on most platforms, and here’s exactly which fields to enable”). It also delivers a reality check: the foundation is stronger than expected in some areas and weaker than assumed in others. But now we know exactly where to invest.

The Enrichment Recommender closes the diagnostic loop. Rather than simply flagging missing geo data, it prescribes a fix: “The connector for this platform offers a field called country_code (dimension); enable it, and your geo coverage improves.” Diagnosis and prescription in one step, and a natural bridge to the next phase of our work.


What we’re building now

The genome report is valuable on its own. But a sequencer that only reads and never acts is leaving half the value on the table. We are currently working on three initiatives that extend the agent from discovery into action: from reading the genome to editing it.

1. Productionalisation – from lab bench to clinical practice

Today the agent runs as an internal tool, triggered manually. We are moving it to production-grade: CI/CD pipelines, scheduled re-scans (daily, weekly, on-schema-change), and automated alerting when a platform’s schema mutates. The genomic equivalent of detecting a mutation before it causes problems downstream. A sequencer that runs itself.

2. Automatic warehouse verification – is the gene actually expressed?

The Enrichment Recommender tells you which fields exist in the documentation. But existence does not mean expression. A field can be listed, enabled in the connector, and still arrive empty because the advertiser never populated it.

We are building a verification layer that connects to the data integration platform’s API, pulls each connector’s field list, and checks the warehouse: Is this column actually populated? What percentage of rows have non-null values?

In DNA terms: it’s one thing to know a gene exists in the reference genome; it’s another to confirm it’s expressed in your organism. The practical outcome: instead of blind recommendations, the agent will say “Enable country_code – 87% of advertisers populate it” versus “dma_region has only 3% non-null rows, probably not worth the effort.”
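The population check itself is a one-line aggregate in BigQuery SQL. A sketch of the query builder, with illustrative table and column names – `COUNTIF` is a real BigQuery aggregate; the helper is our own:

```python
def non_null_rate_sql(table: str, column: str) -> str:
    """SQL returning the fraction of rows where `column` is populated."""
    return (
        f"SELECT COUNTIF({column} IS NOT NULL) / COUNT(*) AS non_null_rate "
        f"FROM `{table}`"
    )

# Illustrative usage with assumed project/dataset/table names.
sql = non_null_rate_sql("proj.ds.facebook_ads", "country_code")
```

Running this per recommended field turns “this field exists” into “this field is populated in X% of rows”, which is what makes the recommendation actionable.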

3. Text-to-SQL – chatting with the genome

Discovery answers “what data do we have?” The next question is always “what does the data say?” We are building a natural-language-to-SQL layer: ask “What was our total spend across social platforms last quarter?” and the agent translates it to a BigQuery query – using the column mappings it already knows – executes it, and returns a human-readable answer with the underlying SQL for transparency.
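The grounding idea can be sketched in a few lines: because the agent already holds the column mappings, a natural-language question only needs to resolve to canonical fields, which the mappings then translate into platform-specific SQL. Everything below – the mapping table, helper, and table names – is a simplified assumption, not the production implementation.

```python
# Hypothetical excerpt of the discovered mappings: canonical field -> the
# platform's real column name.
MAPPINGS = {
    "facebook": {"spend_usd": "amount_spent"},
    "tiktok": {"spend_usd": "spend"},
}

def spend_query(platform: str, table: str) -> str:
    """Translate the canonical spend_usd field into platform-specific SQL."""
    col = MAPPINGS[platform]["spend_usd"]
    return f"SELECT SUM({col}) AS total_spend FROM `{table}`"

query = spend_query("tiktok", "proj.ds.tiktok_spend")
```

This is why discovery comes first: without the mappings, a text-to-SQL layer would be guessing column names; with them, it is doing a lookup.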

This positions the platform as a full conversational data layer: discover the genome, annotate it, score completeness, recommend enrichment, and then interrogate it in natural language. From specimen collection to clinical consultation, without leaving the lab.


The genome is sequenced. The real work begins.

For two decades, the advertising industry collected data the way early geneticists collected specimens: enthusiastically, comprehensively, and with very little annotation. We built warehouses the size of genomes and staffed them with analysts armed with spreadsheets – the biological equivalent of a magnifying glass and a very long afternoon.

The Data Discovery Agent is what happens when you finally bring a sequencer into the lab. It reads every base pair, annotates every gene, scores every alignment, flags every gap, and delivers a genome report that turns “we have data” into “we understand our data.” And unlike the Human Genome Project, it does not take thirteen years. It takes hours.

The genome is sequenced. The annotations are in. The gaps are mapped. Now the real work begins: not the tedium of cataloguing, but the science of insight. And if the sequencer keeps getting smarter (spoiler: it will), the next chapter will be about more than reading the data. It will be about having a conversation with it.


Ready to look under the hood? See Data Discovery Agent – Technical Walkthrough for the full engineering deep-dive – architecture, five-stage pipeline, Vertex AI inference, FastAPI deployment on Cloud Run, and code-level detail.


  1. Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Proceedings of OSDI ’04.
  2. Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google File System. Proceedings of the 19th ACM SOSP.
  3. IDC, Worldwide Global DataSphere Forecast 2023-2028.
  4. Splunk dark data survey, summarised in “Dark Data: Unlocking the ~90% You Don’t Use” (2025).
  5. McKinsey & Company, “In search of cloud value: Can generative AI transform cloud ROI?” (2023).
  6. Forbes, “From Data Hoarding To Data Strategy, Building AI That Actually Works” (2025).
  7. Forrester, AI, data, and analytics predictions (2024).

Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.

Author

  • Tamas is a Senior Data and AI Engineer at Satalia with a background spanning data engineering, cloud architecture, and AI/ML systems. He builds production-grade platforms that take AI from prototype to production – from LLM-powered discovery agents and RAG-based chatbots to high-throughput embedding services. His current focus is on scalable data and AI solutions for enterprise advertising clients on Google Cloud.
