Author: Tamás Lukács

  • Data Discovery Agent Pod

    Turning raw ad platform schemas into actionable intelligence through AI-powered semantic mapping.

    TL;DR

    Up to 90% of enterprise data sits unused, and our BigQuery advertising warehouse was no exception: 14 platforms, 179 tables, 2,709 columns, 15.8 billion records, zero shared schema, and answering a question as simple as whether Pinterest carries geo data required a two-week manual investigation. Every platform invented its own vocabulary (Facebook says amount_spent, Google says cost_micros, TikTok says spend) and manual reconciliation cost analysts two to four weeks per pass, producing a spreadsheet that started decaying the moment it was finished. 

    We built the Data Discovery Agent, an AI-powered pipeline that connects to BigQuery, samples real data for grounding, crawls external connector documentation, and runs Gemini LLM inference with versioned prompts to autonomously annotate every column with confidence-scored mappings. The agent compressed weeks of manual work into hours, took us from zero structured understanding of our warehouse to a 58% average completeness (column presence per canonical sub-modality; signal strength and population validation planned for Q2) baseline across all 14 platforms, and replaced tribal-knowledge spreadsheets with a searchable, versioned genome report that turns “do we have that data?” from a research project into a dashboard lookup.

    If you don’t care about the technical details, read our blog post instead.


    Technical walkthrough

    Introduction

    Advertising data warehouses are, by nature, enormous and messy. Every major ad platform – Facebook, Google, TikTok, and a dozen others – exposes its own schema, its own naming conventions, and its own definition of what a “click” or a “spend” actually means. When these platforms feed into a centralised Google BigQuery data lake, the result is a sprawling collection of tables and columns that no single analyst can hold in their head. In our case: 14 advertising platforms, 179 tables, and 2,709 columns – with over 15.8 billion records behind them.

    Historically, making sense of this warehouse required manual schema annotation: a data analyst opening each table, reading every column name, pulling sample rows, cross-referencing API documentation, and deciding which canonical category each column belonged to. Conservatively, that process took two to four weeks of focused work for a single pass. And the result started decaying immediately – platforms update schemas quarterly, sometimes monthly.

    To eliminate this bottleneck, we architected and deployed the Data Discovery Agent: an AI-powered platform that autonomously discovers, maps, and scores the completeness of advertising data across an entire BigQuery data lake. The agent connects to BigQuery, fetches every table and column, extracts real sample data for grounding, crawls external documentation for cross-reference, invokes Vertex AI LLM inference with structured prompts, and delivers a fully annotated mapping with confidence scores – compressing weeks of manual work into hours.

    The system is built as a production-grade FastAPI web application, deployed on Google Cloud Run, with a complete admin control panel, role-based access via Google OAuth, and a storage abstraction layer that operates identically on local filesystems and Google Cloud Storage. This document provides a comprehensive technical walkthrough of its architecture, pipeline, deployment, and results.


    Agent architecture

    The Data Discovery Agent is structured as a five-stage sequential pipeline orchestrated by a FastAPI backend. Unlike a chatbot or a single-prompt wrapper, the agent executes an end-to-end workflow of discovery, sampling, documentation crawling, LLM inference, and enrichment reporting – with minimal human intervention. The human’s role shifts from doing the mapping to reviewing the mapping.

    The high-level data flow is illustrated below:

    Figure 1 – Data Discovery Agent pipeline: five stages from BigQuery data lake to executive dashboard.


    The architecture is organised around a clean separation of concerns: the application core follows a layered pattern with dedicated service modules.

    Service module inventory

    The backend is organised into focused service modules, each responsible for a distinct domain:

    | Service Module | Core Responsibility |
    | --- | --- |
    | BigQuery Service | Connects to the BigQuery data lake, discovers datasets and tables, fetches column schemas and row counts, and saves timestamped fetch snapshots. |
    | Sample Service | Extracts real sample rows from BigQuery tables using background-threaded parallel queries. Supports pause, resume, and abort controls. |
    | Adverity Service | Crawls official Adverity connector documentation pages using Playwright, extracting structured field lists (name, description, dimension/metric). |
    | Inference Service | Builds versioned LLM prompts, manages sync and batch inference pipelines via Vertex AI, and parses structured JSON responses into column-to-submodality mappings. |
    | BQ Filter Service | Manages regex-based include/exclude rules that control which BigQuery datasets and tables appear in a fetch. |
    | User Service | Handles user CRUD operations, domain allow-lists, email ban-lists, and role-based access control (admin/user). |
    | LLM Config Service | Loads and manages LLM model configurations, including model identifiers and API parameters. |
    | Vertex Model Service | Discovers available Vertex AI regions and publisher models for inference. |

    Key system capabilities

    1. Autonomous Schema Discovery: The agent connects to BigQuery, applies configurable regex filters, and discovers all advertising-platform datasets, tables, and column schemas without manual intervention.
    2. Evidence-Grounded LLM Inference: Every mapping decision is grounded in real data – column names, data types, and actual sample values – not just schema metadata. This dramatically reduces hallucination risk.
    3. Dual-Direction Mapping: The system performs both forward mapping (BigQuery columns to canonical sub-modalities) and reverse mapping (Adverity connector fields to sub-modalities), closing the loop from “what do we have?” to “what should we enable?”
    4. Completeness Scoring: Each platform receives a quantitative completeness score measuring how many of the 24 canonical sub-modalities are covered, turning vague data quality concerns into actionable metrics.
    5. Versioned Prompt Engineering: All LLM prompts are version-tracked with SHA-256 content hashing, ensuring that every inference result can be traced back to the exact prompt that produced it.
    6. Transparent Storage Abstraction: A unified storage layer allows the application to operate identically on local filesystems (development) and Google Cloud Storage (production) by changing a single environment variable.
    7. Production-Grade Access Control: Google OAuth 2.0 authentication with domain allow-lists, email ban-lists, and admin/user role separation ensures secure, auditable access.

    The modality schema: a canonical taxonomy

    At the heart of the agent’s mapping logic is a canonical modality schema – a structured taxonomy that defines what advertising data means, independent of any platform’s naming conventions. The schema is organised around five high-level modalities, each representing a fundamental category of advertising measurement:

    | Modality | Sub-modalities | What It Captures |
    | --- | --- | --- |
    | Performance | 7 | The numbers: spend, impressions, clicks, video plays, conversions, leads, CTR |
    | Creative | 4 | What the ad looked like: creative IDs, asset paths, ad names, format types |
    | Audience | 6 | Who saw it: gender, age group, interests, custom audiences, behavioural segments |
    | Geo | 5 | Where they saw it: countries, regions, cities, postal codes, designated market areas |
    | Brand | 2 | Who paid for it: brand name, advertiser identity |

    This yields 24 sub-modalities in total – the atomic units of the mapping. The LLM’s task is to determine, for each BigQuery column across all 179 tables, which sub-modality (if any) it corresponds to, and with what confidence.

    The modality dictionary is fully editable through the admin UI, meaning the taxonomy can be extended or refined without touching code. It is injected into every LLM prompt as structured context, ensuring consistent mapping behaviour across inference runs.
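A minimal sketch of what an editable modality dictionary like this can look like in code. The sub-modality names below are hypothetical examples for illustration (only performance__spend_usd appears in this write-up); the production dictionary lives in storage and is edited through the admin UI.

```python
# Illustrative taxonomy: 5 modalities flattened into 24 sub-modalities.
# Sub-modality names here are assumptions, not the production dictionary.
MODALITY_SCHEMA = {
    "performance": [
        "spend_usd", "impressions", "clicks", "video_plays",
        "conversions", "leads", "ctr",
    ],
    "creative": ["creative_id", "asset_path", "ad_name", "format_type"],
    "audience": [
        "gender", "age_group", "interests", "custom_audience",
        "behavioural_segment", "device_type",
    ],
    "geo": ["country", "region", "city", "postal_code", "dma"],
    "brand": ["brand_name", "advertiser_identity"],
}

def sub_modalities(schema: dict[str, list[str]]) -> list[str]:
    """Flatten the taxonomy into fully-qualified sub-modality keys."""
    return [f"{m}__{s}" for m, subs in schema.items() for s in subs]
```

Keeping the taxonomy as plain data rather than code is what makes it editable through the UI and injectable into every prompt as structured context.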


    Technology stack

    | Component | Technology | Role |
    | --- | --- | --- |
    | Web Framework | FastAPI | Application backend, API routing, middleware, and session management |
    | LLM Engine | Google Gemini (via Vertex AI) | Powers all schema-mapping and recommendation inference across the Gemini model family (including Gemini Flash for batch runs); chosen for low latency, strong instruction-following, and structured JSON output |
    | Data Warehouse | Google BigQuery | Primary data source; the agent queries dataset schemas, table metadata, and sample rows |
    | Batch Inference | Vertex AI Batch Prediction | Processes bulk LLM requests asynchronously via JSONL files staged in GCS; more cost-effective for full runs across all platforms |
    | Documentation Crawling | Playwright | Headless browser automation for crawling Adverity connector documentation pages and extracting structured field lists |
    | Frontend | Jinja2 + HTMX | Server-side rendered HTML templates with HTMX for interactive, partial-page updates without a JavaScript framework |
    | Authentication | Google OAuth 2.0 (via Authlib) | Secure user login with domain allow-lists, email ban-lists, and admin/user role separation |
    | Application Data | Google Cloud Storage | Persistent store for fetch results, inference outputs, platform configs, user records, and Adverity documentation |
    | Deployment | Google Cloud Run | Serverless container hosting with auto-scaling, health probes, and secret-backed environment variables |
    | Infrastructure | Terraform | Infrastructure-as-code: GCS buckets, Artifact Registry, Cloud Run service, IAM roles, Secret Manager secrets |
    | Containerisation | Docker (multi-stage build) | Reproducible builds with a builder stage for dependency installation and a slim runtime stage running as non-root |
    | Secret Management | Google Secret Manager | Stores OAuth credentials and session-signing keys, injected into Cloud Run as environment variables |
    | Language | Python 3.11 | All application code, managed via Poetry for dependency resolution |

    Pipeline deep-dive

    The following five stages take the system from a raw, unannotated data warehouse to a fully mapped, scored, and actionable genome report.

    Stage 1 – Discovery

    The agent connects to Google BigQuery and discovers all staging datasets matching a configurable naming pattern (e.g., ^bq_cgh_mp_.*_staging$). For each dataset, it fetches the full table inventory: column names, data types, and row counts. Intelligent regex filters – managed through the admin UI via bq_filters.json – exclude temporary artefacts such as tables prefixed with stg_, suffixed with _tmp or _dbt_tmp, and any datasets that don’t match the expected naming convention.

    Results are saved as a timestamped fetch snapshot (fetches/fetch_YYYYMMDD_HHMMSS/results.json). Multiple fetches can coexist, allowing admins to compare schema evolution over time. Each fetch captures the complete state of the warehouse at a point in time – the starting point for all downstream inference work.
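The filtering logic at the heart of Stage 1 can be sketched as follows. The dataset pattern is the one quoted above; the rule format is an assumption for illustration (the real rules live in bq_filters.json and are edited through the admin UI).

```python
import re

# Sketch of the Stage 1 include/exclude filtering described above.
# Which datasets are fetched: must match the staging naming convention.
DATASET_INCLUDE = re.compile(r"^bq_cgh_mp_.*_staging$")
# Which tables are dropped: temporary dbt/staging artefacts.
TABLE_EXCLUDE = [re.compile(p) for p in (r"^stg_", r"_tmp$", r"_dbt_tmp$")]

def keep_dataset(dataset_id: str) -> bool:
    return bool(DATASET_INCLUDE.match(dataset_id))

def keep_table(table_id: str) -> bool:
    return not any(p.search(table_id) for p in TABLE_EXCLUDE)
```

For example, keep_dataset accepts bq_cgh_mp_facebook_staging but rejects an unrelated analytics dataset, and keep_table drops stg_-prefixed and _tmp-suffixed tables before they ever reach the snapshot.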

    Current warehouse dimensions:

    | Metric | Count |
    | --- | --- |
    | Advertising Platforms | 14 |
    | BigQuery Datasets | 17 |
    | Tables | 179 |
    | Total Columns | 2,709 |
    | Total Records | 15.8 billion+ |

    Platforms covered

    | # | Platform | # | Platform |
    | --- | --- | --- | --- |
    | 1 | Amazon DSP | 8 | Pinterest |
    | 2 | Facebook Ads | 9 | Snapchat |
    | 3 | Google Ads | 10 | TikTok |
    | 4 | Google Ads (YouTube) | 11 | The Trade Desk |
    | 5 | Google DV360 | 12 | Twitter/X Ads |
    | 6 | Google DV360 (YouTube) | 13 | Xandr |
    | 7 | LinkedIn | 14 | IAS |

    Stage 2 – Sampling

    Column names alone are often ambiguous. A column called cost could mean total spend, cost-per-click, or an internal ID. The agent resolves this ambiguity by extracting actual sample data rows from each table using background-threaded parallel queries against BigQuery.

    These real values become critical evidence for the LLM: seeing values like 14.50, 0.83, and 127.99 in a cost column strongly suggests monetary spend, not an identifier. Seeing M, F, and Unknown in a gender column confirms an audience dimension more reliably than the column name alone.

    The extraction process supports pause, resume, and abort controls – essential when sampling across hundreds of tables with billions of rows. Progress is tracked per-platform and persisted to storage, so an interrupted extraction can resume exactly where it left off. Samples are saved as samples_<platform>.json per fetch and are injected into the LLM prompt alongside the schema metadata.
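A minimal sketch of how background-threaded sampling with pause/resume/abort controls can be wired up, in the spirit of the Sample Service described above. The class and method names are assumptions; run_query stands in for the real BigQuery sample query.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class SampleRunner:
    """Sketch: parallel table sampling with pause/resume/abort controls."""

    def __init__(self, run_query, max_workers: int = 8):
        self.run_query = run_query          # stand-in for the BigQuery call
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self._resume = threading.Event()
        self._resume.set()                  # start in the running state
        self._abort = threading.Event()
        self.results: dict[str, list] = {}

    def pause(self):
        self._resume.clear()

    def resume(self):
        self._resume.set()

    def abort(self):
        self._abort.set()
        self._resume.set()                  # unblock paused workers so they exit

    def _sample_one(self, table: str):
        self._resume.wait()                 # block here while paused
        if self._abort.is_set():
            return
        self.results[table] = self.run_query(table)

    def sample(self, tables: list[str]) -> dict[str, list]:
        list(self.pool.map(self._sample_one, tables))
        return self.results
```

In the production service the results dict would additionally be persisted per platform after each table, which is what allows an interrupted extraction to resume where it left off.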

    Stage 3 – Documentation crawling

    The agent crawls the official connector documentation for each platform – specifically, the authoritative “Most Used Fields” pages published by Adverity, the data integration platform that feeds data into the BigQuery warehouse. Using Playwright for headless browser automation, the crawler fetches each documentation page and extracts a structured field list: name, description, and whether the field is a dimension or metric.

    Each crawled page is saved in three formats – HTML, Markdown, and structured JSON – under adverity_docs/ in the storage backend. These documents serve as the agent’s “reference manual”: a ground-truth checklist of what fields each platform could provide, independent of what currently exists in the warehouse.

    The crawl engine runs as a background task with progress tracking, with a configurable delay between requests to avoid rate-limiting. Platform URLs are managed through the admin UI, making it straightforward to add new data sources.
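The extraction step that runs after Playwright has fetched a page can be sketched as below. Here the page is assumed to have already been converted to a Markdown table with a name/description/type layout; that column layout is an illustration, not the actual Adverity page structure.

```python
import re

# Sketch: pull (name, description, dimension/metric) triples out of a
# Markdown field table produced from a crawled documentation page.
FIELD_ROW = re.compile(
    r"^\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*(dimension|metric)\s*\|$",
    re.IGNORECASE,
)

def extract_fields(markdown: str) -> list[dict]:
    fields = []
    for line in markdown.splitlines():
        m = FIELD_ROW.match(line.strip())
        if m:  # header and separator rows don't match and are skipped
            name, description, kind = m.groups()
            fields.append(
                {"name": name, "description": description, "type": kind.lower()}
            )
    return fields
```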

    Stage 4 – LLM inference

    This is the core intellectual step – where the agent reads every column and produces an annotated mapping. The Inference Service builds versioned prompts containing three layers of context:

    1. Table schemas – column names and data types for every table in the platform’s dataset.
    2. Sample data – real row values extracted in Stage 2, providing grounding evidence.
    3. Modality definitions – the target taxonomy of modalities and sub-modalities the LLM must map to.

    These structured prompts are sent to models from the Google Gemini family via Vertex AI. The model returns structured JSON: for each canonical sub-modality, it identifies matching columns, assigns a confidence score (0.0 to 1.0), and provides natural-language reasoning for each match.
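The shape of such a response, and the kind of validation applied before it is accepted, can be illustrated as follows. The exact key names are assumptions for illustration; the source only specifies that each match carries columns, a 0.0-1.0 confidence score, and reasoning.

```python
# Illustrative per-platform mapping result: for each sub-modality, the
# matched columns with confidence and reasoning. Key names are assumptions.
example_response = {
    "performance__spend_usd": [
        {"table": "ads_daily", "column": "amount_spent",
         "confidence": 0.95, "reasoning": "Monetary values in sample data"},
    ],
    "audience__gender": [
        {"table": "demo_breakdown", "column": "gender",
         "confidence": 0.9, "reasoning": "Samples contain M/F/Unknown"},
    ],
}

def validate_mapping(resp: dict) -> list[str]:
    """Return a list of problems; an empty list means well-formed."""
    errors = []
    for sub, matches in resp.items():
        for m in matches:
            if not {"table", "column", "confidence", "reasoning"} <= m.keys():
                errors.append(f"{sub}: missing keys in {m}")
            elif not 0.0 <= m["confidence"] <= 1.0:
                errors.append(f"{sub}: confidence out of range")
    return errors
```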

    Inference modes

    Two inference modes are supported, selectable through the admin UI:

    | Mode | Mechanism | Best For |
    | --- | --- | --- |
    | Sync | One LLM request per platform, processed sequentially via POST /api/inference/sync | Quick testing, single-platform runs |
    | Batch | Prompts written to JSONL files in GCS, submitted as Vertex AI batch prediction jobs, polled for completion | Full production runs across all 14 platforms; significantly more cost-effective |

    For batch mode, the workflow is: write prompts to JSONL, submit batch jobs, poll status until completion, collect and parse results. The UI provides real-time progress tracking for all active jobs.
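The first step of that workflow, staging prompts as JSONL, can be sketched as below. The request envelope shown follows the general Vertex AI Gemini batch-prediction shape but should be treated as an assumption; check the current Vertex documentation for the exact schema.

```python
import json

# Sketch: one JSONL line per platform prompt, ready to be written to GCS
# and submitted as a Vertex AI batch prediction job. The envelope schema
# is an assumption for illustration.
def build_jsonl(prompts: dict[str, str]) -> str:
    lines = [
        json.dumps({
            "request": {
                "contents": [{"role": "user", "parts": [{"text": prompt}]}],
            }
        })
        for prompt in prompts.values()
    ]
    # In this sketch, results are matched back to platforms by line order.
    return "\n".join(lines)
```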

    Prompt engineering

    All prompt templates live in app/prompts/ and are version-tracked with semantic versioning and SHA-256 content hashing via versions.json. This means every inference result records exactly which prompt version and content hash produced it – critical for reproducibility.
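The fingerprinting scheme is simple enough to sketch in a few lines. The field names recorded in versions.json are assumptions here; the mechanism (semantic version plus SHA-256 of the template content, stored with every run) is as described above.

```python
import hashlib
import json

def prompt_fingerprint(template_text: str, version: str) -> dict:
    """Semantic version plus SHA-256 content hash of a prompt template."""
    digest = hashlib.sha256(template_text.encode("utf-8")).hexdigest()
    return {"version": version, "sha256": digest}

def record_run(template_text: str, version: str, model: str) -> str:
    """Metadata stored alongside each inference result (field names assumed)."""
    meta = prompt_fingerprint(template_text, version)
    meta["model"] = model
    return json.dumps(meta)
```

Because the hash is derived from content, editing a template without bumping its version still produces a different fingerprint, so no inference result can silently claim to come from a prompt it didn't.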

    The base modality mapping prompt instructs the LLM to:

    • Examine every column name and data type across all provided tables.
    • Use sample data to confirm or reject potential matches.
    • Assign confidence scores on a defined scale (0.9-1.0 for clear matches, 0.5-0.69 for ambiguous, below 0.5 for weak).
    • Return only valid JSON – no markdown, no prose.

    A dedicated Adverity recommendation prompt performs the reverse mapping: given Adverity connector documentation, recommend which Adverity fields map to which sub-modalities. This prompt enforces a critical rule – field names must appear verbatim in the documentation, preventing the LLM from inventing fields that don’t exist.

    Robust JSON parsing

    LLM outputs are rarely perfectly formatted. The Inference Service includes a robust JSON parser (_robust_parse_llm_json) that handles common LLM formatting mistakes: markdown code fences, trailing commas, single-quoted strings, BOM characters, leading/trailing prose around the JSON object, and escaped newlines inside strings. This defensive parsing layer ensures that minor formatting issues never cause a pipeline failure.
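A simplified sketch of this kind of defensive parsing is shown below. It covers fences, surrounding prose, BOM, and trailing commas; the production _robust_parse_llm_json handles more cases (single-quoted strings, escaped newlines) than this illustration does.

```python
import json
import re

def robust_parse(raw: str) -> dict:
    """Simplified defensive parse of LLM output into a JSON object."""
    text = raw.lstrip("\ufeff").strip()                   # drop BOM + whitespace
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)  # strip markdown fences
    start, end = text.find("{"), text.rfind("}")          # trim surrounding prose
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in LLM output")
    text = text[start:end + 1]
    # Remove trailing commas before } or ] (can touch string contents in
    # pathological cases -- acceptable for a sketch).
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)
```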

    Three inference pipelines

    The system supports three distinct inference types:

    1. Modality Mapping Inference – Maps BigQuery columns to the canonical sub-modality taxonomy. The primary pipeline.
    2. Adverity Recommendation Inference – Given Adverity documentation, recommends which connector fields should be enabled for each sub-modality.
    3. Field Extraction Inference – Extracts structured field definitions from raw Adverity HTML/Markdown documentation using chunked batch inference, producing clean per-platform field maps.

    Each run records metadata: timestamp, model used, prompt version, prompt content hash, and platforms processed.

    Stage 5 – Enrichment and dashboard

    The final stage closes the diagnostic loop by cross-referencing the warehouse mappings from Stage 4 against the connector field lists from Stage 3.

    Enrichment recommendations

    The Adverity Recommendation inference goes beyond simply flagging missing geo data. It prescribes a fix: “The connector for this platform offers a field called country_code (dimension); enable it, and your geo coverage improves from 2/5 to 3/5 sub-modalities.” Diagnosis and prescription in one step, turning gap analysis into concrete action items for the data engineering team.

    Completeness scoring

    Each platform receives a completeness score: how much of the canonical 24-sub-modality schema is actually present. The approach is deliberately simple – a presence/absence assay. For each platform, the denominator is 24 (total sub-modalities). The numerator is how many have at least one column mapped above a configurable confidence threshold. A sub-modality is either present or it isn’t – this handles deduplication naturally. If three tables each have a spend column mapped to performance__spend_usd, the sub-modality counts once, not three times.
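The scoring rule above amounts to a few lines of code. This is a sketch of the described presence/absence assay, assuming the mapping structure is a dict of sub-modality to matched columns with confidence scores:

```python
TOTAL_SUB_MODALITIES = 24

def completeness(mappings: dict, threshold: float = 0.5) -> tuple[int, float]:
    """Count sub-modalities with at least one match above the threshold."""
    present = {
        sub for sub, matches in mappings.items()
        if any(m["confidence"] >= threshold for m in matches)
    }  # a set: three spend columns still count the sub-modality once
    covered = len(present)
    return covered, covered / TOTAL_SUB_MODALITIES
```

A platform with matches for one sub-modality above the threshold scores 1/24 regardless of how many columns map to it, which is exactly the deduplication behaviour described above.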

    The executive dashboard

    All results render in a FastAPI + Jinja2/HTMX dashboard that provides:

    • Platform scorecards – one card per advertising platform showing its name, banner image, and completeness percentage.
    • Modality drilldowns – for each platform, expandable breakdowns into modalities and their sub-modalities.
    • Column mappings – the specific BigQuery columns mapped to each sub-modality, with confidence scores and LLM reasoning.
    • Sample data preview – actual row data from each table for manual verification.
    • Executive summary sidebar – aggregated statistics across all platforms: total mappings, average completeness, record counts, highest and lowest performing platforms.
    • Interactive mapping visualiser – a left-right mapping view, filterable by platform, modality, and confidence threshold.

    Visibility of platforms is controlled via Dashboard Settings, where admins choose which fetch and which platforms appear to end users.

    PDF executive report generation

    The dashboard includes a one-click PDF executive report generator for stakeholders who need a polished, offline-readable document. When triggered, the system:

    1. Aggregates all dashboard data – platform stats, completeness scores, modality breakdowns, and mapped column details – into a single render context.
    2. Renders a dedicated Jinja2 report template (exec_report.html) – a professionally styled, multi-page A4 document with a cover page, executive overview, key metric summary cards, a platform comparison table, the full modality dictionary with descriptions, and per-platform detail sections showing completeness stats, modality coverage grids (with check/miss indicators per sub-modality), and optionally the full column mapping list.
    3. Converts the HTML to PDF using Playwright's headless Chromium instance (page.pdf(format="A4", print_background=True)), producing a pixel-perfect render with page breaks, headers, and footers showing the generation timestamp and page numbers.
    4. Returns the PDF as a downloadable file (e.g., exec_report_20260327_112709.pdf).

    The report is designed for executive consumption: the first page presents headline metrics (total platforms, tables, columns, records, average completeness) and identifies the best and worst performers. Subsequent pages provide the modality dictionary for reference, followed by detailed per-platform breakdowns with modality coverage grids showing exactly which sub-modalities are mapped (with column counts) and which remain gaps. This gives leadership a complete, self-contained snapshot of data integration maturity without requiring dashboard access.


    Cloud deployment

    The system is deployed as a production-grade, cloud-native service on Google Cloud, following a containerised, infrastructure-as-code workflow from local development through to automated deployment.

    Infrastructure-as-Code (Terraform)

    All infrastructure is defined as Terraform in deployment/terraform/ with six modules:

    | Module | Resources Created |
    | --- | --- |
    | apis | Enables 10 GCP APIs (Cloud Run, Artifact Registry, Storage, BigQuery, Vertex AI, Secret Manager, IAM, Cloud Build, etc.) |
    | artifact_registry | Docker image repository with a keep-last-5 cleanup policy |
    | gcs | Versioned GCS bucket for application data (platforms, fetches, users, banners) with lifecycle rules |
    | iam | Service account with roles: Storage Object Admin, BigQuery Data Viewer, BigQuery Job User, Vertex AI User, Secret Accessor |
    | secret_manager | Three secrets: SECRET_KEY, GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET |
    | cloud_run | Cloud Run v2 service with health probes, auto-scaling (0 to 3 instances), and secret-backed environment variables |

    Containerisation

    The application is packaged as a Docker image using a multi-stage build:

    1. Builder stage – Installs Python dependencies via Poetry, exports to requirements.txt, and installs packages into a clean prefix.
    2. Runtime stage – Copies only the installed packages and application code into a slim Python 3.11 image. Runs as a non-root appuser for security. Seed data (platform configs, modality dictionaries, Adverity docs) is baked into the image; runtime data (fetches, users, inference results) lives in the GCS bucket.

    Deployment pipeline

    A full deployment is executed with a single command:

    make deploy-all    # tf-init -> tf-apply -> docker-push -> seed-data

    This runs Terraform to provision/update all infrastructure, builds and pushes the Docker image to Artifact Registry, and syncs the local data/ directory to the GCS bucket. Incremental deployments (code-only changes) use make deploy which skips the Terraform init step.

    The seed-data target uses gsutil -m rsync to upload platform definitions, modality dictionaries, BQ filters, Adverity documentation, and other configuration data into the GCS bucket. Without seeding, a fresh deployment starts with an empty bucket and no platform or user configuration.

    Storage abstraction layer

    A critical architectural decision was the storage abstraction layer (app/core/storage.py). All data I/O goes through a unified StorageBackend interface with two implementations:

    • LocalStorage – reads/writes to the local data/ directory. Used during development (STORAGE_BACKEND=local).
    • GCSStorage – reads/writes to a GCS bucket under a configurable prefix. Used in production (STORAGE_BACKEND=gcs).

    Switching between backends requires changing only the STORAGE_BACKEND environment variable. Every service module calls get_storage() and operates through the same API regardless of the underlying storage – ensuring that code tested locally behaves identically in production.
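A minimal sketch of that abstraction, assuming a small read/write interface (the real app/core/storage.py is richer, and GCSStorage is stubbed here because it would need the google-cloud-storage client):

```python
import os
from abc import ABC, abstractmethod
from pathlib import Path

class StorageBackend(ABC):
    @abstractmethod
    def read_text(self, key: str) -> str: ...

    @abstractmethod
    def write_text(self, key: str, content: str) -> None: ...

class LocalStorage(StorageBackend):
    """Development backend: reads/writes under a local data/ directory."""

    def __init__(self, root: str = "data"):
        self.root = Path(root)

    def read_text(self, key: str) -> str:
        return (self.root / key).read_text()

    def write_text(self, key: str, content: str) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)

def get_storage() -> StorageBackend:
    # One environment variable flips every service module between backends.
    if os.environ.get("STORAGE_BACKEND", "local") == "gcs":
        raise NotImplementedError("GCSStorage sketch omitted here")
    return LocalStorage()
```

Service modules only ever call get_storage() and the interface methods, so code exercised against LocalStorage in development behaves identically against the GCS backend in production.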

    Authentication and access control

    The application uses Google OAuth 2.0 (OpenID Connect) for user authentication:

    • Users log in with their Google account via the standard OAuth consent flow.
    • Domain allow-lists restrict login to specific email domains (e.g., satalia.com, choreograph.com).
    • Email ban-lists allow blocking specific users.
    • Role-based access separates admin users (full control panel) from regular users (executive dashboard only).
    • A bypass mode (BYPASS_AUTH_AS_ADMIN=true) enables local development without Google credentials.

    All user and auth settings are managed through the admin UI and persisted via the storage backend.


    Results and impact

    When the pipeline completes a full run across all fourteen platforms, the executive dashboard delivers a comprehensive genome report of the data warehouse. The following results are drawn from the live executive summary generated by the dashboard.

    Current warehouse state

    | Metric | Value |
    | --- | --- |
    | Platforms Monitored | 14 |
    | Tables | 179 |
    | Columns | 2,709 |
    | Total Records | 15.8B |
    | Average Completeness | 58% |
    | Highest Completeness | Snapchat – 75% (18/24) |
    | Lowest Completeness | Integral Ad Science – 25% (6/24) |

    Per-platform breakdown

    | Platform | Completeness | Tables | Columns | Records |
    | --- | --- | --- | --- | --- |
    | Facebook Ads | 66% (16/24) | 15 | 364 | 3,147.7M |
    | Google Ads | 70% (17/24) | 21 | 399 | 2,284.6M |
    | TikTok | 62% (15/24) | 14 | 217 | 15.1M |
    | LinkedIn | 70% (17/24) | 12 | 143 | 366.4k |
    | Amazon DSP | 62% (15/24) | 18 | 213 | 18.2M |
    | Integral Ad Science | 25% (6/24) | 12 | 130 | 95.9M |
    | Xandr | 50% (12/24) | 11 | 156 | 1.2M |
    | Google Ads (YouTube) | 45% (11/24) | 9 | 112 | 10.9M |
    | Display & Video 360 | 70% (17/24) | 16 | 275 | 10,148.3M |
    | Display & Video 360 (YouTube) | 54% (13/24) | 15 | 187 | 63.6M |
    | Pinterest | 70% (17/24) | 11 | 170 | 6.7M |
    | Snapchat | 75% (18/24) | 10 | 150 | 15M |
    | The Trade Desk | 45% (11/24) | 6 | 80 | 4.1M |
    | Twitter Ads | 50% (12/24) | 9 | 113 | 40.4k |

    Headline findings

    Four platforms – Google Ads, LinkedIn, Display & Video 360, and Pinterest – cluster at 70% completeness (17/24 sub-modalities). Snapchat leads the pack at 75%. Several platforms sit in the 50-66% range, and a handful fall below 50%.

    Performance metrics (spend, impressions, clicks) are well-represented across the board – these are the columns every platform exposes and every data team queries first. But Audience, Geo, and Brand modalities show significant gaps, particularly in measurement and programmatic categories. Some findings were genuine surprises: platforms assumed to lack audience data turned out to carry age range and gender columns that mapped cleanly. The data was there all along – it just hadn’t been annotated.

    At 58%, the warehouse is more than half-mapped – but meaningful blind spots remain. That’s a useful headline. It is unequivocally better to know the current state than to assume it’s healthy without running the test.

    Operational impact

    | Before (Manual) | After (Agent) |
    | --- | --- |
    | 2-4 weeks per full mapping pass | Hours for a complete run |
    | Mapping decays immediately as schemas change | Re-run on demand; schema drift detected automatically |
    | Tribal knowledge locked in spreadsheets | Structured, versioned, searchable mappings with confidence scores |
    | Answering “do we have geo data from Pinterest?” required manual investigation | Dashboard lookup: instant answer with completeness score |
    | Enrichment recommendations require manual cross-referencing | Automated prescriptions: “Enable country_code on this connector” |

    The shift is from doing the mapping to reviewing the mapping – a fundamentally different use of analyst time.


    Admin workflow

    The typical end-to-end admin workflow from raw data to a published dashboard follows this sequence:

    Figure 2 – End-to-end admin workflow: eleven steps from platform configuration to a live dashboard, with a feedback loop for prompt iteration.

    Each step is performed through the web-based admin UI. The pipeline is intentionally manual at the trigger level – an admin decides when to run each stage – while the execution of each stage is fully automated. This design provides human oversight at decision points while eliminating manual drudgery within each step.


    Conclusion

    This technical walkthrough has presented the Data Discovery Agent: a production-grade, AI-powered platform that transforms the laborious process of advertising data warehouse annotation from weeks of manual spreadsheet work into an automated, repeatable, and auditable pipeline.

    The agent’s five-stage architecture – Discovery, Sampling, Documentation Crawling, LLM Inference, and Enrichment – systematically builds context at each step so that the LLM’s mapping decisions are grounded in real evidence rather than speculation. The result is a fully annotated data warehouse with confidence-scored mappings, quantitative completeness metrics, and actionable enrichment recommendations.

    Lessons learned

    • Grounding is Everything: Providing the LLM with real sample data alongside schema metadata was the single most impactful design decision. Column names alone are ambiguous; actual values resolve that ambiguity decisively.
    • Dual-Direction Mapping Closes the Loop: Mapping warehouse columns forward (what do we have?) and connector fields backward (what could we enable?) transforms the output from a passive inventory into an active roadmap.
    • Storage Abstraction Pays for Itself: The LocalStorage / GCSStorage abstraction – a seemingly minor architectural decision – eliminated an entire class of development-vs-production bugs and made the system genuinely portable from day one.
    • Versioned Prompts Enable Iteration: Content-hashing every prompt template and recording the hash alongside inference results made prompt engineering a disciplined, reproducible process rather than an ad-hoc exercise.
    • Build on a Unified Cloud Ecosystem: Building entirely on Google Cloud services – BigQuery, Vertex AI, Cloud Run, Cloud Storage, Secret Manager – eliminated integration friction between components and allowed the project to move from prototype to production deployment without stitching together tools from multiple vendors.

    What we are building next

    1. Scheduled Re-Scans: Automated daily or weekly re-runs with alerting when a platform’s schema mutates – detecting drift before it causes downstream problems.
    2. Automatic Warehouse Verification: A verification layer that checks whether recommended Adverity fields are actually populated with non-null values, distinguishing between “this field exists” and “this field contains useful data.”
    3. Text-to-SQL: A natural-language-to-SQL layer that uses the column mappings to answer ad-hoc questions against the warehouse – turning the annotated genome into a conversational data interface.

  • Why your data genome may need a check-up – and how a Data Discovery Agent can help


    We collected every sample. We sequenced nothing.

    The story of big data is, at its core, a twenty-year experiment in accumulation. Google’s landmark papers on the Google File System (2003) and MapReduce (2004) showed the world how to store and process data at unprecedented scale [1] [2]. The Hadoop ecosystem followed. Then the cloud era – BigQuery, Redshift, Snowflake – made collection cheaper, faster, and infinitely more elastic. The message was unambiguous: store everything, figure it out later.

    And store they did. IDC projects that global data creation will grow at roughly 25% compound annual rate through 2028 [3]. Companies rushed to stake claims on every click, every impression, every conversion, convinced the data itself was the gold. But gold in the ground is worthless without extraction, refining, and assay. Most enterprises skipped those steps. Industry research finds that up to 90% of generated data remains unused, a phenomenon analysts have coined “dark data” [4]. The storage invoice arrives monthly. The insight dividend has not been declared yet [5] [6].

    Enter the paradigm shift. Large Language Models (LLMs) – GPT-3 (2020), ChatGPT (2022), Google’s Gemini (2023-2024) – represent the first technology that can read data at the semantic level [7], understanding meaning rather than merely executing queries. A data warehouse full of unanalysed tables is, in many respects, like an unsequenced genome. The information is all there – every column a base pair, every table a chromosome – but without annotation, it’s just a very expensive string of letters.

    Enterprise data is at that same inflection point. The sequencer has arrived. The genome is finally being read.


    2,709 base pairs. No annotation. Good Luck, Have Fun, Don’t Die.

    Every ad-tech data team carries an invisible tax: the hours, the errors, and the opportunity cost of manually reconciling platform schemas that were never designed to talk to each other. This is not a one-time project either but a recurring levy. Every time a platform updates its API, every time a new data source is onboarded, every time someone asks “do we even have geo data from Pinterest?” – the tax collector comes knocking.

    At the heart of our work sits a centralised advertising data warehouse that aggregates campaign performance, creative assets, audience signals, geographic breakdowns, and brand metadata from every major digital advertising platform our organisation operates on. It is, in effect, the single source of truth for understanding how creative content performs across the entire digital media landscape. It is also, as we were about to discover, a genome that had never been sequenced.

    We operate across fourteen advertising platforms – spanning major social, search, programmatic, and measurement partners. Each has its own schema. Facebook calls advertising expenditure amount_spent. Google calls it cost_micros (and means it, in millionths of a currency unit). TikTok simply says spend. They all mean roughly the same thing, but to a database and to a human analyst trying to build a cross-platform report, they might as well be different species encoding the same protein with entirely different codons (/ˈkəʊdɒn/ – three-letter DNA sequences that each specify the same amino acid, just spelled differently by each organism).

    Data integration platforms and transformation layers help – they abstract some of the raw API complexity and reshape the data before it lands in the warehouse. In theory, these layers converge towards clarity. In practice, they shift the problem rather than solve it. By the time data reaches the staging tables analysts actually query, the original platform semantics have been filtered through multiple translation layers – each managed by a different team, each with its own conventions. The result is not less complexity; it is distributed complexity.

    Now scale the problem. Our Google BigQuery data warehouse spans 179 tables and 2,709 columns across fourteen platforms, holding over 15.8 billion records. And those columns need to be understood by meaning, not merely by name.

    One downstream use case requires them to be mapped to a canonical schema organised around five modalities – high-level categories that describe what a piece of advertising data is about:

    • Performance – the numbers that tell you how a campaign did: spend, impressions, clicks, video views, conversions, leads.
    • Creative – what the ad actually looked like: creative IDs, asset paths, ad names, format types.
    • Audience – who saw it: gender, age group, interests, custom audiences, behavioural segments.
    • Geo – where they saw it: countries, regions, cities, postal codes, designated market areas.
    • Brand – who paid for it: brand name, advertiser identity.

    Each modality breaks down further into 24 sub-modalities – the individual genes in our metaphor. That’s 2,709 potential annotations to make, verify, and maintain.

    A skilled data analyst doing this manually needs to: open each table schema, read every column name and type, pull sample data to disambiguate, decide which canonical sub-modality it maps to, document the mapping, and then repeat thousands more times – praying that nothing changed since they started.

    To make that concrete: imagine staring at a table with columns ad_id, date_start, spend, inline_link_clicks, cpc, frequency, reach, account_currency. Is spend total or daily? Is cpc cost-per-click or cost-per-conversion? Does reach mean unique users or total impressions? You run a SELECT * LIMIT 10, squint at the numbers, cross-reference the API docs, and after twenty minutes you’ve mapped eight columns from one table. Only 2,701 to go – across thirteen more platforms.

    Conservatively, that’s two to four weeks of focused work for a single pass. And the result starts decaying immediately. Platforms update schemas quarterly, sometimes monthly. By the time the spreadsheet is “done,” it’s already wrong.

    If you’ve ever opened a file called FINAL_mapping_v3_actually_final.xlsx, this section is for you.

    In genomic terms: it’s annotating a genome by hand, one base pair at a time, with no automated sequencer and no reference genome. The Human Genome Project proved you can do it that way. It just took thirteen years and $2.7 billion. We wanted something faster.


    The sequencer has entered the lab

    What used to take a team of data analysts two to four weeks of grinding manual work (opening schemas, pulling samples, cross-referencing documentation, writing mappings into a spreadsheet) now runs in hours. The Data Discovery Agent processes all fourteen platforms, all 179 tables, all 2,709 columns, and delivers a fully annotated mapping with confidence scores.

    Figure 1 – Manual mapping workflow vs. Agent pipeline: weeks of spreadsheet wrangling compressed into a single automated run.

    The Data Discovery Agent is not a chatbot, nor a single prompt you paste into ChatGPT and hope for the best. It is a multi-stage autonomous platform that orchestrates a complete end-to-end pipeline of discovery, sampling, reasoning, and reporting with minimal human intervention. You point it at a data warehouse, define your canonical schema, and it does the rest: connecting to BigQuery, fetching every table and column, extracting sample rows for grounding, crawling external documentation for cross-reference, building structured prompts, invoking LLM inference, parsing the results, scoring confidence, calculating completeness, and rendering the whole thing in an interactive dashboard. The human’s job shifts from doing the mapping to reviewing the mapping: approving high-confidence matches, investigating ambiguous ones, and making strategic decisions about data gaps.

    So, what is an agent?

    In the AI and machine learning community, an agent is formally defined as a system that: (1) perceives its environment, (2) reasons about what actions to take, (3) executes those actions autonomously, and (4) iterates towards a goal – often using tools and external data sources along the way.

    Figure 2 – A simple visual of how an AI agent operates: Perceive → Reason → Act → Iterate.

    By that definition, our Data Discovery Agent is genuinely an agent: it perceives the warehouse (schema fetching), reasons about semantics (LLM inference with structured prompts), takes action (mapping columns, scoring confidence, crawling documentation), and iterates (batch processing across platforms and modalities, second-pass enrichment recommendations). This is far more than a single LLM call wrapped in a REST endpoint. It’s an orchestrated pipeline of perception-reasoning-action loops.

    Yes, the word “agent” has been stretched to meaninglessness in 2025-2026 – every API wrapper calls itself one. We use it deliberately, with the formal definition above, because our system genuinely exhibits autonomous multi-step goal-directed behaviour. If your “agent” is a prompt template with a for-loop, “script” might be a more honest label.


    Inside the sequencer: how we taught an LLM to read a data warehouse

    Teaching an LLM to read a data warehouse is not as simple as dumping a schema into a prompt. Column names are ambiguous. Data types are insufficient. Context is everything. Our pipeline gives the model maximum context at every step – schema structure, real sample values, external documentation, and a precisely defined target taxonomy – so that its annotations are grounded, not hallucinated. In genomic terms, this is the equivalent of handing a bioinformatician the raw sequence alongside the reference genome, the medical textbook, and a clear checklist of which genes to look for.

    The following five stages take us from raw warehouse to annotated genome.

    Stage 1 – Discovery

    The agent connects to Google BigQuery and discovers all staging datasets – in our case, fourteen platforms matching a configurable naming pattern. For each dataset, it fetches the full table inventory: column names, data types, row counts. Intelligent filters exclude temporary artefacts (tables prefixed with stg_, suffixed with _tmp or _dbt_tmp) – the genomic equivalent of filtering out sequencing adapters before analysis.
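    The filtering step can be sketched as a simple name-based predicate. The prefix and suffix conventions (stg_, _tmp, _dbt_tmp) come from the text above; the function names themselves are illustrative, not the production code:

    ```python
    # Sketch of the Stage 1 table filter. Only the naming conventions are
    # from the article; the helpers are hypothetical.
    ARTIFACT_PREFIXES = ("stg_",)
    ARTIFACT_SUFFIXES = ("_tmp", "_dbt_tmp")

    def is_artifact(table_name: str) -> bool:
        """Return True for temporary artefacts that should be excluded."""
        name = table_name.lower()
        return name.startswith(ARTIFACT_PREFIXES) or name.endswith(ARTIFACT_SUFFIXES)

    def filter_tables(table_names: list[str]) -> list[str]:
        """Keep only the tables worth annotating."""
        return [t for t in table_names if not is_artifact(t)]
    ```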

    Stage 2 – Sampling

    Column names alone are often ambiguous. The agent resolves this by extracting actual sample data rows from each table, using background-threaded parallel queries. These real values become critical evidence for the LLM: seeing values like 14.50, 0.83, and 127.99 in a cost column strongly suggests monetary spend, not an ID field. Column names are the labels on the test tubes; sample data is what’s actually inside them.
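    The parallel sampling can be sketched with a thread pool. In production each worker would run a `SELECT * FROM <table> LIMIT n` against BigQuery; here the fetch is stubbed with canned rows so the sketch stays self-contained, and all names are illustrative:

    ```python
    # Hypothetical sketch of Stage 2's background-threaded sampling.
    from concurrent.futures import ThreadPoolExecutor

    def fetch_sample(table: str) -> list[dict]:
        # In production: a BigQuery "SELECT * ... LIMIT n" query. Stubbed here.
        canned = {
            "facebook_ads": [{"amount_spent": 14.50}],
            "tiktok_ads": [{"spend": 127.99}],
        }
        return canned.get(table, [])

    def sample_all(tables: list[str], max_workers: int = 8) -> dict[str, list[dict]]:
        """Fan sample queries out over a thread pool, keyed by table name."""
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = pool.map(fetch_sample, tables)
        return dict(zip(tables, results))
    ```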

    Stage 3 – Documentation crawling

    The agent crawls the official connector documentation for each platform – the authoritative “Most Used Fields” pages published by our data integration provider – and extracts a structured field list: name, description, dimension or metric. This gives the agent a reference genome to cross-reference against what actually exists in the warehouse: the platform’s own user manuals turned into a checklist of expected signals.
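    The extraction step amounts to turning a crawled field listing into structured records of name, description, and dimension/metric kind. A minimal sketch, assuming a pipe-separated intermediate format invented purely for illustration:

    ```python
    # Illustrative Stage 3 extraction: crawled "Most Used Fields" text
    # into structured field records. The input format here is invented.
    def parse_field_list(text: str) -> list[dict]:
        """Parse 'name | description | kind' lines into field records."""
        fields = []
        for line in text.strip().splitlines():
            name, description, kind = (part.strip() for part in line.split("|"))
            fields.append({"name": name, "description": description, "kind": kind})
        return fields
    ```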

    Stage 4 – LLM inference

    This is the core intellectual step. The agent builds versioned prompts containing the table schemas, sample data, and the target modality definition. These are sent to Google’s Batch Inference Service (VertexAI). The model returns structured JSON: for each canonical sub-modality, it identifies matching columns, assigns a confidence score (0.0–1.0), and provides natural-language reasoning for each match. This is the sequencing run itself – the machine reading every base pair and producing an annotated readout with quality scores.
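    Parsing the model’s structured JSON can be sketched as below. The payload shape (sub_modality, columns, confidence, reasoning) mirrors the description above but is illustrative, as is the confidence threshold:

    ```python
    # Sketch of parsing the Stage 4 LLM output into confidence-filtered
    # mappings. Field names are illustrative, not the exact response schema.
    import json

    def parse_mappings(raw: str, threshold: float = 0.7) -> dict[str, list[str]]:
        """Keep only column matches at or above the confidence threshold."""
        payload = json.loads(raw)
        accepted: dict[str, list[str]] = {}
        for entry in payload["mappings"]:
            cols = [c["name"] for c in entry["columns"] if c["confidence"] >= threshold]
            if cols:
                accepted[entry["sub_modality"]] = cols
        return accepted
    ```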

    Stage 5 – Enrichment and dashboard (the genome report)

    A second-pass “Enrichment Recommender” inference cross-references the warehouse mappings from Stage 4 against the connector field lists from Stage 3. For every sub-modality where warehouse data is incomplete, it recommends which connector fields to enable – closing the loop from “what do we have?” to “what should we turn on?”
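    At its core, the cross-reference is a set difference: connector fields whose sub-modality the warehouse does not yet cover become recommendations. A minimal sketch with hypothetical field and sub-modality names:

    ```python
    # Illustrative second-pass logic: recommend connector fields for any
    # sub-modality the warehouse mappings do not cover.
    def recommend_fields(covered: set[str],
                         connector_fields: dict[str, str]) -> dict[str, list[str]]:
        """Map each missing sub-modality to the connector fields that would fill it."""
        recs: dict[str, list[str]] = {}
        for field, sub_modality in connector_fields.items():
            if sub_modality not in covered:
                recs.setdefault(sub_modality, []).append(field)
        return recs
    ```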

    All results render in a FastAPI + Jinja2/HTMX dashboard: the karyotype (/ˈkariə(ʊ)tʌɪp/ – the number and visual appearance of the chromosomes in the cell nuclei of an organism or species), the gene annotation browser, and the clinical recommendations in one interface.


    The genome report: what 2,709 annotations revealed

    When the sequencing run completes, you don’t get a wall of text. You get a genome report in the shape of a visual, interactive, filterable dashboard that tells you, at a glance, which parts of your data genome are fully annotated and which have gaps.

    Figure 3 – Platform completeness scorecards: the karyotype of our data warehouse at a glance.

    Each platform receives a “completeness” score: how much of the canonical schema is actually present. The approach is deliberately simple – a presence/absence assay. For each platform, the denominator is 24 (total sub-modalities). The numerator is how many have at least one column mapped above a configurable confidence threshold. A sub-modality is either present or it isn’t, like a gene that’s either expressed or silent. This also handles deduplication: if three tables each have a spend column mapped to performance__spend_usd, the sub-modality counts once, not three times. The formula cares about breadth of coverage, not depth of redundancy.
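    The presence/absence formula above can be sketched in a few lines; collecting sub-modalities into a set gives the deduplication for free. The tuple shape and threshold are illustrative:

    ```python
    # The completeness score described above: covered sub-modalities / 24.
    # `mappings` is a list of (table, column, sub_modality, confidence) tuples.
    TOTAL_SUB_MODALITIES = 24

    def completeness(mappings, threshold: float = 0.7) -> float:
        """Breadth of coverage: a sub-modality counts once, however many columns map to it."""
        covered = {sub for _table, _col, sub, conf in mappings if conf >= threshold}
        return len(covered) / TOTAL_SUB_MODALITIES
    ```

    Three tables each carrying a spend column still contribute a single covered sub-modality, exactly as the prose describes.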

    Figure 4 – The interactive dashboard: Submodality matching recommendation cards

    Across all fourteen platforms, average mapping completeness stands at 58%. The top performer reached 75% (18/24 sub-modalities covered), while the platform with the greatest room for improvement sat at 25% (6/24). Four platforms cluster at 70%, several sit in the 50–66% range, and a handful fall below 50%.

    Some findings were pleasant surprises – platforms assumed to lack audience data turned out to carry age range and gender columns that mapped cleanly. The data was there all along; it just hadn’t been annotated. Other findings were equally valuable: Performance metrics (spend, impressions, clicks) are well-represented across the board, but Audience, Geo, and Brand modalities show significant gaps, particularly in measurement and programmatic categories.

    At 58%, the genome is more than half-sequenced – but meaningful blind spots remain. That’s a useful headline. It is unequivocally better to know your genome’s current state than to assume it’s healthy without running the test. The scorecard turns vague unease into actionable intelligence (“We’re missing geo data on most platforms, and here’s exactly which fields to enable”). It also delivers a reality check: the foundation is stronger than expected in some areas and weaker than assumed in others. But now we know exactly where to invest.

    The Enrichment Recommender closes the diagnostic loop. Rather than simply flagging missing geo data, it prescribes a fix: “The connector for this platform offers a field called country_code (dimension); enable it, and your geo coverage improves.” Diagnosis and prescription in one step, and a natural bridge to the next phase of our work.


    What we’re building now

    The genome report is valuable on its own. But a sequencer that only reads and never acts is leaving half the value on the table. We are currently working on three initiatives that extend the agent from discovery into action; from reading the genome to editing it.

    1. Productionalisation – from lab bench to clinical practice

    Today the agent runs as an internal tool, triggered manually. We are moving it to a production-grade service: CI/CD pipelines, scheduled re-scans (daily, weekly, on-schema-change), and automated alerting when a platform’s schema mutates. The genomic equivalent of detecting a mutation before it causes problems downstream. A sequencer that runs itself.

    2. Automatic warehouse verification – is the gene actually expressed?

    The Enrichment Recommender tells you which fields exist in the documentation. But existence does not mean expression. A field can be listed, enabled in the connector, and still arrive empty because the advertiser never populated it.

    We are building a verification layer that connects to the data integration platform’s API, pulls each connector’s field list, and checks the warehouse: Is this column actually populated? What percentage of rows have non-null values?

    In DNA terms: it’s one thing to know a gene exists in the reference genome; it’s another to confirm it’s expressed in your organism. The practical outcome: instead of blind recommendations, the agent will say “Enable country_code – 87% of advertisers populate it” versus “dma_region has only 3% non-null rows, probably not worth the effort.”
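    The check itself is a fill-rate computation. In BigQuery it would be expressed as `SELECT COUNTIF(col IS NOT NULL) / COUNT(*) FROM table`; a minimal Python sketch of the same idea over sampled values:

    ```python
    # Sketch of the planned verification check: the non-null fill rate
    # the recommendations above would cite. Names are illustrative.
    def fill_rate(values: list) -> float:
        """Fraction of rows with a non-null value; 0.0 for an empty column."""
        if not values:
            return 0.0
        return sum(v is not None for v in values) / len(values)
    ```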

    3. Text-to-SQL – chatting with the genome

    Discovery answers “what data do we have?” The next question is always “what does the data say?” We are building a natural-language-to-SQL layer: ask “What was our total spend across social platforms last quarter?” and the agent translates it to a BigQuery query – using the column mappings it already knows – executes it, and returns a human-readable answer with the underlying SQL for transparency.
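    One way the existing mappings could parameterise a generated query – purely illustrative, since this layer is still being built, and all table and column names below are hypothetical:

    ```python
    # Illustrative use of the column mappings to template a cross-platform
    # spend query. Not the production text-to-SQL layer.
    def spend_query(mappings: dict[str, dict[str, str]], platforms: list[str]) -> str:
        """Build a UNION ALL query summing each platform's mapped spend column."""
        parts = [
            f"SELECT '{p}' AS platform, SUM({mappings[p]['spend']}) AS spend "
            f"FROM {mappings[p]['table']}"
            for p in platforms
        ]
        return "\nUNION ALL\n".join(parts)
    ```

    The point of the sketch: because the agent already knows that Facebook’s spend lives in amount_spent and TikTok’s in spend, the SQL generator never has to guess column semantics.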

    This positions the platform as a full conversational data layer: discover the genome, annotate it, score completeness, recommend enrichment, and then interrogate it in natural language. From specimen collection to clinical consultation, without leaving the lab.


    The genome is sequenced. The real work begins.

    For two decades, the advertising industry collected data the way early geneticists collected specimens: enthusiastically, comprehensively, and with very little annotation. We built warehouses the size of genomes and staffed them with analysts armed with spreadsheets – the biological equivalent of a magnifying glass and a very long afternoon.

    The Data Discovery Agent is what happens when you finally bring a sequencer into the lab. It reads every base pair, annotates every gene, scores every alignment, flags every gap, and delivers a genome report that turns “we have data” into “we understand our data.” And unlike the Human Genome Project, it does not take thirteen years. It takes hours.

    The genome is sequenced. The annotations are in. The gaps are mapped. Now the real work begins: not the tedium of cataloguing, but the science of insight. And if the sequencer keeps getting smarter (spoiler: it will), the next chapter will be about more than reading the data. It will be about having a conversation with it.


    Ready to look under the hood? See Data Discovery Agent – Technical Walkthrough for the full engineering deep-dive – architecture, five-stage pipeline, Vertex AI inference, FastAPI deployment on Cloud Run, and code-level detail.


    1. Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google File System. Proceedings of the 19th ACM SOSP.
    2. Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Proceedings of OSDI ’04.
    3. IDC, Worldwide Global DataSphere Forecast 2023-2028.
    4. Splunk dark data survey, summarised in “Dark Data: Unlocking the ~90% You Don’t Use” (2025).
    5. McKinsey & Company, “In search of cloud value: Can generative AI transform cloud ROI?” (2023).
    6. Forbes, “From Data Hoarding To Data Strategy, Building AI That Actually Works” (2025).
    7. Forrester, AI, data, and analytics predictions (2024).

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.