Author: Tamás Lukács

  • Data Discovery Agent Pod

    Turning raw ad platform schemas into actionable intelligence through AI-powered semantic mapping.

    TL;DR

    Up to 90% of enterprise data sits unused, and our BigQuery advertising warehouse was no exception: 14 platforms, 179 tables, 2,709 columns, 15.8 billion records, zero shared schema, and answering a question as simple as whether Pinterest carries geo data required a two-week manual investigation. Every platform invented its own vocabulary (Facebook says amount_spent, Google says cost_micros, TikTok says spend) and manual reconciliation cost analysts two to four weeks per pass, producing a spreadsheet that started decaying the moment it was finished. 

    We built the Data Discovery Agent, an AI-powered pipeline that connects to BigQuery, samples real data for grounding, crawls external connector documentation, and runs Gemini LLM inference with versioned prompts to autonomously annotate every column with confidence-scored mappings. The agent compressed weeks of manual work into hours, took us from zero structured understanding of our warehouse to a 58% average completeness (column presence per canonical sub-modality; signal strength and population validation planned for Q2) baseline across all 14 platforms, and replaced tribal-knowledge spreadsheets with a searchable, versioned genome report that turns “do we have that data?” from a research project into a dashboard lookup.

    If you don’t care about the technical details, read our blog post instead.


    Technical walkthrough

    Introduction

    Advertising data warehouses are, by nature, enormous and messy. Every major ad platform – Facebook, Google, TikTok, and a dozen others – exposes its own schema, its own naming conventions, and its own definition of what a “click” or a “spend” actually means. When these platforms feed into a centralised Google BigQuery data lake, the result is a sprawling collection of tables and columns that no single analyst can hold in their head. In our case: 14 advertising platforms, 179 tables, and 2,709 columns – with over 15.8 billion records behind them.

    Historically, making sense of this warehouse required manual schema annotation: a data analyst opening each table, reading every column name, pulling sample rows, cross-referencing API documentation, and deciding which canonical category each column belonged to. Conservatively, that process took two to four weeks of focused work for a single pass. And the result started decaying immediately – platforms update schemas quarterly, sometimes monthly.

    To eliminate this bottleneck, we architected and deployed the Data Discovery Agent: an AI-powered platform that autonomously discovers, maps, and scores the completeness of advertising data across an entire BigQuery data lake. The agent connects to BigQuery, fetches every table and column, extracts real sample data for grounding, crawls external documentation for cross-reference, invokes Vertex AI LLM inference with structured prompts, and delivers a fully annotated mapping with confidence scores – compressing weeks of manual work into hours.

    The system is built as a production-grade FastAPI web application, deployed on Google Cloud Run, with a complete admin control panel, role-based access via Google OAuth, and a storage abstraction layer that operates identically on local filesystems and Google Cloud Storage. This document provides a comprehensive technical walkthrough of its architecture, pipeline, deployment, and results.


    Agent architecture

    The Data Discovery Agent is structured as a five-stage sequential pipeline orchestrated by a FastAPI backend. Unlike a chatbot or a single-prompt wrapper, the agent executes an end-to-end workflow of discovery, sampling, documentation crawling, LLM inference, and enrichment reporting – with minimal human intervention. The human’s role shifts from doing the mapping to reviewing the mapping.

    The high-level data flow is illustrated below:

    Figure 1 – Data Discovery Agent pipeline: five stages from BigQuery data lake to executive dashboard.


    The architecture is organised around a clean separation of concerns: the application core follows a layered pattern with dedicated service modules.

    Service module inventory

    The backend is organised into focused service modules, each responsible for a distinct domain:

    | Service Module | Core Responsibility |
    | --- | --- |
    | BigQuery Service | Connects to the BigQuery data lake, discovers datasets and tables, fetches column schemas and row counts, and saves timestamped fetch snapshots. |
    | Sample Service | Extracts real sample rows from BigQuery tables using background-threaded parallel queries. Supports pause, resume, and abort controls. |
    | Adverity Service | Crawls official Adverity connector documentation pages using Playwright, extracting structured field lists (name, description, dimension/metric). |
    | Inference Service | Builds versioned LLM prompts, manages sync and batch inference pipelines via Vertex AI, and parses structured JSON responses into column-to-submodality mappings. |
    | BQ Filter Service | Manages regex-based include/exclude rules that control which BigQuery datasets and tables appear in a fetch. |
    | User Service | Handles user CRUD operations, domain allow-lists, email ban-lists, and role-based access control (admin/user). |
    | LLM Config Service | Loads and manages LLM model configurations, including model identifiers and API parameters. |
    | Vertex Model Service | Discovers available Vertex AI regions and publisher models for inference. |

    Key system capabilities

    1. Autonomous Schema Discovery: The agent connects to BigQuery, applies configurable regex filters, and discovers all advertising-platform datasets, tables, and column schemas without manual intervention.
    2. Evidence-Grounded LLM Inference: Every mapping decision is grounded in real data – column names, data types, and actual sample values – not just schema metadata. This dramatically reduces hallucination risk.
    3. Dual-Direction Mapping: The system performs both forward mapping (BigQuery columns to canonical sub-modalities) and reverse mapping (Adverity connector fields to sub-modalities), closing the loop from “what do we have?” to “what should we enable?”
    4. Completeness Scoring: Each platform receives a quantitative completeness score measuring how many of the 24 canonical sub-modalities are covered, turning vague data quality concerns into actionable metrics.
    5. Versioned Prompt Engineering: All LLM prompts are version-tracked with SHA-256 content hashing, ensuring that every inference result can be traced back to the exact prompt that produced it.
    6. Transparent Storage Abstraction: A unified storage layer allows the application to operate identically on local filesystems (development) and Google Cloud Storage (production) by changing a single environment variable.
    7. Production-Grade Access Control: Google OAuth 2.0 authentication with domain allow-lists, email ban-lists, and admin/user role separation ensures secure, auditable access.

    The modality schema: a canonical taxonomy

    At the heart of the agent’s mapping logic is a canonical modality schema – a structured taxonomy that defines what advertising data means, independent of any platform’s naming conventions. The schema is organised around five high-level modalities, each representing a fundamental category of advertising measurement:

    | Modality | Sub-modalities | What It Captures |
    | --- | --- | --- |
    | Performance | 7 | The numbers: spend, impressions, clicks, video plays, conversions, leads, CTR |
    | Creative | 4 | What the ad looked like: creative IDs, asset paths, ad names, format types |
    | Audience | 6 | Who saw it: gender, age group, interests, custom audiences, behavioural segments |
    | Geo | 5 | Where they saw it: countries, regions, cities, postal codes, designated market areas |
    | Brand | 2 | Who paid for it: brand name, advertiser identity |

    This yields 24 sub-modalities in total – the atomic units of the mapping. The LLM’s task is to determine, for each BigQuery column across all 179 tables, which sub-modality (if any) it corresponds to, and with what confidence.

    The modality dictionary is fully editable through the admin UI, meaning the taxonomy can be extended or refined without touching code. It is injected into every LLM prompt as structured context, ensuring consistent mapping behaviour across inference runs.
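A minimal sketch of what an editable modality dictionary like this can look like in code. The sub-modality names below are hypothetical examples for illustration (only performance__spend_usd appears in this write-up); the production dictionary lives in storage and is edited through the admin UI.

```python
# Illustrative taxonomy: 5 modalities flattened into 24 sub-modalities.
# Sub-modality names here are assumptions, not the production dictionary.
MODALITY_SCHEMA = {
    "performance": [
        "spend_usd", "impressions", "clicks", "video_plays",
        "conversions", "leads", "ctr",
    ],
    "creative": ["creative_id", "asset_path", "ad_name", "format_type"],
    "audience": [
        "gender", "age_group", "interests", "custom_audience",
        "behavioural_segment", "device_type",
    ],
    "geo": ["country", "region", "city", "postal_code", "dma"],
    "brand": ["brand_name", "advertiser_identity"],
}

def sub_modalities(schema: dict[str, list[str]]) -> list[str]:
    """Flatten the taxonomy into fully-qualified sub-modality keys."""
    return [f"{m}__{s}" for m, subs in schema.items() for s in subs]
```

Keeping the taxonomy as plain data rather than code is what makes it editable through the UI and injectable into every prompt as structured context.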


    Technology stack

    | Component | Technology | Role |
    | --- | --- | --- |
    | Web Framework | FastAPI | Application backend, API routing, middleware, and session management |
    | LLM Engine | Google Gemini (via Vertex AI) | Powers all schema-mapping and recommendation inference across the Gemini model family (including Gemini Flash for batch runs); chosen for low latency, strong instruction-following, and structured JSON output |
    | Data Warehouse | Google BigQuery | Primary data source; the agent queries dataset schemas, table metadata, and sample rows |
    | Batch Inference | Vertex AI Batch Prediction | Processes bulk LLM requests asynchronously via JSONL files staged in GCS; more cost-effective for full runs across all platforms |
    | Documentation Crawling | Playwright | Headless browser automation for crawling Adverity connector documentation pages and extracting structured field lists |
    | Frontend | Jinja2 + HTMX | Server-side rendered HTML templates with HTMX for interactive, partial-page updates without a JavaScript framework |
    | Authentication | Google OAuth 2.0 (via Authlib) | Secure user login with domain allow-lists, email ban-lists, and admin/user role separation |
    | Application Data | Google Cloud Storage | Persistent store for fetch results, inference outputs, platform configs, user records, and Adverity documentation |
    | Deployment | Google Cloud Run | Serverless container hosting with auto-scaling, health probes, and secret-backed environment variables |
    | Infrastructure | Terraform | Infrastructure-as-code: GCS buckets, Artifact Registry, Cloud Run service, IAM roles, Secret Manager secrets |
    | Containerisation | Docker (multi-stage build) | Reproducible builds with a builder stage for dependency installation and a slim runtime stage running as non-root |
    | Secret Management | Google Secret Manager | Stores OAuth credentials and session-signing keys, injected into Cloud Run as environment variables |
    | Language | Python 3.11 | All application code, managed via Poetry for dependency resolution |

    Pipeline deep-dive

    The following five stages take the system from a raw, unannotated data warehouse to a fully mapped, scored, and actionable genome report.

    Stage 1 – Discovery

    The agent connects to Google BigQuery and discovers all staging datasets matching a configurable naming pattern (e.g., ^bq_cgh_mp_.*_staging$). For each dataset, it fetches the full table inventory: column names, data types, and row counts. Intelligent regex filters – managed through the admin UI via bq_filters.json – exclude temporary artefacts such as tables prefixed with stg_, suffixed with _tmp or _dbt_tmp, and any datasets that don’t match the expected naming convention.

    Results are saved as a timestamped fetch snapshot (fetches/fetch_YYYYMMDD_HHMMSS/results.json). Multiple fetches can coexist, allowing admins to compare schema evolution over time. Each fetch captures the complete state of the warehouse at a point in time – the starting point for all downstream inference work.
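The filtering logic at the heart of Stage 1 can be sketched as follows. The dataset pattern is the one quoted above; the rule format is an assumption for illustration (the real rules live in bq_filters.json and are edited through the admin UI).

```python
import re

# Sketch of the Stage 1 include/exclude filtering described above.
# Which datasets are fetched: must match the staging naming convention.
DATASET_INCLUDE = re.compile(r"^bq_cgh_mp_.*_staging$")
# Which tables are dropped: temporary dbt/staging artefacts.
TABLE_EXCLUDE = [re.compile(p) for p in (r"^stg_", r"_tmp$", r"_dbt_tmp$")]

def keep_dataset(dataset_id: str) -> bool:
    return bool(DATASET_INCLUDE.match(dataset_id))

def keep_table(table_id: str) -> bool:
    return not any(p.search(table_id) for p in TABLE_EXCLUDE)
```

For example, keep_dataset accepts bq_cgh_mp_facebook_staging but rejects an unrelated analytics dataset, and keep_table drops stg_-prefixed and _tmp-suffixed tables before they ever reach the snapshot.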

    Current warehouse dimensions:

    | Metric | Count |
    | --- | --- |
    | Advertising Platforms | 14 |
    | BigQuery Datasets | 17 |
    | Tables | 179 |
    | Total Columns | 2,709 |
    | Total Records | 15.8 billion+ |

    Platforms covered

    | # | Platform | # | Platform |
    | --- | --- | --- | --- |
    | 1 | Amazon DSP | 8 | Pinterest |
    | 2 | Facebook Ads | 9 | Snapchat |
    | 3 | Google Ads | 10 | TikTok |
    | 4 | Google Ads (YouTube) | 11 | The Trade Desk |
    | 5 | Google DV360 | 12 | Twitter/X Ads |
    | 6 | Google DV360 (YouTube) | 13 | Xandr |
    | 7 | LinkedIn | 14 | IAS |

    Stage 2 – Sampling

    Column names alone are often ambiguous. A column called cost could mean total spend, cost-per-click, or an internal ID. The agent resolves this ambiguity by extracting actual sample data rows from each table using background-threaded parallel queries against BigQuery.

    These real values become critical evidence for the LLM: seeing values like 14.50, 0.83, and 127.99 in a cost column strongly suggests monetary spend, not an identifier. Seeing M, F, and Unknown in a gender column confirms an audience dimension more reliably than the column name alone.

    The extraction process supports pause, resume, and abort controls – essential when sampling across hundreds of tables with billions of rows. Progress is tracked per-platform and persisted to storage, so an interrupted extraction can resume exactly where it left off. Samples are saved as samples_<platform>.json per fetch and are injected into the LLM prompt alongside the schema metadata.
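A minimal sketch of how background-threaded sampling with pause/resume/abort controls can be wired up, in the spirit of the Sample Service described above. The class and method names are assumptions; run_query stands in for the real BigQuery sample query.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class SampleRunner:
    """Sketch: parallel table sampling with pause/resume/abort controls."""

    def __init__(self, run_query, max_workers: int = 8):
        self.run_query = run_query          # stand-in for the BigQuery call
        self.pool = ThreadPoolExecutor(max_workers=max_workers)
        self._resume = threading.Event()
        self._resume.set()                  # start in the running state
        self._abort = threading.Event()
        self.results: dict[str, list] = {}

    def pause(self):
        self._resume.clear()

    def resume(self):
        self._resume.set()

    def abort(self):
        self._abort.set()
        self._resume.set()                  # unblock paused workers so they exit

    def _sample_one(self, table: str):
        self._resume.wait()                 # block here while paused
        if self._abort.is_set():
            return
        self.results[table] = self.run_query(table)

    def sample(self, tables: list[str]) -> dict[str, list]:
        list(self.pool.map(self._sample_one, tables))
        return self.results
```

In the production service the results dict would additionally be persisted per platform after each table, which is what allows an interrupted extraction to resume where it left off.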

    Stage 3 – Documentation crawling

    The agent crawls the official connector documentation for each platform – specifically, the authoritative “Most Used Fields” pages published by Adverity, the data integration platform that feeds data into the BigQuery warehouse. Using Playwright for headless browser automation, the crawler fetches each documentation page and extracts a structured field list: name, description, and whether the field is a dimension or metric.

    Each crawled page is saved in three formats – HTML, Markdown, and structured JSON – under adverity_docs/ in the storage backend. These documents serve as the agent’s “reference manual”: a ground-truth checklist of what fields each platform could provide, independent of what currently exists in the warehouse.

    The crawl engine runs as a background task with progress tracking, with a configurable delay between requests to avoid rate-limiting. Platform URLs are managed through the admin UI, making it straightforward to add new data sources.
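The extraction step that runs after Playwright has fetched a page can be sketched as below. Here the page is assumed to have already been converted to a Markdown table with a name/description/type layout; that column layout is an illustration, not the actual Adverity page structure.

```python
import re

# Sketch: pull (name, description, dimension/metric) triples out of a
# Markdown field table produced from a crawled documentation page.
FIELD_ROW = re.compile(
    r"^\|\s*([^|]+?)\s*\|\s*([^|]+?)\s*\|\s*(dimension|metric)\s*\|$",
    re.IGNORECASE,
)

def extract_fields(markdown: str) -> list[dict]:
    fields = []
    for line in markdown.splitlines():
        m = FIELD_ROW.match(line.strip())
        if m:  # header and separator rows don't match and are skipped
            name, description, kind = m.groups()
            fields.append(
                {"name": name, "description": description, "type": kind.lower()}
            )
    return fields
```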

    Stage 4 – LLM inference

    This is the core intellectual step – where the agent reads every column and produces an annotated mapping. The Inference Service builds versioned prompts containing three layers of context:

    1. Table schemas – column names and data types for every table in the platform’s dataset.
    2. Sample data – real row values extracted in Stage 2, providing grounding evidence.
    3. Modality definitions – the target taxonomy of modalities and sub-modalities the LLM must map to.

    These structured prompts are sent to models from the Google Gemini family via Vertex AI. The model returns structured JSON: for each canonical sub-modality, it identifies matching columns, assigns a confidence score (0.0 to 1.0), and provides natural-language reasoning for each match.
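The shape of such a response, and the kind of validation applied before it is accepted, can be illustrated as follows. The exact key names are assumptions for illustration; the source only specifies that each match carries columns, a 0.0-1.0 confidence score, and reasoning.

```python
# Illustrative per-platform mapping result: for each sub-modality, the
# matched columns with confidence and reasoning. Key names are assumptions.
example_response = {
    "performance__spend_usd": [
        {"table": "ads_daily", "column": "amount_spent",
         "confidence": 0.95, "reasoning": "Monetary values in sample data"},
    ],
    "audience__gender": [
        {"table": "demo_breakdown", "column": "gender",
         "confidence": 0.9, "reasoning": "Samples contain M/F/Unknown"},
    ],
}

def validate_mapping(resp: dict) -> list[str]:
    """Return a list of problems; an empty list means well-formed."""
    errors = []
    for sub, matches in resp.items():
        for m in matches:
            if not {"table", "column", "confidence", "reasoning"} <= m.keys():
                errors.append(f"{sub}: missing keys in {m}")
            elif not 0.0 <= m["confidence"] <= 1.0:
                errors.append(f"{sub}: confidence out of range")
    return errors
```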

    Inference modes

    Two inference modes are supported, selectable through the admin UI:

    | Mode | Mechanism | Best For |
    | --- | --- | --- |
    | Sync | One LLM request per platform, processed sequentially via POST /api/inference/sync | Quick testing, single-platform runs |
    | Batch | Prompts written to JSONL files in GCS, submitted as Vertex AI batch prediction jobs, polled for completion | Full production runs across all 14 platforms; significantly more cost-effective |

    For batch mode, the workflow is: write prompts to JSONL, submit batch jobs, poll status until completion, collect and parse results. The UI provides real-time progress tracking for all active jobs.
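The first step of that workflow, staging prompts as JSONL, can be sketched as below. The request envelope shown follows the general Vertex AI Gemini batch-prediction shape but should be treated as an assumption; check the current Vertex documentation for the exact schema.

```python
import json

# Sketch: one JSONL line per platform prompt, ready to be written to GCS
# and submitted as a Vertex AI batch prediction job. The envelope schema
# is an assumption for illustration.
def build_jsonl(prompts: dict[str, str]) -> str:
    lines = [
        json.dumps({
            "request": {
                "contents": [{"role": "user", "parts": [{"text": prompt}]}],
            }
        })
        for prompt in prompts.values()
    ]
    # In this sketch, results are matched back to platforms by line order.
    return "\n".join(lines)
```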

    Prompt engineering

    All prompt templates live in app/prompts/ and are version-tracked with semantic versioning and SHA-256 content hashing via versions.json. This means every inference result records exactly which prompt version and content hash produced it – critical for reproducibility.
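The fingerprinting scheme is simple enough to sketch in a few lines. The field names recorded in versions.json are assumptions here; the mechanism (semantic version plus SHA-256 of the template content, stored with every run) is as described above.

```python
import hashlib
import json

def prompt_fingerprint(template_text: str, version: str) -> dict:
    """Semantic version plus SHA-256 content hash of a prompt template."""
    digest = hashlib.sha256(template_text.encode("utf-8")).hexdigest()
    return {"version": version, "sha256": digest}

def record_run(template_text: str, version: str, model: str) -> str:
    """Metadata stored alongside each inference result (field names assumed)."""
    meta = prompt_fingerprint(template_text, version)
    meta["model"] = model
    return json.dumps(meta)
```

Because the hash is derived from content, editing a template without bumping its version still produces a different fingerprint, so no inference result can silently claim to come from a prompt it didn't.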

    The base modality mapping prompt instructs the LLM to:

    • Examine every column name and data type across all provided tables.
    • Use sample data to confirm or reject potential matches.
    • Assign confidence scores on a defined scale (0.9-1.0 for clear matches, 0.5-0.69 for ambiguous, below 0.5 for weak).
    • Return only valid JSON – no markdown, no prose.

    A dedicated Adverity recommendation prompt performs the reverse mapping: given Adverity connector documentation, recommend which Adverity fields map to which sub-modalities. This prompt enforces a critical rule – field names must appear verbatim in the documentation, preventing the LLM from inventing fields that don’t exist.

    Robust JSON parsing

    LLM outputs are rarely perfectly formatted. The Inference Service includes a robust JSON parser (_robust_parse_llm_json) that handles common LLM formatting mistakes: markdown code fences, trailing commas, single-quoted strings, BOM characters, leading/trailing prose around the JSON object, and escaped newlines inside strings. This defensive parsing layer ensures that minor formatting issues never cause a pipeline failure.
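A simplified sketch of this kind of defensive parsing is shown below. It covers fences, surrounding prose, BOM, and trailing commas; the production _robust_parse_llm_json handles more cases (single-quoted strings, escaped newlines) than this illustration does.

```python
import json
import re

def robust_parse(raw: str) -> dict:
    """Simplified defensive parse of LLM output into a JSON object."""
    text = raw.lstrip("\ufeff").strip()                   # drop BOM + whitespace
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text)  # strip markdown fences
    start, end = text.find("{"), text.rfind("}")          # trim surrounding prose
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in LLM output")
    text = text[start:end + 1]
    # Remove trailing commas before } or ] (can touch string contents in
    # pathological cases -- acceptable for a sketch).
    text = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(text)
```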

    Three inference pipelines

    The system supports three distinct inference types:

    1. Modality Mapping Inference – Maps BigQuery columns to the canonical sub-modality taxonomy. The primary pipeline.
    2. Adverity Recommendation Inference – Given Adverity documentation, recommends which connector fields should be enabled for each sub-modality.
    3. Field Extraction Inference – Extracts structured field definitions from raw Adverity HTML/Markdown documentation using chunked batch inference, producing clean per-platform field maps.

    Each run records metadata: timestamp, model used, prompt version, prompt content hash, and platforms processed.

    Stage 5 – Enrichment and dashboard

    The final stage closes the diagnostic loop by cross-referencing the warehouse mappings from Stage 4 against the connector field lists from Stage 3.

    Enrichment recommendations

    The Adverity Recommendation inference goes beyond simply flagging missing geo data. It prescribes a fix: “The connector for this platform offers a field called country_code (dimension); enable it, and your geo coverage improves from 2/5 to 3/5 sub-modalities.” Diagnosis and prescription in one step, turning gap analysis into concrete action items for the data engineering team.

    Completeness scoring

    Each platform receives a completeness score: how much of the canonical 24-sub-modality schema is actually present. The approach is deliberately simple – a presence/absence assay. For each platform, the denominator is 24 (total sub-modalities). The numerator is how many have at least one column mapped above a configurable confidence threshold. A sub-modality is either present or it isn’t – this handles deduplication naturally. If three tables each have a spend column mapped to performance__spend_usd, the sub-modality counts once, not three times.
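The scoring rule above amounts to a few lines of code. This is a sketch of the described presence/absence assay, assuming the mapping structure is a dict of sub-modality to matched columns with confidence scores:

```python
TOTAL_SUB_MODALITIES = 24

def completeness(mappings: dict, threshold: float = 0.5) -> tuple[int, float]:
    """Count sub-modalities with at least one match above the threshold."""
    present = {
        sub for sub, matches in mappings.items()
        if any(m["confidence"] >= threshold for m in matches)
    }  # a set: three spend columns still count the sub-modality once
    covered = len(present)
    return covered, covered / TOTAL_SUB_MODALITIES
```

A platform with matches for one sub-modality above the threshold scores 1/24 regardless of how many columns map to it, which is exactly the deduplication behaviour described above.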

    The executive dashboard

    All results render in a FastAPI + Jinja2/HTMX dashboard that provides:

    • Platform scorecards – one card per advertising platform showing its name, banner image, and completeness percentage.
    • Modality drilldowns – for each platform, expandable breakdowns into modalities and their sub-modalities.
    • Column mappings – the specific BigQuery columns mapped to each sub-modality, with confidence scores and LLM reasoning.
    • Sample data preview – actual row data from each table for manual verification.
    • Executive summary sidebar – aggregated statistics across all platforms: total mappings, average completeness, record counts, highest and lowest performing platforms.
    • Interactive mapping visualiser – a left-right mapping view, filterable by platform, modality, and confidence threshold.

    Visibility of platforms is controlled via Dashboard Settings, where admins choose which fetch and which platforms appear to end users.

    PDF executive report generation

    The dashboard includes a one-click PDF executive report generator for stakeholders who need a polished, offline-readable document. When triggered, the system:

    1. Aggregates all dashboard data – platform stats, completeness scores, modality breakdowns, and mapped column details – into a single render context.
    2. Renders a dedicated Jinja2 report template (exec_report.html) – a professionally styled, multi-page A4 document with a cover page, executive overview, key metric summary cards, a platform comparison table, the full modality dictionary with descriptions, and per-platform detail sections showing completeness stats, modality coverage grids (with check/miss indicators per sub-modality), and optionally the full column mapping list.
    3. Converts the HTML to PDF using Playwright's headless Chromium instance (page.pdf(format="A4", print_background=True)), producing a pixel-perfect render with page breaks, headers, and footers showing the generation timestamp and page numbers.
    4. Returns the PDF as a downloadable file (e.g., exec_report_20260327_112709.pdf).

    The report is designed for executive consumption: the first page presents headline metrics (total platforms, tables, columns, records, average completeness) and identifies the best and worst performers. Subsequent pages provide the modality dictionary for reference, followed by detailed per-platform breakdowns with modality coverage grids showing exactly which sub-modalities are mapped (with column counts) and which remain gaps. This gives leadership a complete, self-contained snapshot of data integration maturity without requiring dashboard access.


    Cloud deployment

    The system is deployed as a production-grade, cloud-native service on Google Cloud, following a containerised, infrastructure-as-code workflow from local development through to automated deployment.

    Infrastructure-as-Code (Terraform)

    All infrastructure is defined as Terraform in deployment/terraform/ with six modules:

    | Module | Resources Created |
    | --- | --- |
    | apis | Enables 10 GCP APIs (Cloud Run, Artifact Registry, Storage, BigQuery, Vertex AI, Secret Manager, IAM, Cloud Build, etc.) |
    | artifact_registry | Docker image repository with a keep-last-5 cleanup policy |
    | gcs | Versioned GCS bucket for application data (platforms, fetches, users, banners) with lifecycle rules |
    | iam | Service account with roles: Storage Object Admin, BigQuery Data Viewer, BigQuery Job User, Vertex AI User, Secret Accessor |
    | secret_manager | Three secrets: SECRET_KEY, GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET |
    | cloud_run | Cloud Run v2 service with health probes, auto-scaling (0 to 3 instances), and secret-backed environment variables |

    Containerisation

    The application is packaged as a Docker image using a multi-stage build:

    1. Builder stage – Installs Python dependencies via Poetry, exports to requirements.txt, and installs packages into a clean prefix.
    2. Runtime stage – Copies only the installed packages and application code into a slim Python 3.11 image. Runs as a non-root appuser for security. Seed data (platform configs, modality dictionaries, Adverity docs) is baked into the image; runtime data (fetches, users, inference results) lives in the GCS bucket.

    Deployment pipeline

    A full deployment is executed with a single command:

    make deploy-all    # tf-init -> tf-apply -> docker-push -> seed-data

    This runs Terraform to provision/update all infrastructure, builds and pushes the Docker image to Artifact Registry, and syncs the local data/ directory to the GCS bucket. Incremental deployments (code-only changes) use make deploy which skips the Terraform init step.

    The seed-data target uses gsutil -m rsync to upload platform definitions, modality dictionaries, BQ filters, Adverity documentation, and other configuration data into the GCS bucket. Without seeding, a fresh deployment starts with an empty bucket and no platform or user configuration.

    Storage abstraction layer

    A critical architectural decision was the storage abstraction layer (app/core/storage.py). All data I/O goes through a unified StorageBackend interface with two implementations:

    • LocalStorage – reads/writes to the local data/ directory. Used during development (STORAGE_BACKEND=local).
    • GCSStorage – reads/writes to a GCS bucket under a configurable prefix. Used in production (STORAGE_BACKEND=gcs).

    Switching between backends requires changing only the STORAGE_BACKEND environment variable. Every service module calls get_storage() and operates through the same API regardless of the underlying storage – ensuring that code tested locally behaves identically in production.
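A minimal sketch of that abstraction, assuming a small read/write interface (the real app/core/storage.py is richer, and GCSStorage is stubbed here because it would need the google-cloud-storage client):

```python
import os
from abc import ABC, abstractmethod
from pathlib import Path

class StorageBackend(ABC):
    @abstractmethod
    def read_text(self, key: str) -> str: ...

    @abstractmethod
    def write_text(self, key: str, content: str) -> None: ...

class LocalStorage(StorageBackend):
    """Development backend: reads/writes under a local data/ directory."""

    def __init__(self, root: str = "data"):
        self.root = Path(root)

    def read_text(self, key: str) -> str:
        return (self.root / key).read_text()

    def write_text(self, key: str, content: str) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)

def get_storage() -> StorageBackend:
    # One environment variable flips every service module between backends.
    if os.environ.get("STORAGE_BACKEND", "local") == "gcs":
        raise NotImplementedError("GCSStorage sketch omitted here")
    return LocalStorage()
```

Service modules only ever call get_storage() and the interface methods, so code exercised against LocalStorage in development behaves identically against the GCS backend in production.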

    Authentication and access control

    The application uses Google OAuth 2.0 (OpenID Connect) for user authentication:

    • Users log in with their Google account via the standard OAuth consent flow.
    • Domain allow-lists restrict login to specific email domains (e.g., satalia.com, choreograph.com).
    • Email ban-lists allow blocking specific users.
    • Role-based access separates admin users (full control panel) from regular users (executive dashboard only).
    • A bypass mode (BYPASS_AUTH_AS_ADMIN=true) enables local development without Google credentials.

    All user and auth settings are managed through the admin UI and persisted via the storage backend.


    Results and impact

    When the pipeline completes a full run across all fourteen platforms, the executive dashboard delivers a comprehensive genome report of the data warehouse. The following results are drawn from the live executive summary generated by the dashboard.

    Current warehouse state

    | Metric | Value |
    | --- | --- |
    | Platforms Monitored | 14 |
    | Tables | 179 |
    | Columns | 2,709 |
    | Total Records | 15.8B |
    | Average Completeness | 58% |
    | Highest Completeness | Snapchat – 75% (18/24) |
    | Lowest Completeness | Integral Ad Science – 25% (6/24) |

    Per-platform breakdown

    | Platform | Completeness | Tables | Columns | Records |
    | --- | --- | --- | --- | --- |
    | Facebook Ads | 66% (16/24) | 15 | 364 | 3,147.7M |
    | Google Ads | 70% (17/24) | 21 | 399 | 2,284.6M |
    | TikTok | 62% (15/24) | 14 | 217 | 15.1M |
    | LinkedIn | 70% (17/24) | 12 | 143 | 366.4k |
    | Amazon DSP | 62% (15/24) | 18 | 213 | 18.2M |
    | Integral Ad Science | 25% (6/24) | 12 | 130 | 95.9M |
    | Xandr | 50% (12/24) | 11 | 156 | 1.2M |
    | Google Ads (YouTube) | 45% (11/24) | 9 | 112 | 10.9M |
    | Display & Video 360 | 70% (17/24) | 16 | 275 | 10,148.3M |
    | Display & Video 360 (YouTube) | 54% (13/24) | 15 | 187 | 63.6M |
    | Pinterest | 70% (17/24) | 11 | 170 | 6.7M |
    | Snapchat | 75% (18/24) | 10 | 150 | 15M |
    | The Trade Desk | 45% (11/24) | 6 | 80 | 4.1M |
    | Twitter Ads | 50% (12/24) | 9 | 113 | 40.4k |

    Headline findings

    Four platforms – Google Ads, LinkedIn, Display & Video 360, and Pinterest – cluster at 70% completeness (17/24 sub-modalities). Snapchat leads the pack at 75%. Several platforms sit in the 50-66% range, and a handful fall below 50%.

    Performance metrics (spend, impressions, clicks) are well-represented across the board – these are the columns every platform exposes and every data team queries first. But Audience, Geo, and Brand modalities show significant gaps, particularly in measurement and programmatic categories. Some findings were genuine surprises: platforms assumed to lack audience data turned out to carry age range and gender columns that mapped cleanly. The data was there all along – it just hadn’t been annotated.

    At 58%, the warehouse is more than half-mapped – but meaningful blind spots remain. That’s a useful headline. It is unequivocally better to know the current state than to assume it’s healthy without running the test.

    Operational impact

    | Before (Manual) | After (Agent) |
    | --- | --- |
    | 2-4 weeks per full mapping pass | Hours for a complete run |
    | Mapping decays immediately as schemas change | Re-run on demand; schema drift detected automatically |
    | Tribal knowledge locked in spreadsheets | Structured, versioned, searchable mappings with confidence scores |
    | Answering “do we have geo data from Pinterest?” required manual investigation | Dashboard lookup: instant answer with completeness score |
    | Enrichment recommendations require manual cross-referencing | Automated prescriptions: “Enable country_code on this connector” |

    The shift is from doing the mapping to reviewing the mapping – a fundamentally different use of analyst time.


    Admin workflow

    The typical end-to-end admin workflow from raw data to a published dashboard follows this sequence:

    Figure 2 – End-to-end admin workflow: eleven steps from platform configuration to a live dashboard, with a feedback loop for prompt iteration.

    Each step is performed through the web-based admin UI. The pipeline is intentionally manual at the trigger level – an admin decides when to run each stage – while the execution of each stage is fully automated. This design provides human oversight at decision points while eliminating manual drudgery within each step.


    Conclusion

    This technical walkthrough has presented the Data Discovery Agent: a production-grade, AI-powered platform that transforms the laborious process of advertising data warehouse annotation from weeks of manual spreadsheet work into an automated, repeatable, and auditable pipeline.

    The agent’s five-stage architecture – Discovery, Sampling, Documentation Crawling, LLM Inference, and Enrichment – systematically builds context at each step so that the LLM’s mapping decisions are grounded in real evidence rather than speculation. The result is a fully annotated data warehouse with confidence-scored mappings, quantitative completeness metrics, and actionable enrichment recommendations.

    Lessons learned

    • Grounding is Everything: Providing the LLM with real sample data alongside schema metadata was the single most impactful design decision. Column names alone are ambiguous; actual values resolve that ambiguity decisively.
    • Dual-Direction Mapping Closes the Loop: Mapping warehouse columns forward (what do we have?) and connector fields backward (what could we enable?) transforms the output from a passive inventory into an active roadmap.
    • Storage Abstraction Pays for Itself: The LocalStorage / GCSStorage abstraction – a seemingly minor architectural decision – eliminated an entire class of development-vs-production bugs and made the system genuinely portable from day one.
    • Versioned Prompts Enable Iteration: Content-hashing every prompt template and recording the hash alongside inference results made prompt engineering a disciplined, reproducible process rather than an ad-hoc exercise.
    • Build on a Unified Cloud Ecosystem: Building entirely on Google Cloud services – BigQuery, Vertex AI, Cloud Run, Cloud Storage, Secret Manager – eliminated integration friction between components and allowed the project to move from prototype to production deployment without stitching together tools from multiple vendors.

    What we are building next

    1. Scheduled Re-Scans: Automated daily or weekly re-runs with alerting when a platform’s schema mutates – detecting drift before it causes downstream problems.
    2. Automatic Warehouse Verification: A verification layer that checks whether recommended Adverity fields are actually populated with non-null values, distinguishing between “this field exists” and “this field contains useful data.”
    3. Text-to-SQL: A natural-language-to-SQL layer that uses the column mappings to answer ad-hoc questions against the warehouse – turning the annotated genome into a conversational data interface.

  • Why your data genome may need a check-up – and how a Data Discovery Agent can help


    We collected every sample. We sequenced nothing.

    The story of big data is, at its core, a twenty-year experiment in accumulation. Google’s landmark papers on the Google File System (2003) and MapReduce (2004) showed the world how to store and process data at unprecedented scale [1] [2]. The Hadoop ecosystem followed. Then the cloud era – BigQuery, Redshift, Snowflake – made collection cheaper, faster, and infinitely more elastic. The message was unambiguous: store everything, figure it out later.

    And store they did. IDC projects that global data creation will grow at roughly 25% compound annual rate through 2028 [3]. Companies rushed to stake claims on every click, every impression, every conversion, convinced the data itself was the gold. But gold in the ground is worthless without extraction, refining, and assay. Most enterprises skipped those steps. Industry research finds that up to 90% of generated data remains unused, a phenomenon analysts have coined “dark data” [4]. The storage invoice arrives monthly. The insight dividend has not been declared yet [5] [6].

    Enter the paradigm shift. Large Language Models (LLMs) – GPT-3 (2020), ChatGPT (2022), Google’s Gemini (2023-2024) – represent the first technology that can read data at the semantic level [7], understanding meaning rather than merely executing queries. A data warehouse full of unanalysed tables is, in many respects, like an unsequenced genome. The information is all there – every column a base pair, every table a chromosome – but without annotation, it’s just a very expensive string of letters.

    Enterprise data is at that same inflection point. The sequencer has arrived. The genome is finally being read.


    2,709 base pairs. No annotation. Good Luck, Have Fun, Don’t Die.

    Every ad-tech data team carries an invisible tax: the hours, the errors, and the opportunity cost of manually reconciling platform schemas that were never designed to talk to each other. This is not a one-time project either but a recurring levy. Every time a platform updates its API, every time a new data source is onboarded, every time someone asks “do we even have geo data from Pinterest?” – the tax collector comes knocking.

    At the heart of our work sits a centralised advertising data warehouse that aggregates campaign performance, creative assets, audience signals, geographic breakdowns, and brand metadata from every major digital advertising platform our organisation operates on. It is, in effect, the single source of truth for understanding how creative content performs across the entire digital media landscape. It is also, as we were about to discover, a genome that had never been sequenced.

    We operate across fourteen advertising platforms – spanning major social, search, programmatic, and measurement partners. Each has its own schema. Facebook calls advertising expenditure amount_spent. Google calls it cost_micros (and means it, in millionths of a currency unit). TikTok simply says spend. They all mean roughly the same thing, but to a database and to a human analyst trying to build a cross-platform report, they might as well be different species encoding the same protein with entirely different codons (/ˈkəʊdɒn/ – three-letter DNA sequences that each specify the same amino acid, just spelled differently by each organism).

    Data integration platforms and transformation layers help – they abstract some of the raw API complexity and reshape the data before it lands in the warehouse. In theory, these layers converge towards clarity. In practice, they shift the problem rather than solve it. By the time data reaches the staging tables analysts actually query, the original platform semantics have been filtered through multiple translation layers – each managed by a different team, each with its own conventions. The result is not less complexity; it is distributed complexity.

    Now scale the problem. Our Google BigQuery data warehouse spans 179 tables and 2,709 columns across fourteen platforms, holding over 15.8 billion records. And those columns need to be understood by meaning, not merely by name.

    One downstream use case requires them to be mapped to a canonical schema organised around five modalities – high-level categories that describe what a piece of advertising data is about:

    • Performance – the numbers that tell you how a campaign did: spend, impressions, clicks, video views, conversions, leads.
    • Creative – what the ad actually looked like: creative IDs, asset paths, ad names, format types.
    • Audience – who saw it: gender, age group, interests, custom audiences, behavioural segments.
    • Geo – where they saw it: countries, regions, cities, postal codes, designated market areas.
    • Brand – who paid for it: brand name, advertiser identity.

    Each modality breaks down further into 24 sub-modalities – the individual genes in our metaphor. That’s 2,709 potential annotations to make, verify, and maintain.

    A skilled data analyst doing this manually needs to: open each table schema, read every column name and type, pull sample data to disambiguate, decide which canonical sub-modality it maps to, document the mapping, and then repeat thousands more times – praying that nothing changed since they started.

    To make that concrete: imagine staring at a table with columns ad_id, date_start, spend, inline_link_clicks, cpc, frequency, reach, account_currency. Is spend total or daily? Is cpc cost-per-click or cost-per-conversion? Does reach mean unique users or total impressions? You run a SELECT * LIMIT 10, squint at the numbers, cross-reference the API docs, and after twenty minutes you’ve mapped eight columns from one table. Only 2,701 to go – across thirteen more platforms.

    Conservatively, that’s two to four weeks of focused work for a single pass. And the result starts decaying immediately. Platforms update schemas quarterly, sometimes monthly. By the time the spreadsheet is “done,” it’s already wrong.

    If you’ve ever opened a file called FINAL_mapping_v3_actually_final.xlsx, this section is for you.

    In genomic terms: it’s annotating a genome by hand, one base pair at a time, with no automated sequencer and no reference genome. The Human Genome Project proved you can do it that way. It just took thirteen years and $2.7 billion. We wanted something faster.


    The sequencer has entered the lab

    What used to take a team of data analysts two to four weeks of grinding manual work (opening schemas, pulling samples, cross-referencing documentation, writing mappings into a spreadsheet) now runs in hours. The Data Discovery Agent processes all fourteen platforms, all 179 tables, all 2,709 columns, and delivers a fully annotated mapping with confidence scores.

    Figure 1 – Manual mapping workflow vs. Agent pipeline: weeks of spreadsheet wrangling compressed into a single automated run.

    The Data Discovery Agent is not a chatbot, nor a single prompt you paste into ChatGPT and hope for the best. It is a multi-stage autonomous platform that orchestrates a complete end-to-end pipeline of discovery, sampling, reasoning, and reporting with minimal human intervention. You point it at a data warehouse, define your canonical schema, and it does the rest: connecting to BigQuery, fetching every table and column, extracting sample rows for grounding, crawling external documentation for cross-reference, building structured prompts, invoking LLM inference, parsing the results, scoring confidence, calculating completeness, and rendering the whole thing in an interactive dashboard. The human’s job shifts from doing the mapping to reviewing the mapping: approving high-confidence matches, investigating ambiguous ones, and making strategic decisions about data gaps.

    So, what is an agent?

    In the AI and machine learning community, an agent is formally defined as a system that: (1) perceives its environment, (2) reasons about what actions to take, (3) executes those actions autonomously, and (4) iterates towards a goal – often using tools and external data sources along the way.

    Figure 2 – A simple visual of how an AI agent operates: Perceive → Reason → Act → Iterate.

    By that definition, our Data Discovery Agent is genuinely an agent: it perceives the warehouse (schema fetching), reasons about semantics (LLM inference with structured prompts), takes action (mapping columns, scoring confidence, crawling documentation), and iterates (batch processing across platforms and modalities, second-pass enrichment recommendations). This is far more than a single LLM call wrapped in a REST endpoint. It’s an orchestrated pipeline of perception-reasoning-action loops.

    Yes, the word “agent” has been stretched to meaninglessness in 2025-2026 – every API wrapper calls itself one. We use it deliberately, with the formal definition above, because our system genuinely exhibits autonomous multi-step goal-directed behaviour. If your “agent” is a prompt template with a for-loop, “script” might be a more honest label.


    Inside the sequencer: how we taught an LLM to read a data warehouse

    Teaching an LLM to read a data warehouse is not as simple as dumping a schema into a prompt. Column names are ambiguous. Data types are insufficient. Context is everything. Our pipeline gives the model maximum context at every step – schema structure, real sample values, external documentation, and a precisely defined target taxonomy – so that its annotations are grounded, not hallucinated. In genomic terms, this is the equivalent of handing a bioinformatician the raw sequence alongside the reference genome, the medical textbook, and a clear checklist of which genes to look for.

    The following five stages take us from raw warehouse to annotated genome.

    Stage 1 – Discovery

    The agent connects to Google BigQuery and discovers all staging datasets – in our case, fourteen platforms matching a configurable naming pattern. For each dataset, it fetches the full table inventory: column names, data types, row counts. Intelligent filters exclude temporary artefacts (tables prefixed with stg_, suffixed with _tmp or _dbt_tmp) – the genomic equivalent of filtering out sequencing adapters before analysis.
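    The filtering step can be sketched as a simple name-based predicate. The prefix and suffix conventions (stg_, _tmp, _dbt_tmp) come from the text above; the function names themselves are illustrative, not the production code:

    ```python
    # Sketch of the Stage 1 table filter. Only the naming conventions are
    # from the article; the helpers are hypothetical.
    ARTIFACT_PREFIXES = ("stg_",)
    ARTIFACT_SUFFIXES = ("_tmp", "_dbt_tmp")

    def is_artifact(table_name: str) -> bool:
        """Return True for temporary artefacts that should be excluded."""
        name = table_name.lower()
        return name.startswith(ARTIFACT_PREFIXES) or name.endswith(ARTIFACT_SUFFIXES)

    def filter_tables(table_names: list[str]) -> list[str]:
        """Keep only the tables worth annotating."""
        return [t for t in table_names if not is_artifact(t)]
    ```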

    Stage 2 – Sampling

    Column names alone are often ambiguous. The agent resolves this by extracting actual sample data rows from each table, using background-threaded parallel queries. These real values become critical evidence for the LLM: seeing values like 14.50, 0.83, and 127.99 in a cost column strongly suggests monetary spend, not an ID field. Column names are the labels on the test tubes; sample data is what’s actually inside them.
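    The parallel sampling can be sketched with a thread pool. In production each worker would run a `SELECT * FROM <table> LIMIT n` against BigQuery; here the fetch is stubbed with canned rows so the sketch stays self-contained, and all names are illustrative:

    ```python
    # Hypothetical sketch of Stage 2's background-threaded sampling.
    from concurrent.futures import ThreadPoolExecutor

    def fetch_sample(table: str) -> list[dict]:
        # In production: a BigQuery "SELECT * ... LIMIT n" query. Stubbed here.
        canned = {
            "facebook_ads": [{"amount_spent": 14.50}],
            "tiktok_ads": [{"spend": 127.99}],
        }
        return canned.get(table, [])

    def sample_all(tables: list[str], max_workers: int = 8) -> dict[str, list[dict]]:
        """Fan sample queries out over a thread pool, keyed by table name."""
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            results = pool.map(fetch_sample, tables)
        return dict(zip(tables, results))
    ```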

    Stage 3 – Documentation crawling

    The agent crawls the official connector documentation for each platform – the authoritative “Most Used Fields” pages published by our data integration provider – and extracts a structured field list: name, description, dimension or metric. This gives the agent a reference genome to cross-reference against what actually exists in the warehouse: the platform’s own user manuals turned into a checklist of expected signals.
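    The extraction step amounts to turning a crawled field listing into structured records of name, description, and dimension/metric kind. A minimal sketch, assuming a pipe-separated intermediate format invented purely for illustration:

    ```python
    # Illustrative Stage 3 extraction: crawled "Most Used Fields" text
    # into structured field records. The input format here is invented.
    def parse_field_list(text: str) -> list[dict]:
        """Parse 'name | description | kind' lines into field records."""
        fields = []
        for line in text.strip().splitlines():
            name, description, kind = (part.strip() for part in line.split("|"))
            fields.append({"name": name, "description": description, "kind": kind})
        return fields
    ```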

    Stage 4 – LLM inference

    This is the core intellectual step. The agent builds versioned prompts containing the table schemas, sample data, and the target modality definition. These are sent to Google’s Batch Inference Service (VertexAI). The model returns structured JSON: for each canonical sub-modality, it identifies matching columns, assigns a confidence score (0.0–1.0), and provides natural-language reasoning for each match. This is the sequencing run itself – the machine reading every base pair and producing an annotated readout with quality scores.
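    Parsing the model’s structured JSON can be sketched as below. The payload shape (sub_modality, columns, confidence, reasoning) mirrors the description above but is illustrative, as is the confidence threshold:

    ```python
    # Sketch of parsing the Stage 4 LLM output into confidence-filtered
    # mappings. Field names are illustrative, not the exact response schema.
    import json

    def parse_mappings(raw: str, threshold: float = 0.7) -> dict[str, list[str]]:
        """Keep only column matches at or above the confidence threshold."""
        payload = json.loads(raw)
        accepted: dict[str, list[str]] = {}
        for entry in payload["mappings"]:
            cols = [c["name"] for c in entry["columns"] if c["confidence"] >= threshold]
            if cols:
                accepted[entry["sub_modality"]] = cols
        return accepted
    ```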

    Stage 5 – Enrichment and dashboard (the genome report)

    A second-pass “Enrichment Recommender” inference cross-references the warehouse mappings from Stage 4 against the connector field lists from Stage 3. For every sub-modality where warehouse data is incomplete, it recommends which connector fields to enable – closing the loop from “what do we have?” to “what should we turn on?”
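    At its core, the cross-reference is a set difference: connector fields whose sub-modality the warehouse does not yet cover become recommendations. A minimal sketch with hypothetical field and sub-modality names:

    ```python
    # Illustrative second-pass logic: recommend connector fields for any
    # sub-modality the warehouse mappings do not cover.
    def recommend_fields(covered: set[str],
                         connector_fields: dict[str, str]) -> dict[str, list[str]]:
        """Map each missing sub-modality to the connector fields that would fill it."""
        recs: dict[str, list[str]] = {}
        for field, sub_modality in connector_fields.items():
            if sub_modality not in covered:
                recs.setdefault(sub_modality, []).append(field)
        return recs
    ```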

    All results render in a FastAPI + Jinja2/HTMX dashboard: the karyotype (/ˈkariə(ʊ)tʌɪp/ – the number and visual appearance of the chromosomes in the cell nuclei of an organism or species), the gene annotation browser, and the clinical recommendations in one interface.


    The genome report: what 2,709 annotations revealed

    When the sequencing run completes, you don’t get a wall of text. You get a genome report in the shape of a visual, interactive, filterable dashboard that tells you, at a glance, which parts of your data genome are fully annotated and which have gaps.

    Figure 3 – Platform completeness scorecards: the karyotype of our data warehouse at a glance.

    Each platform receives a “completeness” score: how much of the canonical schema is actually present. The approach is deliberately simple – a presence/absence assay. For each platform, the denominator is 24 (total sub-modalities). The numerator is how many have at least one column mapped above a configurable confidence threshold. A sub-modality is either present or it isn’t, like a gene that’s either expressed or silent. This also handles deduplication: if three tables each have a spend column mapped to performance__spend_usd, the sub-modality counts once, not three times. The formula cares about breadth of coverage, not depth of redundancy.
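    The presence/absence formula above can be sketched in a few lines; collecting sub-modalities into a set gives the deduplication for free. The tuple shape and threshold are illustrative:

    ```python
    # The completeness score described above: covered sub-modalities / 24.
    # `mappings` is a list of (table, column, sub_modality, confidence) tuples.
    TOTAL_SUB_MODALITIES = 24

    def completeness(mappings, threshold: float = 0.7) -> float:
        """Breadth of coverage: a sub-modality counts once, however many columns map to it."""
        covered = {sub for _table, _col, sub, conf in mappings if conf >= threshold}
        return len(covered) / TOTAL_SUB_MODALITIES
    ```

    Three tables each carrying a spend column still contribute a single covered sub-modality, exactly as the prose describes.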

    Figure 4 – The interactive dashboard: Submodality matching recommendation cards

    Across all fourteen platforms, average mapping completeness stands at 58%. The top performer reached 75% (18/24 sub-modalities covered), while the platform with the greatest room for improvement sat at 25% (6/24). Four platforms cluster at 70%, several sit in the 50–66% range, and a handful fall below 50%.

    Some findings were pleasant surprises – platforms assumed to lack audience data turned out to carry age range and gender columns that mapped cleanly. The data was there all along; it just hadn’t been annotated. Other findings were equally valuable: Performance metrics (spend, impressions, clicks) are well-represented across the board, but Audience, Geo, and Brand modalities show significant gaps, particularly in measurement and programmatic categories.

    At 58%, the genome is more than half-sequenced – but meaningful blind spots remain. That’s a useful headline. It is unequivocally better to know your genome’s current state than to assume it’s healthy without running the test. The scorecard turns vague unease into actionable intelligence (“We’re missing geo data on most platforms, and here’s exactly which fields to enable”). It also delivers a reality check: the foundation is stronger than expected in some areas and weaker than assumed in others. But now we know exactly where to invest.

    The Enrichment Recommender closes the diagnostic loop. Rather than simply flagging missing geo data, it prescribes a fix: “The connector for this platform offers a field called country_code (dimension); enable it, and your geo coverage improves.” Diagnosis and prescription in one step, and a natural bridge to the next phase of our work.


    What we’re building now

    The genome report is valuable on its own. But a sequencer that only reads and never acts is leaving half the value on the table. We are currently working on three initiatives that extend the agent from discovery into action; from reading the genome to editing it.

    1. Productionalisation – from lab bench to clinical practice

    Today the agent runs as an internal tool, triggered manually. We are moving it to a production-grade service: CI/CD pipelines, scheduled re-scans (daily, weekly, on-schema-change), and automated alerting when a platform’s schema mutates. The genomic equivalent of detecting a mutation before it causes problems downstream. A sequencer that runs itself.

    2. Automatic warehouse verification – is the gene actually expressed?

    The Enrichment Recommender tells you which fields exist in the documentation. But existence does not mean expression. A field can be listed, enabled in the connector, and still arrive empty because the advertiser never populated it.

    We are building a verification layer that connects to the data integration platform’s API, pulls each connector’s field list, and checks the warehouse: Is this column actually populated? What percentage of rows have non-null values?

    In DNA terms: it’s one thing to know a gene exists in the reference genome; it’s another to confirm it’s expressed in your organism. The practical outcome: instead of blind recommendations, the agent will say “Enable country_code – 87% of advertisers populate it” versus “dma_region has only 3% non-null rows, probably not worth the effort.”
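    The check itself is a fill-rate computation. In BigQuery it would be expressed as `SELECT COUNTIF(col IS NOT NULL) / COUNT(*) FROM table`; a minimal Python sketch of the same idea over sampled values:

    ```python
    # Sketch of the planned verification check: the non-null fill rate
    # the recommendations above would cite. Names are illustrative.
    def fill_rate(values: list) -> float:
        """Fraction of rows with a non-null value; 0.0 for an empty column."""
        if not values:
            return 0.0
        return sum(v is not None for v in values) / len(values)
    ```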

    3. Text-to-SQL – chatting with the genome

    Discovery answers “what data do we have?” The next question is always “what does the data say?” We are building a natural-language-to-SQL layer: ask “What was our total spend across social platforms last quarter?” and the agent translates it to a BigQuery query – using the column mappings it already knows – executes it, and returns a human-readable answer with the underlying SQL for transparency.
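    One way the existing mappings could parameterise a generated query – purely illustrative, since this layer is still being built, and all table and column names below are hypothetical:

    ```python
    # Illustrative use of the column mappings to template a cross-platform
    # spend query. Not the production text-to-SQL layer.
    def spend_query(mappings: dict[str, dict[str, str]], platforms: list[str]) -> str:
        """Build a UNION ALL query summing each platform's mapped spend column."""
        parts = [
            f"SELECT '{p}' AS platform, SUM({mappings[p]['spend']}) AS spend "
            f"FROM {mappings[p]['table']}"
            for p in platforms
        ]
        return "\nUNION ALL\n".join(parts)
    ```

    The point of the sketch: because the agent already knows that Facebook’s spend lives in amount_spent and TikTok’s in spend, the SQL generator never has to guess column semantics.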

    This positions the platform as a full conversational data layer: discover the genome, annotate it, score completeness, recommend enrichment, and then interrogate it in natural language. From specimen collection to clinical consultation, without leaving the lab.


    The genome is sequenced. The real work begins.

    For two decades, the advertising industry collected data the way early geneticists collected specimens: enthusiastically, comprehensively, and with very little annotation. We built warehouses the size of genomes and staffed them with analysts armed with spreadsheets – the biological equivalent of a magnifying glass and a very long afternoon.

    The Data Discovery Agent is what happens when you finally bring a sequencer into the lab. It reads every base pair, annotates every gene, scores every alignment, flags every gap, and delivers a genome report that turns “we have data” into “we understand our data.” And unlike the Human Genome Project, it does not take thirteen years. It takes hours.

    The genome is sequenced. The annotations are in. The gaps are mapped. Now the real work begins: not the tedium of cataloguing, but the science of insight. And if the sequencer keeps getting smarter (spoiler: it will), the next chapter will be about more than reading the data. It will be about having a conversation with it.


    Ready to look under the hood? See Data Discovery Agent – Technical Walkthrough for the full engineering deep-dive – architecture, five-stage pipeline, Vertex AI inference, FastAPI deployment on Cloud Run, and code-level detail.


    1. Ghemawat, S., Gobioff, H., & Leung, S.-T. (2003). The Google File System. Proceedings of the 19th ACM SOSP.
    2. Dean, J., & Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. Proceedings of OSDI ’04.
    3. IDC, Worldwide Global DataSphere Forecast 2023-2028.
    4. Splunk dark data survey, summarised in “Dark Data: Unlocking the ~90% You Don’t Use” (2025).
    5. McKinsey & Company, “In search of cloud value: Can generative AI transform cloud ROI?” (2023).
    6. Forbes, “From Data Hoarding To Data Strategy, Building AI That Actually Works” (2025).
    7. Forrester, AI, data, and analytics predictions (2024).

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of the WPP AI Lab team.