Turning raw ad platform schemas into actionable intelligence through AI-powered semantic mapping.
TL;DR
Up to 90% of enterprise data sits unused, and our BigQuery advertising warehouse was no exception: 14 platforms, 179 tables, 2,709 columns, 15.8 billion records, zero shared schema, and answering a question as simple as whether Pinterest carries geo data required a two-week manual investigation. Every platform invented its own vocabulary (Facebook says amount_spent, Google says cost_micros, TikTok says spend) and manual reconciliation cost analysts two to four weeks per pass, producing a spreadsheet that started decaying the moment it was finished.
We built the Data Discovery Agent, an AI-powered pipeline that connects to BigQuery, samples real data for grounding, crawls external connector documentation, and runs Gemini LLM inference with versioned prompts to autonomously annotate every column with confidence-scored mappings. The agent compressed weeks of manual work into hours, took us from zero structured understanding of our warehouse to a 58% average completeness (column presence per canonical sub-modality; signal strength and population validation planned for Q2) baseline across all 14 platforms, and replaced tribal-knowledge spreadsheets with a searchable, versioned genome report that turns “do we have that data?” from a research project into a dashboard lookup.
If you don’t care about the technical details, read our blog post instead.
Technical walkthrough
Introduction
Advertising data warehouses are, by nature, enormous and messy. Every major ad platform – Facebook, Google, TikTok, and a dozen others – exposes its own schema, its own naming conventions, and its own definition of what a “click” or a “spend” actually means. When these platforms feed into a centralised Google BigQuery data lake, the result is a sprawling collection of tables and columns that no single analyst can hold in their head. In our case: 14 advertising platforms, 179 tables, and 2,709 columns – with over 15.8 billion records behind them.
Historically, making sense of this warehouse required manual schema annotation: a data analyst opening each table, reading every column name, pulling sample rows, cross-referencing API documentation, and deciding which canonical category each column belonged to. Conservatively, that process took two to four weeks of focused work for a single pass. And the result started decaying immediately – platforms update schemas quarterly, sometimes monthly.
To eliminate this bottleneck, we architected and deployed the Data Discovery Agent: an AI-powered platform that autonomously discovers, maps, and scores the completeness of advertising data across an entire BigQuery data lake. The agent connects to BigQuery, fetches every table and column, extracts real sample data for grounding, crawls external documentation for cross-reference, invokes Vertex AI LLM inference with structured prompts, and delivers a fully annotated mapping with confidence scores – compressing weeks of manual work into hours.
The system is built as a production-grade FastAPI web application, deployed on Google Cloud Run, with a complete admin control panel, role-based access via Google OAuth, and a storage abstraction layer that operates identically on local filesystems and Google Cloud Storage. This document provides a comprehensive technical walkthrough of its architecture, pipeline, deployment, and results.
Agent architecture
The Data Discovery Agent is structured as a five-stage sequential pipeline orchestrated by a FastAPI backend. Unlike a chatbot or a single-prompt wrapper, the agent executes an end-to-end workflow of discovery, sampling, documentation crawling, LLM inference, and enrichment reporting – with minimal human intervention. The human’s role shifts from doing the mapping to reviewing the mapping.
The high-level data flow is illustrated below:

The architecture was inspired by a clean separation of concerns. The application core follows a layered pattern with dedicated service modules.
Service module inventory
The backend is organised into focused service modules, each responsible for a distinct domain:
| Service Module | Core Responsibility |
|---|---|
| BigQuery Service | Connects to the BigQuery data lake, discovers datasets and tables, fetches column schemas and row counts, and saves timestamped fetch snapshots. |
| Sample Service | Extracts real sample rows from BigQuery tables using background-threaded parallel queries. Supports pause, resume, and abort controls. |
| Adverity Service | Crawls official Adverity connector documentation pages using Playwright, extracting structured field lists (name, description, dimension/metric). |
| Inference Service | Builds versioned LLM prompts, manages sync and batch inference pipelines via Vertex AI, and parses structured JSON responses into column-to-submodality mappings. |
| BQ Filter Service | Manages regex-based include/exclude rules that control which BigQuery datasets and tables appear in a fetch. |
| User Service | Handles user CRUD operations, domain allow-lists, email ban-lists, and role-based access control (admin/user). |
| LLM Config Service | Loads and manages LLM model configurations, including model identifiers and API parameters. |
| Vertex Model Service | Discovers available Vertex AI regions and publisher models for inference. |
Key system capabilities
- Autonomous Schema Discovery: The agent connects to BigQuery, applies configurable regex filters, and discovers all advertising-platform datasets, tables, and column schemas without manual intervention.
- Evidence-Grounded LLM Inference: Every mapping decision is grounded in real data – column names, data types, and actual sample values – not just schema metadata. This dramatically reduces hallucination risk.
- Dual-Direction Mapping: The system performs both forward mapping (BigQuery columns to canonical sub-modalities) and reverse mapping (Adverity connector fields to sub-modalities), closing the loop from “what do we have?” to “what should we enable?”
- Completeness Scoring: Each platform receives a quantitative completeness score measuring how many of the 24 canonical sub-modalities are covered, turning vague data quality concerns into actionable metrics.
- Versioned Prompt Engineering: All LLM prompts are version-tracked with SHA-256 content hashing, ensuring that every inference result can be traced back to the exact prompt that produced it.
- Transparent Storage Abstraction: A unified storage layer allows the application to operate identically on local filesystems (development) and Google Cloud Storage (production) by changing a single environment variable.
- Production-Grade Access Control: Google OAuth 2.0 authentication with domain allow-lists, email ban-lists, and admin/user role separation ensures secure, auditable access.
The modality schema: a canonical taxonomy
At the heart of the agent’s mapping logic is a canonical modality schema – a structured taxonomy that defines what advertising data means, independent of any platform’s naming conventions. The schema is organised around five high-level modalities, each representing a fundamental category of advertising measurement:
| Modality | Sub-modalities | What It Captures |
|---|---|---|
| Performance | 7 | The numbers: spend, impressions, clicks, video plays, conversions, leads, CTR |
| Creative | 4 | What the ad looked like: creative IDs, asset paths, ad names, format types |
| Audience | 6 | Who saw it: gender, age group, interests, custom audiences, behavioural segments |
| Geo | 5 | Where they saw it: countries, regions, cities, postal codes, designated market areas |
| Brand | 2 | Who paid for it: brand name, advertiser identity |
This yields 24 sub-modalities in total – the atomic units of the mapping. The LLM’s task is to determine, for each BigQuery column across all 179 tables, which sub-modality (if any) it corresponds to, and with what confidence.
The modality dictionary is fully editable through the admin UI, meaning the taxonomy can be extended or refined without touching code. It is injected into every LLM prompt as structured context, ensuring consistent mapping behaviour across inference runs.
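To make the taxonomy concrete, here is a hedged sketch of what the modality dictionary injected into each prompt might look like. The sub-modality names are illustrative stand-ins (the real dictionary is maintained in the admin UI), but the per-modality counts match the table above:

```python
# Hypothetical sketch of the modality dictionary; sub-modality names are
# illustrative -- the production taxonomy is edited through the admin UI.
MODALITY_SCHEMA = {
    "performance": ["spend_usd", "impressions", "clicks", "video_plays",
                    "conversions", "leads", "ctr"],                          # 7
    "creative":    ["creative_id", "asset_path", "ad_name", "format_type"],  # 4
    "audience":    ["gender", "age_group", "interests", "custom_audience",
                    "behavioural_segment", "audience_size"],                 # 6
    "geo":         ["country", "region", "city", "postal_code", "dma"],      # 5
    "brand":       ["brand_name", "advertiser_id"],                          # 2
}

# Flatten into the 24 atomic units of the mapping, using the
# modality__sub_modality naming seen elsewhere in this walkthrough.
SUB_MODALITIES = [f"{m}__{s}"
                  for m, subs in MODALITY_SCHEMA.items() for s in subs]
assert len(SUB_MODALITIES) == 24
```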
Technology stack
| Component | Technology | Role |
|---|---|---|
| Web Framework | FastAPI | Application backend, API routing, middleware, and session management |
| LLM Engine | Google Gemini (via Vertex AI) | Powers all schema-mapping and recommendation inference across the Gemini model family (including Gemini Flash for batch runs); chosen for low latency, strong instruction-following, and structured JSON output |
| Data Warehouse | Google BigQuery | Primary data source; the agent queries dataset schemas, table metadata, and sample rows |
| Batch Inference | Vertex AI Batch Prediction | Processes bulk LLM requests asynchronously via JSONL files staged in GCS; more cost-effective for full runs across all platforms |
| Documentation Crawling | Playwright | Headless browser automation for crawling Adverity connector documentation pages and extracting structured field lists |
| Frontend | Jinja2 + HTMX | Server-side rendered HTML templates with HTMX for interactive, partial-page updates without a JavaScript framework |
| Authentication | Google OAuth 2.0 (via Authlib) | Secure user login with domain allow-lists, email ban-lists, and admin/user role separation |
| Application Data | Google Cloud Storage | Persistent store for fetch results, inference outputs, platform configs, user records, and Adverity documentation |
| Deployment | Google Cloud Run | Serverless container hosting with auto-scaling, health probes, and secret-backed environment variables |
| Infrastructure | Terraform | Infrastructure-as-code: GCS buckets, Artifact Registry, Cloud Run service, IAM roles, Secret Manager secrets |
| Containerisation | Docker (multi-stage build) | Reproducible builds with a builder stage for dependency installation and a slim runtime stage running as non-root |
| Secret Management | Google Secret Manager | Stores OAuth credentials and session-signing keys, injected into Cloud Run as environment variables |
| Language | Python 3.11 | All application code, managed via Poetry for dependency resolution |
Pipeline deep-dive
The following five stages take the system from a raw, unannotated data warehouse to a fully mapped, scored, and actionable genome report.
Stage 1 – Discovery
The agent connects to Google BigQuery and discovers all staging datasets matching a configurable naming pattern (e.g., ^bq_cgh_mp_.*_staging$). For each dataset, it fetches the full table inventory: column names, data types, and row counts. Intelligent regex filters – managed through the admin UI via bq_filters.json – exclude temporary artefacts such as tables prefixed with stg_, suffixed with _tmp or _dbt_tmp, and any datasets that don’t match the expected naming convention.
Results are saved as a timestamped fetch snapshot (fetches/fetch_YYYYMMDD_HHMMSS/results.json). Multiple fetches can coexist, allowing admins to compare schema evolution over time. Each fetch captures the complete state of the warehouse at a point in time – the starting point for all downstream inference work.
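The filtering logic can be sketched in a few lines; the rule names below are illustrative stand-ins for what bq_filters.json encodes:

```python
import re

# Illustrative include/exclude rules in the spirit of bq_filters.json
# (the actual patterns are managed through the admin UI).
DATASET_INCLUDE = re.compile(r"^bq_cgh_mp_.*_staging$")
TABLE_EXCLUDE = [re.compile(p) for p in (r"^stg_", r"_tmp$", r"_dbt_tmp$")]

def keep_table(dataset: str, table: str) -> bool:
    """Return True if a dataset/table pair should appear in a fetch."""
    if not DATASET_INCLUDE.match(dataset):
        return False
    return not any(p.search(table) for p in TABLE_EXCLUDE)

keep_table("bq_cgh_mp_facebook_staging", "ad_insights")      # kept
keep_table("bq_cgh_mp_facebook_staging", "ad_insights_tmp")  # excluded
keep_table("analytics_scratch", "ad_insights")               # wrong dataset
```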
Current warehouse dimensions:
| Metric | Count |
|---|---|
| Advertising Platforms | 14 |
| BigQuery Datasets | 17 |
| Tables | 179 |
| Total Columns | 2,709 |
| Total Records | 15.8 billion+ |
Platforms covered
| # | Platform | # | Platform |
|---|---|---|---|
| 1 | Amazon DSP | 8 | Pinterest |
| 2 | Facebook Ads | 9 | Snapchat |
| 3 | Google Ads | 10 | TikTok |
| 4 | Google Ads (YouTube) | 11 | The Trade Desk |
| 5 | Google DV360 | 12 | Twitter/X Ads |
| 6 | Google DV360 (YouTube) | 13 | Xandr |
| 7 | LinkedIn | 14 | IAS |
Stage 2 – Sampling
Column names alone are often ambiguous. A column called cost could mean total spend, cost-per-click, or an internal ID. The agent resolves this ambiguity by extracting actual sample data rows from each table using background-threaded parallel queries against BigQuery.
These real values become critical evidence for the LLM: seeing 14.50, 0.83, 127.99 in a cost column strongly suggests monetary spend, not an identifier. Seeing M, F, Unknown in a gender column confirms an audience dimension more reliably than the column name alone.
The extraction process supports pause, resume, and abort controls – essential when sampling across hundreds of tables with billions of rows. Progress is tracked per-platform and persisted to storage, so an interrupted extraction can resume exactly where it left off. Samples are saved as samples_<platform>.json per fetch and are injected into the LLM prompt alongside the schema metadata.
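A minimal sketch of the pausable, background-threaded sampling pattern. The real Sample Service also persists per-platform progress to storage (omitted here), and run_query stands in for the actual BigQuery client call:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# set = running; clear() pauses workers before their next table.
pause = threading.Event()
pause.set()

def sample_table(table: str, run_query) -> tuple[str, list]:
    pause.wait()  # block here while paused
    return table, run_query(f"SELECT * FROM `{table}` LIMIT 5")

def sample_platform(tables, run_query, max_workers: int = 8) -> dict:
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(sample_table, t, run_query) for t in tables]
        return dict(f.result() for f in futures)

# Usage with a fake query function standing in for BigQuery:
def fake_query(sql: str) -> list[dict]:
    return [{"cost": 14.50}, {"cost": 0.83}]

samples = sample_platform(["ds.tbl_a", "ds.tbl_b"], fake_query)
```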
Stage 3 – Documentation crawling
The agent crawls the official connector documentation for each platform – specifically, the authoritative “Most Used Fields” pages published by Adverity, the data integration platform that feeds data into the BigQuery warehouse. Using Playwright for headless browser automation, the crawler fetches each documentation page and extracts a structured field list: name, description, and whether the field is a dimension or metric.
Each crawled page is saved in three formats – HTML, Markdown, and structured JSON – under adverity_docs/ in the storage backend. These documents serve as the agent’s “reference manual”: a ground-truth checklist of what fields each platform could provide, independent of what currently exists in the warehouse.
The crawl engine runs as a background task with progress tracking, with a configurable delay between requests to avoid rate-limiting. Platform URLs are managed through the admin UI, making it straightforward to add new data sources.
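The extraction step can be illustrated with a toy parser. This assumes the documentation lays fields out as a three-column table (name, description, dimension/metric); the real pages are first rendered by Playwright and their markup will differ:

```python
from html.parser import HTMLParser

# Toy extractor for a crawled docs page; assumes one <tr> per field with
# cells for name, description, and dimension/metric. Markup is assumed.
class FieldTableParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and len(self._row) == 3:
            name, desc, kind = self._row
            self.fields.append({"name": name, "description": desc,
                                "type": kind.lower()})

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self._row.append(data.strip())

html = """<table>
<tr><td>country_code</td><td>ISO country</td><td>Dimension</td></tr>
<tr><td>spend</td><td>Total cost</td><td>Metric</td></tr>
</table>"""
p = FieldTableParser()
p.feed(html)
# p.fields now holds the structured field list saved as JSON.
```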
Stage 4 – LLM inference
This is the core intellectual step – where the agent reads every column and produces an annotated mapping. The Inference Service builds versioned prompts containing three layers of context:
- Table schemas – column names and data types for every table in the platform’s dataset.
- Sample data – real row values extracted in Stage 2, providing grounding evidence.
- Modality definitions – the target taxonomy of modalities and sub-modalities the LLM must map to.
These structured prompts are sent to models from the Google Gemini family via Vertex AI. The model returns structured JSON: for each canonical sub-modality, it identifies matching columns, assigns a confidence score (0.0 to 1.0), and provides natural-language reasoning for each match.
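The shape of the exchange can be sketched as follows; the prompt section headers and response keys are illustrative, not the production schema:

```python
import json

# Hedged sketch of the three-layer prompt context and the response shape
# the parser expects; names are illustrative.
def build_mapping_prompt(schemas: dict, samples: dict, modalities: dict) -> str:
    return "\n\n".join([
        "You are mapping warehouse columns to a canonical taxonomy.",
        "TABLE SCHEMAS:\n" + json.dumps(schemas, indent=2),
        "SAMPLE DATA:\n" + json.dumps(samples, indent=2),
        "MODALITY DEFINITIONS:\n" + json.dumps(modalities, indent=2),
        "Return only valid JSON -- no markdown, no prose.",
    ])

# Shape of one mapping in the model's structured response:
example_response = {
    "performance__spend_usd": [{
        "table": "ad_insights",
        "column": "amount_spent",
        "confidence": 0.95,
        "reasoning": "Sample values 14.50, 0.83 are monetary amounts.",
    }]
}
```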
Inference modes
Two inference modes are supported, selectable through the admin UI:
| Mode | Mechanism | Best For |
|---|---|---|
| Sync | One LLM request per platform, processed sequentially via POST /api/inference/sync | Quick testing, single-platform runs |
| Batch | Prompts written to JSONL files in GCS, submitted as Vertex AI batch prediction jobs, polled for completion | Full production runs across all 14 platforms; significantly more cost-effective |
For batch mode, the workflow is: write prompts to JSONL, submit batch jobs, poll status until completion, collect and parse results. The UI provides real-time progress tracking for all active jobs.
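A sketch of the JSONL staging step; the request envelope follows the Gemini batch-prediction format in spirit, but field names should be verified against current Vertex AI documentation before use:

```python
import json

# One JSONL line per platform prompt; the envelope shape and the use of a
# per-line key for correlating results are assumptions, not the exact
# production format.
def write_batch_jsonl(prompts: dict[str, str], path: str) -> int:
    with open(path, "w", encoding="utf-8") as f:
        for platform, prompt in prompts.items():
            request = {
                "request": {
                    "contents": [{"role": "user", "parts": [{"text": prompt}]}]
                },
                "labels": {"platform": platform},  # assumed correlation key
            }
            f.write(json.dumps(request) + "\n")
    return len(prompts)
```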
Prompt engineering
All prompt templates live in app/prompts/ and are version-tracked with semantic versioning and SHA-256 content hashing via versions.json. This means every inference result records exactly which prompt version and content hash produced it – critical for reproducibility.
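The fingerprinting itself is simple; this sketch assumes a versions.json-style record with version and sha256 keys:

```python
import hashlib

# Minimal sketch of prompt version tracking; the record layout is assumed.
def prompt_fingerprint(template_text: str, version: str) -> dict:
    digest = hashlib.sha256(template_text.encode("utf-8")).hexdigest()
    return {"version": version, "sha256": digest}

# Stored alongside every inference result so each output traces back to
# the exact prompt content that produced it.
record = prompt_fingerprint("Map each column to a sub-modality...", "1.2.0")
```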
The base modality mapping prompt instructs the LLM to:
- Examine every column name and data type across all provided tables.
- Use sample data to confirm or reject potential matches.
- Assign confidence scores on a defined scale (0.9-1.0 for clear matches, 0.5-0.69 for ambiguous, below 0.5 for weak).
- Return only valid JSON – no markdown, no prose.
A dedicated Adverity recommendation prompt performs the reverse mapping: given Adverity connector documentation, recommend which Adverity fields map to which sub-modalities. This prompt enforces a critical rule – field names must appear verbatim in the documentation, preventing the LLM from inventing fields that don’t exist.
Robust JSON parsing
LLM outputs are rarely perfectly formatted. The Inference Service includes a robust JSON parser (_robust_parse_llm_json) that handles common LLM formatting mistakes: markdown code fences, trailing commas, single-quoted strings, BOM characters, leading/trailing prose around the JSON object, and escaped newlines inside strings. This defensive parsing layer ensures that minor formatting issues never cause a pipeline failure.
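A simplified version of such a parser, covering a subset of the clean-up steps described above (code fences, trailing commas, surrounding prose, BOM); the production _robust_parse_llm_json handles more cases:

```python
import json
import re

# Defensive parsing sketch: progressively clean the raw model output until
# json.loads succeeds.
def robust_parse_llm_json(raw: str):
    text = raw.lstrip("\ufeff").strip()                      # BOM + whitespace
    fence = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fence:                                                # markdown fences
        text = fence.group(1)
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:                          # surrounding prose
        text = text[start:end + 1]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        text = re.sub(r",\s*([}\]])", r"\1", text)           # trailing commas
        return json.loads(text)

robust_parse_llm_json('Here you go:\n```json\n{"a": 1,}\n```\nHope that helps!')
# -> {'a': 1}
```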
Three inference pipelines
The system supports three distinct inference types:
- Modality Mapping Inference – Maps BigQuery columns to the canonical sub-modality taxonomy. The primary pipeline.
- Adverity Recommendation Inference – Given Adverity documentation, recommends which connector fields should be enabled for each sub-modality.
- Field Extraction Inference – Extracts structured field definitions from raw Adverity HTML/Markdown documentation using chunked batch inference, producing clean per-platform field maps.
Each run records metadata: timestamp, model used, prompt version, prompt content hash, and platforms processed.
Stage 5 – Enrichment and dashboard
The final stage closes the diagnostic loop by cross-referencing the warehouse mappings from Stage 4 against the connector field lists from Stage 3.
Enrichment recommendations
The Adverity Recommendation inference goes beyond simply flagging missing geo data. It prescribes a fix: “The connector for this platform offers a field called country_code (dimension); enable it, and your geo coverage improves from 2/5 to 3/5 sub-modalities.” Diagnosis and prescription in one step, turning gap analysis into concrete action items for the data engineering team.
Completeness scoring
Each platform receives a completeness score: how much of the canonical 24-sub-modality schema is actually present. The approach is deliberately simple – a presence/absence assay. For each platform, the denominator is 24 (total sub-modalities). The numerator is how many have at least one column mapped above a configurable confidence threshold. A sub-modality is either present or it isn’t – this handles deduplication naturally. If three tables each have a spend column mapped to performance__spend_usd, the sub-modality counts once, not three times.
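The scoring rule can be expressed in a few lines; the mapping shape and the 0.7 threshold below are illustrative:

```python
# Presence/absence completeness score: a sub-modality counts once no matter
# how many columns map to it. Threshold and record shape are assumptions.
def completeness(mappings: list[dict], total: int = 24,
                 threshold: float = 0.7) -> tuple[int, int]:
    present = {m["sub_modality"] for m in mappings
               if m["confidence"] >= threshold}
    return len(present), round(100 * len(present) / total)

mappings = [
    {"sub_modality": "performance__spend_usd", "confidence": 0.95},
    {"sub_modality": "performance__spend_usd", "confidence": 0.90},  # dedup'd
    {"sub_modality": "geo__country", "confidence": 0.40},            # too weak
]
completeness(mappings)  # -> (1, 4): one sub-modality present, 4% of 24
```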
The executive dashboard
All results render in a FastAPI + Jinja2/HTMX dashboard that provides:
- Platform scorecards – one card per advertising platform showing its name, banner image, and completeness percentage.
- Modality drilldowns – for each platform, expandable breakdowns into modalities and their sub-modalities.
- Column mappings – the specific BigQuery columns mapped to each sub-modality, with confidence scores and LLM reasoning.
- Sample data preview – actual row data from each table for manual verification.
- Executive summary sidebar – aggregated statistics across all platforms: total mappings, average completeness, record counts, highest and lowest performing platforms.
- Interactive mapping visualiser – a left-right mapping view, filterable by platform, modality, and confidence threshold.
Visibility of platforms is controlled via Dashboard Settings, where admins choose which fetch and which platforms appear to end users.
PDF executive report generation
The dashboard includes a one-click PDF executive report generator for stakeholders who need a polished, offline-readable document. When triggered, the system:
- Aggregates all dashboard data – platform stats, completeness scores, modality breakdowns, and mapped column details – into a single render context.
- Renders a dedicated Jinja2 report template (exec_report.html) – a professionally styled, multi-page A4 document with a cover page, executive overview, key metric summary cards, a platform comparison table, the full modality dictionary with descriptions, and per-platform detail sections showing completeness stats, modality coverage grids (with check/miss indicators per sub-modality), and optionally the full column mapping list.
- Converts the HTML to PDF using Playwright's headless Chromium instance (page.pdf(format="A4", print_background=True)), producing a pixel-perfect render with page breaks, headers, and footers showing the generation timestamp and page numbers.
- Returns the PDF as a downloadable file (e.g., exec_report_20260327_112709.pdf).
The report is designed for executive consumption: the first page presents headline metrics (total platforms, tables, columns, records, average completeness) and identifies the best and worst performers. Subsequent pages provide the modality dictionary for reference, followed by detailed per-platform breakdowns with modality coverage grids showing exactly which sub-modalities are mapped (with column counts) and which remain gaps. This gives leadership a complete, self-contained snapshot of data integration maturity without requiring dashboard access.
Cloud deployment
The system is deployed as a production-grade, cloud-native service on Google Cloud, following a containerised, infrastructure-as-code workflow from local development through to automated deployment.
Infrastructure-as-Code (Terraform)
All infrastructure is defined as Terraform in deployment/terraform/ with six modules:
| Module | Resources Created |
|---|---|
| apis | Enables 10 GCP APIs (Cloud Run, Artifact Registry, Storage, BigQuery, Vertex AI, Secret Manager, IAM, Cloud Build, etc.) |
| artifact_registry | Docker image repository with a keep-last-5 cleanup policy |
| gcs | Versioned GCS bucket for application data (platforms, fetches, users, banners) with lifecycle rules |
| iam | Service account with roles: Storage Object Admin, BigQuery Data Viewer, BigQuery Job User, Vertex AI User, Secret Accessor |
| secret_manager | Three secrets: SECRET_KEY, GOOGLE_CLIENT_ID, GOOGLE_CLIENT_SECRET |
| cloud_run | Cloud Run v2 service with health probes, auto-scaling (0 to 3 instances), and secret-backed environment variables |
Containerisation
The application is packaged as a Docker image using a multi-stage build:
- Builder stage – Installs Python dependencies via Poetry, exports to requirements.txt, and installs packages into a clean prefix.
- Runtime stage – Copies only the installed packages and application code into a slim Python 3.11 image. Runs as a non-root appuser for security. Seed data (platform configs, modality dictionaries, Adverity docs) is baked into the image; runtime data (fetches, users, inference results) lives in the GCS bucket.
Deployment pipeline
A full deployment is executed with a single command:
```shell
make deploy-all   # tf-init -> tf-apply -> docker-push -> seed-data
```
This runs Terraform to provision/update all infrastructure, builds and pushes the Docker image to Artifact Registry, and syncs the local data/ directory to the GCS bucket. Incremental deployments (code-only changes) use make deploy which skips the Terraform init step.
The seed-data target uses gsutil -m rsync to upload platform definitions, modality dictionaries, BQ filters, Adverity documentation, and other configuration data into the GCS bucket. Without seeding, a fresh deployment starts with an empty bucket and no platform or user configuration.
Storage abstraction layer
A critical architectural decision was the storage abstraction layer (app/core/storage.py). All data I/O goes through a unified StorageBackend interface with two implementations:
- LocalStorage – reads/writes to the local data/ directory. Used during development (STORAGE_BACKEND=local).
- GCSStorage – reads/writes to a GCS bucket under a configurable prefix. Used in production (STORAGE_BACKEND=gcs).
Switching between backends requires changing only the STORAGE_BACKEND environment variable. Every service module calls get_storage() and operates through the same API regardless of the underlying storage – ensuring that code tested locally behaves identically in production.
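A minimal sketch of the pattern, assuming a read/write interface (the real StorageBackend in app/core/storage.py certainly exposes more operations, and the GCS twin is omitted here):

```python
import os
from abc import ABC, abstractmethod

# Sketch of the StorageBackend idea; method set and selection logic assumed.
class StorageBackend(ABC):
    @abstractmethod
    def read(self, key: str) -> bytes: ...
    @abstractmethod
    def write(self, key: str, data: bytes) -> None: ...

class LocalStorage(StorageBackend):
    def __init__(self, root: str = "data"):
        self.root = root

    def _path(self, key: str) -> str:
        # Keys are forward-slash paths in both backends.
        return os.path.join(self.root, *key.split("/"))

    def read(self, key: str) -> bytes:
        with open(self._path(key), "rb") as f:
            return f.read()

    def write(self, key: str, data: bytes) -> None:
        path = self._path(key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

def get_storage() -> StorageBackend:
    # A GCSStorage implementation would be selected here in production.
    if os.environ.get("STORAGE_BACKEND", "local") == "local":
        return LocalStorage()
    raise NotImplementedError("GCSStorage sketch omitted")
```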
Authentication and access control
The application uses Google OAuth 2.0 (OpenID Connect) for user authentication:
- Users log in with their Google account via the standard OAuth consent flow.
- Domain allow-lists restrict login to specific email domains (e.g., satalia.com, choreograph.com).
- Email ban-lists allow blocking specific users.
- Role-based access separates admin users (full control panel) from regular users (executive dashboard only).
- A bypass mode (BYPASS_AUTH_AS_ADMIN=true) enables local development without Google credentials.
All user and auth settings are managed through the admin UI and persisted via the storage backend.
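The allow-list/ban-list check reduces to a small predicate; the settings shape here is assumed:

```python
# Hedged sketch of the post-OAuth access decision: a user may log in only if
# their email is not banned and their domain is on the allow-list.
def can_log_in(email: str, allowed_domains: set[str],
               banned_emails: set[str]) -> bool:
    domain = email.rsplit("@", 1)[-1].lower()
    return email.lower() not in banned_emails and domain in allowed_domains
```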
Results and impact
When the pipeline completes a full run across all fourteen platforms, the executive dashboard delivers a comprehensive genome report of the data warehouse. The following results are drawn from the live executive summary generated by the dashboard.
Current warehouse state
| Metric | Value |
|---|---|
| Platforms Monitored | 14 |
| Tables | 179 |
| Columns | 2,709 |
| Total Records | 15.8B |
| Average Completeness | 58% |
| Highest Completeness | Snapchat – 75% (18/24) |
| Lowest Completeness | Integral Ad Science – 25% (6/24) |
Per-platform breakdown
| Platform | Completeness | Tables | Columns | Records |
|---|---|---|---|---|
| Facebook Ads | 66% (16/24) | 15 | 364 | 3,147.7M |
| Google Ads | 70% (17/24) | 21 | 399 | 2,284.6M |
| TikTok | 62% (15/24) | 14 | 217 | 15.1M |
| LinkedIn | 70% (17/24) | 12 | 143 | 366.4k |
| Amazon DSP | 62% (15/24) | 18 | 213 | 18.2M |
| Integral Ad Science | 25% (6/24) | 12 | 130 | 95.9M |
| Xandr | 50% (12/24) | 11 | 156 | 1.2M |
| Google Ads (YouTube) | 45% (11/24) | 9 | 112 | 10.9M |
| Display & Video 360 | 70% (17/24) | 16 | 275 | 10,148.3M |
| Display & Video 360 (YouTube) | 54% (13/24) | 15 | 187 | 63.6M |
| Pinterest | 70% (17/24) | 11 | 170 | 6.7M |
| Snapchat | 75% (18/24) | 10 | 150 | 15M |
| The Trade Desk | 45% (11/24) | 6 | 80 | 4.1M |
| Twitter Ads | 50% (12/24) | 9 | 113 | 40.4k |
Headline findings
Four platforms – Google Ads, LinkedIn, Display & Video 360, and Pinterest – cluster at 70% completeness (17/24 sub-modalities). Snapchat leads the pack at 75%. Several platforms sit in the 50-66% range, and a handful fall below 50%.
Performance metrics (spend, impressions, clicks) are well-represented across the board – these are the columns every platform exposes and every data team queries first. But Audience, Geo, and Brand modalities show significant gaps, particularly in measurement and programmatic categories. Some findings were genuine surprises: platforms assumed to lack audience data turned out to carry age range and gender columns that mapped cleanly. The data was there all along – it just hadn’t been annotated.
At 58%, the warehouse is more than half-mapped – but meaningful blind spots remain. That’s a useful headline. It is unequivocally better to know the current state than to assume it’s healthy without running the test.
Operational impact
| Before (Manual) | After (Agent) |
|---|---|
| 2-4 weeks per full mapping pass | Hours for a complete run |
| Mapping decays immediately as schemas change | Re-run on demand; schema drift detected automatically |
| Tribal knowledge locked in spreadsheets | Structured, versioned, searchable mappings with confidence scores |
| Answering “do we have geo data from Pinterest?” required manual investigation | Dashboard lookup: instant answer with completeness score |
| Enrichment recommendations require manual cross-referencing | Automated prescriptions: “Enable country_code on this connector” |
The shift is from doing the mapping to reviewing the mapping – a fundamentally different use of analyst time.
Admin workflow
The typical end-to-end admin workflow from raw data to a published dashboard follows this sequence:

Each step is performed through the web-based admin UI. The pipeline is intentionally manual at the trigger level – an admin decides when to run each stage – while the execution of each stage is fully automated. This design provides human oversight at decision points while eliminating manual drudgery within each step.
Conclusion
This technical walkthrough has presented the Data Discovery Agent: a production-grade, AI-powered platform that transforms the laborious process of advertising data warehouse annotation from weeks of manual spreadsheet work into an automated, repeatable, and auditable pipeline.
The agent’s five-stage architecture – Discovery, Sampling, Documentation Crawling, LLM Inference, and Enrichment – systematically builds context at each step so that the LLM’s mapping decisions are grounded in real evidence rather than speculation. The result is a fully annotated data warehouse with confidence-scored mappings, quantitative completeness metrics, and actionable enrichment recommendations.
Lessons learned
- Grounding is Everything: Providing the LLM with real sample data alongside schema metadata was the single most impactful design decision. Column names alone are ambiguous; actual values resolve that ambiguity decisively.
- Dual-Direction Mapping Closes the Loop: Mapping warehouse columns forward (what do we have?) and connector fields backward (what could we enable?) transforms the output from a passive inventory into an active roadmap.
- Storage Abstraction Pays for Itself: The LocalStorage/GCSStorage abstraction – a seemingly minor architectural decision – eliminated an entire class of development-vs-production bugs and made the system genuinely portable from day one.
- Versioned Prompts Enable Iteration: Content-hashing every prompt template and recording the hash alongside inference results made prompt engineering a disciplined, reproducible process rather than an ad-hoc exercise.
- Build on a Unified Cloud Ecosystem: Building entirely on Google Cloud services – BigQuery, Vertex AI, Cloud Run, Cloud Storage, Secret Manager – eliminated integration friction between components and allowed the project to move from prototype to production deployment without stitching together tools from multiple vendors.
What we are building next
- Scheduled Re-Scans: Automated daily or weekly re-runs with alerting when a platform’s schema mutates – detecting drift before it causes downstream problems.
- Automatic Warehouse Verification: A verification layer that checks whether recommended Adverity fields are actually populated with non-null values, distinguishing between “this field exists” and “this field contains useful data.”
- Text-to-SQL: A natural-language-to-SQL layer that uses the column mappings to answer ad-hoc questions against the warehouse – turning the annotated genome into a conversational data interface.