{"id":155,"date":"2026-03-24T15:38:14","date_gmt":"2026-03-24T15:38:14","guid":{"rendered":"https:\/\/thelab.wppresolve.com\/?post_type=research_pod&#038;p=155"},"modified":"2026-04-28T10:00:02","modified_gmt":"2026-04-28T10:00:02","slug":"data-quality-agent-pod","status":"publish","type":"research_pod","link":"https:\/\/cms.research.wpp.com\/?research_pod=data-quality-agent-pod","title":{"rendered":"Data Quality Agent Pod"},"content":{"rendered":"\n<p><em>Silent data corruption is a well-documented challenge across modern ML pipelines. Broken ingestion jobs, schema drift, logical inconsistencies: these issues rarely trigger alerts, and by the time they&#8217;re caught, downstream models may have already been learning from compromised data. We built an autonomous agent that audits data directly in BigQuery, runs forensic structural and logical checks with zero manual input, and, crucially, remembers. Its persistent memory architecture means every audit sharpens the next, elevating data quality from a routine operational task into a compounding strategic advantage. The results: F1 of 0.88, perfect detection on 73% of test scenarios, and 100% consistency across runs.<\/em><\/p>\n\n\n\n<p><strong>For a high-level overview of this technical report, please visit our corresponding non-technical blog post <a href=\"https:\/\/research.wpp.com\/blog\/data-quality-assurance-agent-blog-post\">here<\/a>.<\/strong><\/p>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>Data quality assurance agent technical walkthrough<\/strong><\/h1>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Introduction<\/strong><\/h2>\n\n\n\n<p>Data <strong>quality assurance (QA)<\/strong> is a critical bottleneck in modern data engineering pipelines. Engineers frequently dedicate a disproportionate amount of time to manually profiling, verifying, and debugging datasets before they are cleared for downstream use by data scientists building machine learning models. 
This manual intervention is unscalable, computationally inefficient, and prone to human error, particularly when validating complex, cross-column business logic within wide tables.<\/p>\n\n\n\n<p>To address this infrastructure gap, we architected and deployed the&nbsp;<strong>Data Quality Assurance Agent<\/strong>. Operating directly against our&nbsp;<a href=\"https:\/\/docs.cloud.google.com\/bigquery\/docs\"><strong>Google BigQuery<\/strong><\/a> data warehouse, the agent can autonomously interpret schemas, execute targeted <strong>NL2SQL<\/strong> anomaly detection queries, and generate comprehensive diagnostic reports.<\/p>\n\n\n\n<p>An important feature of this agent is its&nbsp;<strong>long-term memory architecture<\/strong>, hosted via <strong><a href=\"https:\/\/cloud.google.com\/vertex-ai\/generative-ai\/docs\/agent-engine\/overview\">Vertex AI Agent Engine<\/a><\/strong>. By indexing and retrieving historical context across sessions, the agent dynamically suppresses established baseline anomalies and adapts its detection heuristics based on previous human-in-the-loop corrections.<\/p>\n\n\n\n<p>To validate the agent&#8217;s detection capabilities under controlled and reproducible conditions, we developed a synthetic data generation pipeline that injects known structural and logical anomalies at configurable rates into anonymised marketing data. Evaluation was conducted across&nbsp;three <strong>experimental configurations<\/strong>, spanning three prompt complexity levels and two memory modes, with scoring fully automated via an LLM-as-a-Judge pipeline using Gemini as the evaluator. 
The agent achieved a peak detection rate of&nbsp;<strong>0.883<\/strong>&nbsp;across all injected error categories under forensic-level prompting.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Agent architecture<\/strong><\/h2>\n\n\n\n<p>The solution is structured around a&nbsp;<strong>hierarchical multi-agent orchestration pattern<\/strong>, with a central Root Agent coordinating seven specialized sub-agents, as illustrated in the diagram below. The Root Agent functions as an LLM-powered intent classifier: it parses each incoming user request, decomposes compound instructions into an ordered execution plan, and dynamically routes sub-tasks to the appropriate specialist agent, without relying on hard-coded conditional routing logic. This design enables the system to handle chained, multi-step requests (e.g.,&nbsp;<em>&#8220;query the database and then plot the results&#8221;<\/em>) by composing multiple sub-agents in sequence within a single session. 
The architecture was inspired by ADK\u2019s official examples <a href=\"https:\/\/github.com\/google\/adk-samples\/tree\/main\/python\/agents\/data-science\">repo<\/a>.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"921\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-37-1024x921.png\" alt=\"\" class=\"wp-image-173\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-37-1024x921.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-37-300x270.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-37-768x691.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-37.png 1121w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Sub-agent inventory<\/strong><\/h2>\n\n\n\n<p>The system employs a multi-agent orchestration architecture where a primary&nbsp;<strong>Root Agent<\/strong>&nbsp;delegates tasks to specialized sub-agents based on user intent. 
All agents are powered by&nbsp;<strong>Gemini 2.5 Flash<\/strong>, optimised for complex multi-step reasoning, low inference latency, and cost efficiency under high request volume.<\/p>\n\n\n\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th><strong>Sub-Agent<\/strong><\/th><th><strong>Core Responsibility<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Auditor Agent<\/strong><\/td><td>Drives autonomous data quality auditing by executing structural and logical checks against BigQuery, maintaining historical context via the Memory Bank.<\/td><\/tr><tr><td><strong>BigQuery Agent<\/strong><\/td><td>Facilitates Text-to-SQL (NL2SQL) translation, generating optimized queries and executing them directly against the data warehouse.<\/td><\/tr><tr><td><strong>Analytics Agent<\/strong><\/td><td>Performs Advanced Data Analysis (NL2Py) by dynamically generating and executing Python code within a secure Vertex AI sandbox for statistical profiling and visualization.<\/td><\/tr><tr><td><strong>BQML Agent<\/strong><\/td><td>Orchestrates BigQuery ML workflows, including model training, batch inference, and model lifecycle management.<\/td><\/tr><tr><td><strong>Artifact Agent<\/strong><\/td><td>Handles session-scoped file management to save, retrieve, and list generated execution artifacts (images, CSVs, PDFs).<\/td><\/tr><tr><td><strong>Report Agent<\/strong>*<\/td><td>Synthesizes audit findings into multi-format reports (Markdown, HTML, PDF, JSON) and manages artifact uploads to Google Cloud Storage.<\/td><\/tr><tr><td><strong>Comparison Agent<\/strong>*<\/td><td>Executes schema-level and volume-level structural comparisons across discrete BigQuery tables.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<ul class=\"wp-block-list\">\n<li><em>Note: The Report and Comparison agents are fully implemented in the repository but are intentionally disconnected from the Root Agent&#8217;s execution chain in this evaluation 
instance.<\/em><\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Key system capabilities<\/strong><\/h2>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Dynamic Intent Classification:<\/strong>&nbsp;The Root Agent accurately decomposes complex natural language requests, determines the optimal execution path, and dynamically invokes the correct sub-agent chain.<\/li>\n\n\n\n<li><strong>NL2SQL Querying:<\/strong>&nbsp;The BigQuery Agent translates natural language into optimised SQL, executing it directly against the data warehouse to extract and analyze data without friction.<\/li>\n\n\n\n<li><strong>NL2Py Analysis:<\/strong>&nbsp;The Analytics Agent dynamically generates and executes Python code within a secure Vertex AI Code Interpreter sandbox, enabling advanced statistical profiling, custom visualisations, and complex cross-dataset joins.<\/li>\n\n\n\n<li><strong>Autonomous Data Auditing:<\/strong>&nbsp;The Auditor Agent runs a comprehensive suite of structural and logical validation checks against BigQuery datasets, producing structured, reproducible diagnostic reports.<\/li>\n\n\n\n<li><strong>Stateful Memory Persistence:<\/strong>&nbsp;By querying a persistent&nbsp;<strong>Memory Bank<\/strong>, the Auditor contextualizes newly detected anomalies against historically resolved or suppressed issues, ensuring the agent learns and adapts from past executions.<\/li>\n\n\n\n<li><strong>Multi-Format Report Compilation:<\/strong>&nbsp;The Report Agent synthesizes raw audit findings into polished, user-preferred output formats and automatically pushes the final artifacts to Google Cloud Storage for human review.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Long-term memory bank<\/strong><\/h2>\n\n\n\n<p>The system&#8217;s&nbsp;<strong>persistent <a href=\"https:\/\/google.github.io\/adk-docs\/sessions\/memory\/\">Memory Bank<\/a><\/strong>, hosted on&nbsp;<a 
href=\"https:\/\/cloud.google.com\/vertex-ai\/generative-ai\/docs\/agent-engine\/overview\">Vertex AI Agent Engine<\/a>, gives the auditor institutional knowledge across sessions, eliminating cold-start noise and adapting its behaviour to individual user preferences over time. The Memory Bank tracks two custom semantic categories:<\/p>\n\n\n\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th><strong>Topic<\/strong><\/th><th><strong>Examples<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Data Quality Issues<\/strong><\/td><td>Missing columns, inflated metric values, recurring table-level anomalies<\/td><\/tr><tr><td><strong>User Preferences<\/strong><\/td><td><em>&#8220;Always include an executive summary&#8221;<\/em>;&nbsp;<em>&#8220;Flag outliers beyond 3\u03c3&#8221;<\/em><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>How memory is saved<\/strong><\/h3>\n\n\n\n<p>Memory persistence is&nbsp;user-directed. The Auditor invokes the&nbsp;<code>save_memory<\/code>&nbsp;tool only when explicitly asked (e.g.,&nbsp;<em>&#8220;\u2026and save these findings to memory&#8221;<\/em>). The&nbsp;<strong>Vertex AI Agent Engine<\/strong>&nbsp;then asynchronously extracts clean semantic facts from the session, stripping noise and verbose phrasing, and indexes them against the user&#8217;s <code>user_id<\/code> scope. When a new lesson or correction occurs, the agent doesn&#8217;t just blindly append a new memory; instead, it actively scans for similar existing entries. If a related memory is found, the system updates and refines the existing rule rather than creating a duplicate. 
This deduplication process ensures the knowledge base remains clean, concise, and highly effective, preventing the auditor from getting overwhelmed by redundant information over time.<\/p>\n\n\n\n<p>Agent Engine also&nbsp;<strong>persists the full conversation session<\/strong>&nbsp;alongside the extracted facts, meaning the complete interaction history (including tool calls, SQL queries, and agent reasoning) is retained across runs. This gives the system two complementary layers of recall:&nbsp;<strong>structured fact memory<\/strong>&nbsp;(distilled semantic facts) and&nbsp;<strong>full session continuity<\/strong>&nbsp;(complete conversation history), both managed within a single Vertex AI service.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>How memory is loaded<\/strong><\/h3>\n\n\n\n<p>When a query includes a load instruction (e.g.,&nbsp;<em>&#8220;load memories then check table X&#8221;<\/em>), the Auditor calls the ADK&nbsp;<code>LoadMemoryTool<\/code>, which runs a&nbsp;<strong>similarity search<\/strong>&nbsp;against the Memory Bank scoped to the current&nbsp;<code>user_id<\/code>. 
Retrieved facts are injected into the agent&#8217;s working context before analysis begins, enabling it to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Suppress re-flagging of known, already-resolved issues<\/li>\n\n\n\n<li>Apply user formatting preferences from the first response<\/li>\n\n\n\n<li>Re-verify previously detected anomalies to check if they persist<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>Technology stack<\/strong><\/h1>\n\n\n\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th><strong>Component<\/strong><\/th><th><strong>Technology<\/strong><\/th><th><strong>Role<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Agent Framework<\/strong><\/td><td><a href=\"https:\/\/google.github.io\/adk-docs\/\">Google Agent Development Kit (ADK)<\/a><\/td><td>Agent orchestration, tool binding, session management, and A2A protocol support<\/td><\/tr><tr><td><strong>LLM (Agents)<\/strong><\/td><td><a href=\"https:\/\/deepmind.google\/technologies\/gemini\/flash\/\">Gemini 2.5 Flash<\/a><\/td><td>Powers all sub-agents; chosen for low latency and strong instruction-following<\/td><\/tr><tr><td><strong>LLM (Judge)<\/strong><\/td><td><a href=\"https:\/\/deepmind.google\/technologies\/gemini\/flash\/\">Gemini 2.5 Flash<\/a><\/td><td>Powers the LLM-as-a-Judge evaluation pipeline; stronger reasoning for unbiased scoring<\/td><\/tr><tr><td><strong>Data Warehouse<\/strong><\/td><td><a href=\"https:\/\/cloud.google.com\/bigquery\/docs\">Google BigQuery<\/a><\/td><td>Primary data store; queried via NL2SQL by both the BigQuery and Auditor agents<\/td><\/tr><tr><td><strong>Code Execution<\/strong><\/td><td><a href=\"https:\/\/cloud.google.com\/vertex-ai\/generative-ai\/docs\/extensions\/code-interpreter\">Vertex AI Code Interpreter<\/a><\/td><td>Sandboxed Python runtime for the Analytics Agent<\/td><\/tr><tr><td><strong>Long-Term Memory<\/strong><\/td><td><a 
href=\"https:\/\/cloud.google.com\/vertex-ai\/generative-ai\/docs\/agent-engine\/overview\">Vertex AI Agent Engine<\/a><\/td><td>Hosts the persistent Memory Bank; enables cross-session learning and recall<\/td><\/tr><tr><td><strong>Artifact Storage<\/strong><\/td><td><a href=\"https:\/\/cloud.google.com\/storage\/docs\">Google Cloud Storage<\/a><\/td><td>Persistent store for generated reports, JSON profiles, and session artifacts<\/td><\/tr><tr><td><strong>Deployment<\/strong><\/td><td><a href=\"https:\/\/cloud.google.com\/run\/docs\">Google Cloud Run<\/a><\/td><td>Hosts both the A2A API backend and the interactive web UI as containerized services<\/td><\/tr><tr><td><strong>CI\/CD<\/strong><\/td><td><a href=\"https:\/\/support.atlassian.com\/bitbucket-cloud\/docs\/get-started-with-bitbucket-pipelines\/\">Bitbucket Pipelines<\/a><\/td><td>Automated build, Docker image push to Artifact Registry, and Cloud Run deployment on merge<\/td><\/tr><tr><td><strong>Interoperability<\/strong><\/td><td><a href=\"https:\/\/google.github.io\/A2A\/\">A2A Protocol<\/a><\/td><td>Open HTTP-based standard enabling external agents and services to discover and invoke the system programmatically<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Dataset synthesis<\/strong><\/h2>\n\n\n\n<p>Evaluating an autonomous auditing agent requires a controlled, reproducible ground truth, something that real-world production data cannot provide, since errors are unverified by definition. 
To solve this, we engineered a&nbsp;<strong>modular synthetic corruption pipeline<\/strong>&nbsp;that operates on <a href=\"https:\/\/www.notion.so\/Data-Pod-ff52494e8bd983ab871f8170d7716be2?pvs=21\">proprietary synthetic datasets<\/a> designed to mirror real-world marketing dynamics, and produces deterministically corrupted BigQuery tables accompanied by a complete ground truth registry for automated scoring.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Source dataset<\/strong><\/h3>\n\n\n\n<p>To ensure robust and repeatable results, we start with <a href=\"https:\/\/www.notion.so\/Data-Pod-ff52494e8bd983ab871f8170d7716be2?pvs=21\">proprietary synthetic datasets<\/a>, which give us complete control and clear visibility into the drivers of campaign performance. The source dataset is a digital marketing performance table comprising <strong>7,618 rows and 87 feature columns<\/strong>. Each record represents a unique daily measurement at the intersection of a campaign, audience segment, delivery platform, ad placement, and creative asset. 
Columns are organised into six functional groups:<\/p>\n\n\n\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th><strong>Column Group<\/strong><\/th><th><strong>Description<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Brand &amp; Advertiser<\/strong><\/td><td>Identity of the brand and advertiser running the campaign<\/td><\/tr><tr><td><strong>Campaign &amp; Media Buy<\/strong><\/td><td>Campaign IDs, names, and media buying hierarchy<\/td><\/tr><tr><td><strong>Geo Targeting<\/strong><\/td><td>Geographic targeting and exclusion rules (countries, regions, cities)<\/td><\/tr><tr><td><strong>Audience Targeting<\/strong><\/td><td>Demographic segments: gender, age group, generation, interests, and behaviors<\/td><\/tr><tr><td><strong>Delivery &amp; Platform<\/strong><\/td><td>Campaign objective, platform (Meta\/Instagram), device type, and ad placement<\/td><\/tr><tr><td><strong>Performance Metrics<\/strong><\/td><td>Funnel KPIs: impressions, clicks, spend, conversions, video plays, video completions, landing page views, add-to-cart events, and purchases<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The&nbsp;Performance Metrics&nbsp;group is the most analytically significant: the columns encode a strict, real-world causal funnel (<code>impressions \u2192 clicks \u2192 landing page views \u2192 add-to-cart \u2192 conversions\/purchases<\/code>) where each downstream metric is physically bounded by the upstream one. Violations of these relationships (for example, <code>clicks &gt; impressions<\/code>) are logically impossible under normal operating conditions. This funnel structure forms the basis for all logical error injection. Additionally, the dual attribution windows (immediate vs. 
7-day) introduce latent complexity: the complex prompt level identified cross-window contradictions as an un-injected source of potential logical ambiguity.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Corruption pipeline<\/strong><\/h3>\n\n\n\n<p>The pipeline is structured as a three-stage process. Anonymisation is performed first, followed by structural error injection and, finally, logical error injection. These stages consist of composable modules that can be selectively enabled or combined to produce datasets with precisely controlled corruption profiles. The error rate is fully configurable per stage and can be held&nbsp;<strong>constant<\/strong>&nbsp;(for fixed-recall benchmarks) or&nbsp;<strong>varied progressively<\/strong>&nbsp;from 5% to 40% (to model the agent&#8217;s sensitivity as a function of corruption severity).<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Stage 1 \u2014 Anonymisation<\/strong><\/h4>\n\n\n\n<p>As a preprocessing step, the pipeline replaces PII and commercially sensitive fields (brand, campaign, creative) with generic identifiers (e.g.,&nbsp;<code>brand_1<\/code>,&nbsp;<code>campaign_1<\/code>), while preserving all structural relationships.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Stage 2 \u2014 Structural errors<\/strong><\/h4>\n\n\n\n<p>Structural anomalies target individual cells, columns, or rows, and are generally detectable through standard data profiling techniques. 
This stage consists of five independent injection modules:<\/p>\n\n\n\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th><strong>Error Type<\/strong><\/th><th><strong>Simulation \/ Injection Method<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Missing Values<\/strong>&nbsp;(Nulls)<\/td><td>Injects&nbsp;<code>NaN<\/code>&nbsp;values across a configurable subset of columns to simulate missing or dropped data.<\/td><\/tr><tr><td><strong>Outliers<\/strong><\/td><td>Replaces numeric values with statistical extremes (<code>mean \u00b1 k \u00d7 std<\/code>) to simulate sensor noise or ETL overflow.<\/td><\/tr><tr><td><strong>Duplicate Rows<\/strong><\/td><td>Duplicates randomly selected rows and re-inserts them at random positions to simulate pipeline idempotency failures.<\/td><\/tr><tr><td><strong>Categorical Errors<\/strong><\/td><td>Replaces valid categories with unique random alphanumeric strings (e.g.,&nbsp;<code>a3x7h9<\/code>) guaranteed not to be in any valid vocabulary.<\/td><\/tr><tr><td><strong>Schema Drift<\/strong>&nbsp;(Column Drops)<\/td><td>Randomly removes entire columns to simulate upstream data source failures.<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Stage 3 \u2014 Logical errors<\/strong><\/h4>\n\n\n\n<p>Logical errors are the hardest class of anomalies to detect. Every individual cell value is numerically valid; the violation only becomes apparent when two or more columns are evaluated relationally. 
This stage injects records that violate any of the following seven business rules:<\/p>\n\n\n\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th><strong>#<\/strong><\/th><th><strong>Rule Violated<\/strong><\/th><th><strong>Condition Injected<\/strong><\/th><\/tr><\/thead><tbody><tr><td>1<\/td><td>Clicks \u2264 Impressions<\/td><td><code>clicks &gt; impressions<\/code><\/td><\/tr><tr><td>2<\/td><td>Conversions \u2264 Clicks<\/td><td><code>conversions &gt; clicks<\/code><\/td><\/tr><tr><td>3<\/td><td>Spend requires Impressions<\/td><td><code>spend &gt; 0<\/code>&nbsp;AND&nbsp;<code>impressions = 0<\/code><\/td><\/tr><tr><td>4<\/td><td>Video Completions \u2264 Plays<\/td><td><code>video_completions &gt; video_plays<\/code><\/td><\/tr><tr><td>5<\/td><td>Purchases require Add-to-Cart<\/td><td><code>purchases &gt; 0<\/code>&nbsp;AND&nbsp;<code>add_to_cart = 0<\/code><\/td><\/tr><tr><td>6<\/td><td>Landing Page Views \u2264 Clicks<\/td><td><code>landing_page_views &gt; clicks<\/code><\/td><\/tr><tr><td>7<\/td><td>Non-negative Metric Values<\/td><td>Negative values injected into&nbsp;<code>impressions<\/code>,&nbsp;<code>clicks<\/code>,&nbsp;<code>spend<\/code>, or&nbsp;<code>conversions<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Ground truth registry<\/strong><\/h2>\n\n\n\n<p>The evaluation framework is anchored by our <strong>ground truth dataset,<\/strong> a structured registry of all <strong>59 BigQuery test tables<\/strong> used in the experiment suite. 
Each row maps a table&#8217;s BigQuery name to its complete injection specification:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>the number of logical errors injected (out of a maximum of 7 possible rule types)<\/li>\n\n\n\n<li>the exact error type labels (e.g., clicks_exceed_impressions, purchases_without_add_to_cart)<\/li>\n\n\n\n<li>the number of structural errors injected (out of 4 possible types), and their corresponding labels (e.g., null values, outliers, duplicates, categorical errors)<\/li>\n<\/ul>\n\n\n\n<p>The registry covers two tiers of test tables: <strong>48 single-error tables<\/strong> (examples 1\u201348), each containing one isolated error type at varying injection rates of 5%, 10%, 20%, and 40%, and <strong>11 compound synthetic tables<\/strong> (examples 49\u201359) with progressively stacked errors, starting from a single logical violation and escalating to the maximum combination of all 7 logical and all 4 structural error types simultaneously.<\/p>\n\n\n\n<figure class=\"wp-block-table aligncenter has-small-font-size\"><table class=\"has-fixed-layout\"><thead><tr><th><strong>Table Name<\/strong><\/th><th><strong>Logical Errors<\/strong><\/th><th><strong>Logical Error Types<\/strong><\/th><th><strong>Structural Errors<\/strong><\/th><th><strong>Structural Error Types<\/strong><\/th><\/tr><\/thead><tbody><tr><td><code>..._categorical_error_5_percent<\/code><\/td><td>0<\/td><td>\u2014<\/td><td>1 \/ 4<\/td><td><code>categorical errors<\/code><\/td><\/tr><tr><td><code>..._logical_error_1_5_percent<\/code><\/td><td>1 \/ 7<\/td><td><code>clicks_exceed_impressions<\/code><\/td><td>0<\/td><td>\u2014<\/td><\/tr><tr><td><code>log_1_5_pt..._dup_5_pt_cat_5_pt<\/code><\/td><td>7 \/ 7<\/td><td><code>clicks_exceed_impressions, conversions_exceed_clicks, landing_page_views_exceed_clicks, negative_metric_values, purchases_without_add_to_cart, spend_with_zero_impressions, video_completions_exceed_plays<\/code><\/td><td>4 \/ 4<\/td><td><code>null values, 
outliers, duplicates, categorical errors<\/code><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">Evaluation pipeline<\/h1>\n\n\n\n<p>Rigorous evaluation of the auditor agent is essential to ensure it consistently and accurately identifies true data corruption without generating false positives. To accomplish this, the evaluation pipeline uses an automated, four-step process to continuously assess the agent&#8217;s performance. First, the pipeline utilises synthetic ground truth data stored in BigQuery tables, seeded with deliberate structural and logical errors (such as NULLs, duplicates, and business-rule violations). Second, the auditor agent is executed against these tables through multiple experimental setups, including prompt comparisons (simple vs. complex queries), table anomaly sweeps, and memory ablation studies (cold starts vs. loading past audits). During these runs, the agent uses its SQL tools to investigate the data and generates a comprehensive final audit report.<\/p>\n\n\n\n<p>Third, rather than relying on slow manual review, we automate the evaluation using an LLM-as-a-Judge approach. A separate Gemini Flash instance receives the agent&#8217;s full audit report alongside the complete ground truth registry. Acting as an expert evaluator, the judge compares the outputs and produces a structured scorecard with \u2705\/\u274c verdicts and brief explanations for every error category. This eliminates subjective scoring bias and allows new prompt designs or memory configurations to be evaluated end-to-end in minutes. 
Finally, these scorecards are parsed to compute precision, recall, and F1 scores per error type, which are then exported to CSV for detailed analysis.<\/p>\n\n\n\n<p>This is also illustrated in the diagram below:<\/p>\n\n\n\n<pre class=\"wp-block-code has-small-font-size\"><code>\u250c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2510\n\u2502                       EVALUATION PIPELINE                      \u2502\n\u251c\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2524\n\u2502 1. Synthetic Data   \u2502 Tables in BigQuery with injected errors: \u2502\n\u2502    (ground truth)   \u2502 NULLs, duplicates, outliers, categorical \u2502\n\u2502                     \u2502 errors, logical violations.              \u2502\n\u2502                     \u2502                                          \u2502\n\u2502 2. Run Auditor      \u2502 4 experiments test different factors:    \u2502\n\u2502    Agent            \u2502  \u2192 Exp 1: Prompt Comparison              \u2502\n\u2502                     \u2502  \u2192 Exp 2: Table Sweep                    \u2502\n\u2502                     \u2502  \u2192 Exp 3: Memory Ablation                \u2502\n\u2502                     \u2502 Agent uses tools to run SQL &amp; produce an \u2502\n\u2502                     \u2502 audit report per run.                    
\u2502\n\u2502                     \u2502                                          \u2502\n\u2502 3. LLM-as-Judge     \u2502 Gemini Flash compares agent report to    \u2502\n\u2502    (Gemini Flash)   \u2502 ground truth.                            \u2502\n\u2502                     \u2502  \u2192 Scores each error: \u2705 detected \/ \u274c   \u2502\n\u2502                     \u2502                                          \u2502\n\u2502 4. Metrics          \u2502 Parse scorecards \u2192 compute precision,    \u2502\n\u2502    Generation       \u2502 recall, F1 per error type.               \u2502\n\u2502                     \u2502  \u2192 Save to CSV                           \u2502\n\u2514\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2500\u2518\n<\/code><\/pre>\n\n\n\n<h2 class=\"wp-block-heading\">Experimental setup<\/h2>\n\n\n\n<p>To rigorously validate the Auditor agent&#8217;s detection capabilities, we designed a suite of three complementary experiments, each isolating a different factor that influences audit performance:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Experiment 1 \u2014 Prompt Comparison:<\/strong>&nbsp;Measures how the&nbsp;<em>complexity and specificity<\/em>&nbsp;of the user prompt affects the agent&#8217;s ability to detect both structural and logical errors, comparing a simple exploratory prompt against a medium-structured prompt and a forensic-level complex prompt.<\/li>\n\n\n\n<li><strong>Experiment 2 \u2014 Table Sweep:<\/strong>\u00a0Stress-tests the agent&#8217;s <em>scalability and robustness<\/em> by sweeping across 11 synthetic tables with progressively stacked error combinations, ranging 
from a single isolated violation to the maximum of 11 simultaneous error types. This maps the detection ceiling under the best-performing prompt.<\/li>\n\n\n\n<li><strong>Experiment 3 \u2014 Memory Ablation:<\/strong>&nbsp;Isolates the&nbsp;<em>contribution of the long-term Memory Bank<\/em>&nbsp;by comparing a cold-start baseline (no prior context) against a memory-augmented run, quantifying how historical context from past audit sessions improves detection accuracy.<\/li>\n<\/ol>\n\n\n\n<p>Together, these experiments span the key dimensions of agent performance: prompt engineering, error complexity, and contextual memory, providing a comprehensive view of the system&#8217;s strengths and current limitations. All experiments use the same synthetic corruption pipeline and LLM-as-a-Judge scoring framework described above.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Experiment 1: Prompt Comparison<\/h3>\n\n\n\n<p>Our first research question was whether <strong>prompt specification<\/strong> (instructional structure, domain constraints, and required check set) is a first-order driver of audit performance, independent of the underlying dataset and injected corruption profile. 
In other words, does increasing <strong>prompt information content<\/strong> and enforcing explicit <strong>cross-column invariants<\/strong> improve the agent\u2019s ability to surface structural anomalies and relational business-rule violations, and what is the marginal lift as we move from a zero-shot \u201chealth check\u201d prompt to a forensic, hypothesis-driven audit prompt?<\/p>\n\n\n\n<p>To isolate this variable, we held the dataset and error profile constant, injecting known errors at a flat 5% rate per type into a table of anonymised marketing data, and varied only the prompt complexity across three levels:<\/p>\n\n\n\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th><strong>Prompt Level<\/strong><\/th><th><strong>Description<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Simple<\/td><td>Basic health check: explore, verify, report<\/td><\/tr><tr><td>Medium<\/td><td>Structured assessment organized by data quality pillars<\/td><\/tr><tr><td>Complex<\/td><td>Forensic audit with cross-column hypothesis testing and business context<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Results<\/h4>\n\n\n\n<p>To quantify the impact of prompt engineering, we measured the detection accuracy for each of the three prompt levels against our ground truth dataset. 
The table below summarizes the results:<\/p>\n\n\n\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th><strong>Metric<\/strong><\/th><th><strong>Simple Prompt<\/strong><\/th><th><strong>Medium Prompt<\/strong><\/th><th><strong>Complex Prompt<\/strong><\/th><\/tr><\/thead><tbody><tr><td>Structural errors detected<\/td><td>3\/4<\/td><td>3\/4<\/td><td>4\/4<\/td><\/tr><tr><td>Logical errors detected<\/td><td>1\/7<\/td><td>3\/7<\/td><td>4\/7<\/td><\/tr><tr><td>Total score<\/td><td>4\/11 (36%)<\/td><td>6\/11 (55%)<\/td><td>8\/11 (73%)<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>The Simple Prompt (scoring 4 out of 11) successfully detected missing values, outliers, categorical errors, and negative metric values, but failed to detect duplicate rows and missed most cross-column logical violations. The Medium Prompt (scoring 6 out of 11) was a significant step up; it detected missing values, identified duplicate rows, and found categorical errors, while additionally detecting key funnel violations like clicks being greater than impressions and conversions being greater than clicks. The Complex Prompt (scoring 8 out of 11) was the strongest performer, achieving 100% on structural errors with forensic-level explanations. On logical errors, it detected negative metrics, two funnel violations, and video completion inconsistencies, and the Auditor also discovered un-injected errors beyond the seeded corruption, including data mapping flaws. 
Our key observations are as follows:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li><strong>Prompt complexity directly impacts detection quality.<\/strong> Moving from simple to complex prompts increased total detection from 36% to 73%.<\/li>\n\n\n\n<li><strong>Structural errors are easier to detect than logical errors.<\/strong> Even the simplest prompt found 75% of structural errors, while logical error detection ranged from 14% to 57%.<\/li>\n\n\n\n<li><strong>The complex prompt exhibited emergent behaviour<\/strong>, discovering data quality issues beyond the injected errors, which points to analytical depth beyond the prompted checks. Specifically, it identified a one-to-many mapping flaw where a single campaign_id mapped to multiple campaign_names, and logical contradictions between 7-day and immediate conversion windows.<\/li>\n\n\n\n<li><strong>Error analysis reveals specific failure modes.<\/strong> For the &#8220;Spend &gt; 0 while Impressions = 0&#8221; error, the agent checked the inverse condition (&#8220;Impressions &gt; 0 AND Spend = 0&#8221;), showing that it targeted the right pair of columns but inverted the direction of the check. This suggests that targeted few-shot examples or tool-level guardrails could address remaining gaps.<\/li>\n\n\n\n<li><strong>Certain error types remain challenging<\/strong> regardless of prompt level, particularly those requiring knowledge of the full marketing funnel (e.g., purchases without add-to-cart, landing page views vs. clicks). These represent areas for future improvement because evaluating complex logical anomalies requires a deep contextual understanding of domain-specific business rules. 
Providing this context, whether through a persistent memory system that stores historical performance baselines and funnel definitions, or via highly explicit user prompts that clearly map expected relationships, is essential for the agent to accurately validate these scenarios rather than relying on generic data logic.<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Experiment 2: Table sweep<\/h3>\n\n\n\n<p>Having identified the complex prompt as the strongest performer, we next evaluated its scaling <strong>behaviour under increasing anomaly superposition<\/strong>: specifically, how detection performance (precision\/recall trade-offs) degrades or saturates as the number of simultaneously injected error modes per table increases. While a single-error table primarily probes <em>per-check sensitivity<\/em>, production-like settings exhibit <strong>error co-occurrence and interaction effects<\/strong> (masking, confounding, and correlated rule violations) that can materially alter the agent\u2019s search strategy, query budget, and false-positive propensity.<\/p>\n\n\n\n<p>To probe this, we ran the Auditor against&nbsp;<strong>11 synthetic BigQuery tables<\/strong>&nbsp;with progressively stacked error combinations \u2014 from a single isolated logical violation up to the maximum of all 7 logical and all 4 structural error types simultaneously (11 errors total per table). 
All runs used the&nbsp;<code>complex<\/code>&nbsp;prompt level, allowing us to map the agent&#8217;s detection ceiling as the error landscape grows increasingly complex.<\/p>\n\n\n\n<p><strong>Results: Per-table and Aggregate Metrics<\/strong><\/p>\n\n\n\n<p>*(Legend:&nbsp;L = Logical errors, S = Structural errors)<\/p>\n\n\n\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th><strong>Table<\/strong><\/th><th><strong>Error Profile<\/strong><\/th><th><strong>Expected<\/strong><\/th><th><strong>TP<\/strong><\/th><th><strong>FP<\/strong><\/th><th><strong>FN<\/strong><\/th><th><strong>F1 Score<\/strong><\/th><\/tr><\/thead><tbody><tr><td><code>synthetic_1_log_error<\/code><\/td><td>1L<\/td><td>1<\/td><td>1<\/td><td>0<\/td><td>0<\/td><td><strong>1.000<\/strong>&nbsp;\u2705<\/td><\/tr><tr><td><code>synthetic_2_log_errors<\/code><\/td><td>2L<\/td><td>2<\/td><td>2<\/td><td>6<\/td><td>0<\/td><td><strong>0.400<\/strong>&nbsp;\u26a0\ufe0f<\/td><\/tr><tr><td><code>synthetic_3_log_errors<\/code><\/td><td>3L<\/td><td>3<\/td><td>0<\/td><td>0<\/td><td>3<\/td><td><strong>0.000<\/strong>&nbsp;\u274c<\/td><\/tr><tr><td><code>synthetic_4_log_errors<\/code><\/td><td>4L<\/td><td>4<\/td><td>4<\/td><td>0<\/td><td>0<\/td><td><strong>1.000<\/strong>&nbsp;\u2705<\/td><\/tr><tr><td><code>synthetic_5_log_errors<\/code><\/td><td>5L<\/td><td>5<\/td><td>5<\/td><td>0<\/td><td>0<\/td><td><strong>1.000<\/strong>&nbsp;\u2705<\/td><\/tr><tr><td><code>synthetic_6_log_errors<\/code><\/td><td>6L<\/td><td>6<\/td><td>6<\/td><td>0<\/td><td>0<\/td><td><strong>1.000<\/strong>&nbsp;\u2705<\/td><\/tr><tr><td><code>synthetic_7_log_errors<\/code><\/td><td>7L<\/td><td>7<\/td><td>7<\/td><td>0<\/td><td>0<\/td><td><strong>1.000<\/strong>&nbsp;\u2705<\/td><\/tr><tr><td><code>synthetic_7_log_1_struct<\/code><\/td><td>7L+1S<\/td><td>8<\/td><td>2<\/td><td>0<\/td><td>6<\/td><td><strong>0.400<\/strong>&nbsp;\u26a0\ufe0f<\/td><\/tr><tr><td><code>synthetic_7_log_2_struct<\/code><\/td><td>7L+2S<\/td><t
d>9<\/td><td>9<\/td><td>0<\/td><td>0<\/td><td><strong>1.000<\/strong>&nbsp;\u2705<\/td><\/tr><tr><td><code>synthetic_7_log_3_struct<\/code><\/td><td>7L+3S<\/td><td>10<\/td><td>10<\/td><td>0<\/td><td>0<\/td><td><strong>1.000<\/strong>&nbsp;\u2705<\/td><\/tr><tr><td><code>synthetic_7_log_4_struct<\/code><\/td><td>7L+4S<\/td><td>11<\/td><td>11<\/td><td>0<\/td><td>0<\/td><td><strong>1.000<\/strong>&nbsp;\u2705<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th><strong>Metric<\/strong><\/th><th><strong>Value<\/strong><\/th><\/tr><\/thead><tbody><tr><td><strong>Perfect Detection (F1 = 1.0)<\/strong><\/td><td><strong>8 \/ 11 tables (72.7%)<\/strong><\/td><\/tr><tr><td><strong>Total True Positives (TP)<\/strong><\/td><td>57<\/td><\/tr><tr><td><strong>Total False Positives (FP)<\/strong><\/td><td>6<\/td><\/tr><tr><td><strong>Total False Negatives (FN)<\/strong><\/td><td>9<\/td><\/tr><tr><td><strong>Overall Precision<\/strong><\/td><td>57 \/ 63 =&nbsp;<strong>0.905<\/strong><\/td><\/tr><tr><td><strong>Overall Recall<\/strong><\/td><td>57 \/ 66 =&nbsp;<strong>0.864<\/strong><\/td><\/tr><tr><td><strong>Overall F1 Score<\/strong><\/td><td><strong>0.883<\/strong><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>We also tested the auditor agent&#8217;s baseline ability to detect the same logical error at different prevalence levels (5%, 10%, 20%, and 40%). The agent successfully detected and accurately quantified the discrepancy at the 5%, 10%, 20% and 40% rates, demonstrating robust, range-agnostic capability that catches both rare edge cases and widespread corruption equally well. 
Ultimately, the results indicate that error rate prevalence does not significantly impact the agent&#8217;s detection performance when the audit completes successfully.<\/p>\n\n\n\n<p>Finally, we ran the identical configuration three times for one table as a <strong>consistency check<\/strong>, and observed perfect reproducibility: the auditor consistently detected both injected errors with the same metrics and explanations across all three runs. This repeatable behaviour across three runs suggests that the complex prompt configuration is stable, reducing the need for redundant audits.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Experiment 3: Memory ablation<\/h3>\n\n\n\n<p>The previous experiments characterized the agent\u2019s <strong>single-session capability envelope<\/strong> under a fixed prompt specification. In a production setting, however, auditing is inherently <strong>iterative and longitudinal<\/strong>: the agent re-encounters the same schemas, recurring anomaly modes, and known \u201cbenign\u201d deviations across repeated runs. This motivates a key question: does <strong>persistent, user-scoped memory<\/strong> (i.e., accumulated context from prior audits) measurably improve detection performance and efficiency over time by biasing the agent toward higher-yield checks and reinstating domain-specific invariants without re-deriving them from scratch?<\/p>\n\n\n\n<p>To isolate the contribution of the long-term Memory Bank, we ran the agent twice on the same table under identical conditions, first with no prior context (cold start) and then with memories loaded from previous audit sessions. We evaluated the agent on a synthetic table (<code>synthetic_7_log_4_struct<\/code>) containing 7,999 rows, deliberately corrupted with 11 distinct error types (4 structural, 7 logical) at a ~5% error rate. 
The two conditions differed only in whether the agent had access to its Memory Bank before beginning the audit.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Results<\/h4>\n\n\n\n<p>Without memory, the agent received a minimalist zero-shot prompt (<em>&#8220;Check if there are any errors for table X?&#8221;<\/em>) and relied solely on exploratory analysis. Under these cold-start conditions, it achieved an overall detection rate of <strong>45% (5\/11)<\/strong>, identifying 2 of 4 structural errors and 3 of 7 logical errors.<\/p>\n\n\n\n<p>When the same agent was instructed to load past context (<em>&#8220;load memories about auditing tables&#8230;&#8221;<\/em>), detection improved substantially. By retrieving specific logical checks and known error patterns from prior sessions, the memory-augmented agent achieved a <strong>91% detection rate (10\/11)<\/strong>, a 100% relative improvement over the baseline (from 5 to 10 of the 11 injected errors).<\/p>\n\n\n\n<p>Structural error detection reached a perfect 100% (4\/4), while logical error detection rose from 43% to 86% (6\/7), successfully uncovering complex violations such as negative metric values and spend recorded against zero impressions.<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"766\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/memory_detection_comparison-1024x766.png\" alt=\"\" class=\"wp-image-1145\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/memory_detection_comparison-1024x766.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/memory_detection_comparison-300x224.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/memory_detection_comparison-768x574.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/memory_detection_comparison-1536x1149.png 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/memory_detection_comparison-2048x1532.png 
2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The figure shows a clear performance gap between the memory-augmented agent (blue) and the baseline agent without memory (red). For structural errors, memory enabled perfect detection (100%) compared to 50% without memory. For logical errors, memory improved detection from 43% to 86%, demonstrating that access to prior audit patterns and domain knowledge substantially enhances the agent&#8217;s ability to identify complex data quality issues beyond basic exploratory analysis.<\/p>\n\n\n\n<p>The sole undetected error was a funnel sequence violation (purchases without add-to-cart). The agent did not overlook this check: it reasoned that the validation could not be performed against the aggregated schema, which lacked the transaction-level granularity required to verify a purchase-to-cart relationship. The miss therefore reflects an analytically defensible decision rather than a detection failure.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Memory vs. prompt complexity<\/strong><\/h3>\n\n\n\n<p>These results raise an important nuance: if a prompt is already sufficiently detailed and structurally prescriptive (as in our complex prompt from Experiment 1), the memory module provides only marginal uplift. 
However, memory becomes highly valuable in continuous operational scenarios, where its benefits compound over time:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Adaptability:<\/strong>&nbsp;The agent iteratively learns from past edge cases, refining its checks with each audit cycle.<\/li>\n\n\n\n<li><strong>Contextual Awareness:<\/strong>&nbsp;It builds a deep, automated understanding of project-specific business rules and historically common data quality issues.<\/li>\n\n\n\n<li><strong>Consistency &amp; Efficiency:<\/strong>&nbsp;Audit coverage remains stable across sessions, with fewer redundant exploratory queries needed to reach comprehensive detection.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\"><strong>Cloud deployment<\/strong><\/h1>\n\n\n\n<p>The system is deployed as a production-grade, cloud-native service on <strong>Google Cloud<\/strong>, following a containerised, infrastructure-as-code workflow from local development through to automated CI\/CD and managed compute.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>CI\/CD pipeline<\/strong><\/h2>\n\n\n\n<p>The project uses a fully automated&nbsp;<strong>Bitbucket Pipelines<\/strong>&nbsp;CI\/CD pipeline with two distinct execution stages:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>On Pull Request:<\/strong>&nbsp;Automated linting and static analysis run immediately to enforce code quality standards before any merge is permitted.<\/li>\n\n\n\n<li><strong>On Merge to&nbsp;<code>main<\/code>:<\/strong>&nbsp;The pipeline builds <strong>two independent Docker images<\/strong> (one for the headless A2A API backend, one for the interactive web UI), pushes both to <strong>Google Artifact Registry<\/strong>, and triggers rolling deployments to their respective&nbsp;<strong>Cloud Run<\/strong>&nbsp;services. 
All runtime configuration (model identifiers, dataset IDs, memory service URIs, Cloud Storage bucket names) is injected exclusively via environment variables, ensuring no secrets or environment-specific values are hardcoded into the images.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Dual-service deployment architecture<\/strong><\/h2>\n\n\n\n<p>The agent is deployed as&nbsp;<strong>two independent, containerised Cloud Run services<\/strong>, each built from its own Dockerfile and serving a distinct class of consumer:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Service 1 \u2014 A2A API backend<\/strong><\/h3>\n\n\n\n<p>The backend service exposes a headless <strong>Agent-to-Agent (A2A)<\/strong> interface, an open, HTTP-based protocol designed for agent interoperability across frameworks. It publishes an&nbsp;<strong>Agent Card&nbsp;<\/strong>(a structured capability manifest) that allows any external service or AI agent to programmatically discover what the Data Quality Agent can do without requiring any knowledge of the underlying ADK implementation.<\/p>\n\n\n\n<p>Clients interact with the backend by sending structured&nbsp;<strong>JSON-RPC messages<\/strong>&nbsp;over standard HTTP. 
This means the auditor can be:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Integrated into classical data pipelines<\/strong>&nbsp;(like Airflow or dbt) to trigger automatic quality checks.<\/li>\n\n\n\n<li><strong>Orchestrated by other AI agents<\/strong>&nbsp;as part of a larger, automated workflow.<\/li>\n\n\n\n<li><strong>Invoked from any programming language<\/strong>, completely independent of the underlying Python stack.<\/li>\n\n\n\n<li><strong>Embedded in CI\/CD or alerting systems<\/strong>&nbsp;using simple HTTP requests.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Service 2 \u2014 Interactive Web UI<\/strong><\/h3>\n\n\n\n<p>The web UI service hosts an interactive conversational frontend, allowing data engineers and data scientists to interact directly with the full agent system through a browser. It communicates with the agent backend and provides a session-aware interface where users can issue audit requests, review structured findings, retrieve generated reports, and provide manual corrections that are subsequently persisted to the Memory Bank.<\/p>\n\n\n\n<p><strong>Vertex AI Agent Engine<\/strong>&nbsp;provides shared, persistent session storage for both services, ensuring that conversation context and session state survive container restarts and instance scale-out events.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h1 class=\"wp-block-heading\">Conclusion<\/h1>\n\n\n\n<p>This report presents an agent that automates data quality assurance, using a long-term memory architecture that both frees up engineering resources and refines its checks with every interaction. By taking over routine checks, it lets engineers focus on building infrastructure rather than performing manual data QA. 
It also catches errors in BigQuery tables before data scientists spend hours training models on corrupted data, shifting quality checks earlier in the pipeline. Ultimately, this compound intelligence ensures the system never resets; instead, every manual correction and interaction makes the auditor permanently better and more adapted to our data ecosystem.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Lessons learned<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Test with Synthetic Data First<\/strong>: Without a meticulously crafted synthetic dataset, we would have had no objective way to measure if our prompt strategies were improving the agent&#8217;s performance.<\/li>\n\n\n\n<li><strong>Memory is Context, Context is King<\/strong>: The ability to retrieve facts from past runs, including past errors, user feedback, and specific constraints, is what makes the difference between a stateless tool and an adaptive auditor.<\/li>\n\n\n\n<li><strong>Start Specific, Then Generalize<\/strong>: We focused on nailing the Auditor Agent&#8217;s specific use case with BigQuery first. 
This created a robust foundation before we expanded to other functions like report generation.<\/li>\n\n\n\n<li><strong>Leverage a Unified Cloud Ecosystem<\/strong>: Building entirely on <strong>Google Cloud services<\/strong> (ADK, Vertex AI, BigQuery, Cloud Run, Cloud Storage) eliminated integration friction between components and allowed us to move from prototype to production deployment without stitching together tools from multiple vendors.<\/li>\n<\/ul>\n","protected":false},"author":20,"featured_media":741,"template":"","meta":{"_acf_changed":false},"pod_status":[{"id":4,"name":"Active","slug":"active"}],"ppma_author":[{"id":20,"display_name":"Anastasios Stamoulakatos","first_name":"Anastasios","last_name":"Stamoulakatos","nickname":"anastasios.stamoulakatos","user_nicename":"anastasios-stamoulakatos","user_email":"anastasios.stamoulakatos@satalia.com","biographical_info":"Anastasios (Tasos) Stamoulakatos is a Data Scientist at Satalia (WPP), focusing on agentic AI solutions for marketing. His work spans multi-agent systems, RAG and GraphRAG, and image retrieval, developing scalable AI solutions from early-stage POCs to production. 
He holds a PhD in Applied AI and Computer Vision from the University of Strathclyde and has over four years of commercial experience across industries including marketing, agriculture, pharmaceuticals, oil and gas, and manufacturing, with a strong focus on applied research and turning complex AI into practical business value.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/headshot_small.jpg","job_title":"Data Scientist","is_lead":null,"display_as_researcher":null,"order_priority":null},{"id":21,"display_name":"Thanos Lyras","first_name":"Thanos","last_name":"Lyras","nickname":"thanos.lyras","user_nicename":"thanos-lyras","user_email":"thanos.lyras@satalia.com","biographical_info":"Thanos Lyras is a data scientist at Satalia specializing in building end-to-end AI pipelines and deploying real-world AI applications. A graduate in Computer Engineering with an MSc in Data Science, his work has led to research contributions in the fields of Big Data, AI, and database performance. 
Currently, he is focused on pioneering research in agentic projects, exploring the next wave of artificial intelligence.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/profile_photo.png","job_title":"Data Scientist","is_lead":false,"display_as_researcher":true,"order_priority":null}],"class_list":["post-155","research_pod","type-research_pod","status-publish","has-post-thumbnail","hentry","pod_status-active"],"acf":{"subtitle":"From manual checks to machine intuition: an AI agent that guards your data quality and never forgets a lesson.","quarter":"Q1 2026","focus_focus":"","is_featured":false},"related_publications":[{"type":"research_feed","title":"New Blog Post","slug":"new-blog-post-2","date":"2026-03-30T14:58:56+00:00"},{"type":"blog","title":"Meet Your New Agentic Data Guardian","slug":"data-quality-assurance-agent-blog-post","date":"2026-03-24T15:40:43+00:00"}],"featured_image_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/data_guardian_agent-resized-to-1024x600-1.jpeg","featured_image_sizes":{"thumbnail":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/data_guardian_agent-resized-to-1024x600-1-150x150.jpeg","medium":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/data_guardian_agent-resized-to-1024x600-1-300x176.jpeg","large":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/data_guardian_agent-resized-to-1024x600-1.jpeg","full":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/data_guardian_agent-resized-to-1024x600-1.jpeg"},"_links":{"self":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/research_pods\/155","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/research_pods"}],"about":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/types\/research_pod"}],"author":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?
rest_route=\/wp\/v2\/users\/20"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/media\/741"}],"wp:attachment":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=155"}],"wp:term":[{"taxonomy":"pod_status","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fpod_status&post=155"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=155"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}