{"id":1651,"date":"2026-03-24T11:58:44","date_gmt":"2026-03-24T11:58:44","guid":{"rendered":"https:\/\/cms.research.wpp.com\/?post_type=research_feed&#038;p=1651"},"modified":"2026-06-17T12:39:23","modified_gmt":"2026-06-17T12:39:23","slug":"brand-perception-atlas-pod-technical-walkthrough","status":"publish","type":"research_feed","link":"https:\/\/cms.research.wpp.com\/?research_feed=brand-perception-atlas-pod-technical-walkthrough","title":{"rendered":"Brand Perception Atlas Pod: Technical walkthrough"},"content":{"rendered":"","protected":false},"excerpt":{"rendered":"","protected":false},"author":19,"featured_media":0,"template":"","meta":{"_acf_changed":false},"tags":[],"content_types":[{"id":51,"name":"Technical Walkthrough","slug":"technical-walkthrough"}],"ppma_author":[{"id":19,"display_name":"Jaclyn Harron","first_name":"Jaclyn","last_name":"Harron","nickname":"jaclyn.harron","user_nicename":"jaclyn-harron","user_email":"jaclyn.harron@satalia.com","biographical_info":"Jaclyn is a Senior Data Scientist and Chartered Statistician, holding a Ph.D. in Applied Statistics. Her work focuses on bridging advanced statistical modelling with real-world applications, translating complex data into meaningful, actionable insight.\r\n\r\nShe has a strong background in both theoretical research and applied data science, specialising in time series forecasting, causal analysis, and machine learning for large-scale, high-dimensional data.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/jaclyn-1.jpg","job_title":null,"is_lead":null,"display_as_researcher":null,"order_priority":null}],"class_list":["post-1651","research_feed","type-research_feed","status-publish","hentry","content_type-technical-walkthrough"],"acf":{"content":"<p><!-- wp:paragraph --><\/p>\n<p><em>Brand leaders today are navigating without a map. 
Social media says one thing, surveys say another, and AI-generated content adds yet another layer, leaving strategists with no reliable way to know whether their brand\u2019s identity is consistent across channels. The Brand Perception Atlas solves this by bringing together five different sources of brand perception, TikTok, Instagram, Wikipedia, AI-generated summaries, and WPP\u2019s 30-year Brand Asset Valuator\u00ae (BAV) survey, into a single visual map covering over 200 brands and 4,000+ data points. By placing all of these signals on one comparable scale, it makes it possible for the first time to see how a brand looks across platforms side by side. Several notable patterns emerge from the analysis. Some brands, e.g. Brand D, an industrial equipment manufacturer, maintain a unified identity everywhere, while others, e.g. Brand G, a global hospitality company, occupy distinct perceptual territories across platforms, which may reflect a deliberate strategy to engage different audiences. 
The study also identifies &#8220;shared equity&#8221; pairs, revealing brands in different industries that perform the same emotional job for consumers, despite looking different online.<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>If you don&#8217;t care about the technical details, read <a href=\"https:\/\/research.wpp.com\/blog\/brand-perception-atlas-mapping-the-modern-brand-from-social-signal-to-core-equity\">our blog post<\/a> instead. The GitHub repo is also coming soon.<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":1} --><\/p>\n<h1 class=\"wp-block-heading\">The Brand Perception Atlas &#8211; a technical deep dive<\/h1>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The <em>Brand Perception Atlas<\/em> is an interactive decision-support tool that helps brand teams understand, compare, and explain brand perception across platforms. It combines embedding-space visualisation (UMAP) with interpretable clusters and cross-platform consistency scoring. The result is a tool that moves teams from &#8220;interesting maps&#8221; to recommendations that are better documented and easier to explain, because they link back to specific underlying perception signals and clearly show where sources agree or disagree. These recommendations can be used in brand reviews, competitor analysis, and campaign planning.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><em>Goal: a newcomer can reproduce the core outcomes in 2\u20133 days. 
For full setup and running instructions, refer to the GitHub README (GitHub repo coming soon) \u2014 this walkthrough provides the conceptual map and domain knowledge needed to understand what the code does and why.<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\ud83e\udded <strong>A note on what made this hard.<\/strong> The <em>Brand Perception Atlas<\/em> looks deceptively simple \u2014 embed text, project it, cluster it, display it. In reality, the single hardest problem was getting <strong>five fundamentally different perception sources<\/strong> to coexist in a shared space where distances actually mean something. WPP <a href=\"https:\/\/wppbav.com\/\">Brand Asset Valuator\u00ae (BAV)<\/a> data comes from structured survey scores transformed through an LLM (<a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a>) into prose. Social data comes from raw video transcripts and post captions, also transformed through an LLM (<a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a>). Even though both pass through the same embedding model, the <em>linguistic fingerprint<\/em> of each source dominates the resulting vectors \u2014 the map would split cleanly into &#8220;survey-sounding text&#8221; vs &#8220;social-sounding text&#8221; rather than grouping brands by actual perception. Solving this required iterating through prompt normalisation, aggregation rebalancing, and ultimately <strong>Procrustes alignment<\/strong> \u2014 a technique borrowed from shape analysis that rotates one embedding subspace onto another using shared anchor brands. Section 6 tells this story in full.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>1. 
Architecture overview<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:image {\"id\":402,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} --><\/p>\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"559\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-55-1024x559.png\" alt=\"\" class=\"wp-image-402\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-55-1024x559.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-55-300x164.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-55-768x419.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-55-1536x838.png 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-55-2048x1117.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n<p><!-- \/wp:image --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Key constraint:<\/strong> All five perception sources must be projected into a <em>shared<\/em> embedding space so that distances are comparable across sensors. The pipeline uses <strong>Procrustes alignment<\/strong> to rotate <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vectors into the social subspace via overlapping anchor brands, then <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> projects everything into 2D. 
The system maps the <em>ideas<\/em> behind the words, not just the text \u2014 &#8220;Luxury,&#8221; &#8220;Premium,&#8221; and &#8220;Prestigious&#8221; land in the same neighbourhood.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\ud83d\udca1 <strong>Why this architecture isn&#8217;t obvious.<\/strong> A naive approach would be: embed everything \u2192 <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> \u2192 cluster \u2192 done. The catch is that embedding models encode <em>how<\/em> something is said as much as <em>what<\/em> is said. A <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> narrative generated from &#8220;Helpful (45.2), Reliable (38.1)\u2026&#8221; reads nothing like a <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> perception report, even when both describe the same brand. Without the Procrustes step between &#8220;Ingest&#8221; and &#8220;<a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a>,&#8221; the map would split by <em>text style<\/em> rather than <em>brand perception<\/em>. The arrows in this diagram look linear, but getting Step 3 (Aggregate) right took more iteration than every other step combined.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>2. Setup &amp; running<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><em>Full install and deploy instructions are maintained in the README. 
Below is a summary for orientation.<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Public (GitHub):<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} --><\/p>\n<figure class=\"wp-block-table has-medium-font-size\">\n<table>\n<thead>\n<tr>\n<th>Mode<\/th>\n<th>Description<\/th>\n<th>API keys required?<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Default<\/strong><\/td>\n<td>Run locally with the provided toy dataset (20 brands, 681 reports) or your own dataset in the same CSV format<\/td>\n<td>No<\/td>\n<\/tr>\n<tr>\n<td><strong>Advanced<\/strong><\/td>\n<td>Plug in your own API keys for LLM clustering and embedding models<\/td>\n<td>Yes<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Key dependencies:<\/strong> Python 3.12, <a href=\"https:\/\/docs.lancedb.com\/quickstart\"><code>lancedb<\/code><\/a>, <a href=\"https:\/\/pandas.pydata.org\/\"><code>pandas<\/code><\/a>, <a href=\"https:\/\/numpy.org\/\"><code>numpy<\/code><\/a>, <a href=\"https:\/\/docs.cloud.google.com\/python\/docs\/reference\/aiplatform\/latest\"><code>google-cloud-aiplatform<\/code><\/a>, <a href=\"https:\/\/docs.cloud.google.com\/vertex-ai\/generative-ai\/docs\/sdks\/overview\"><code>google-genai<\/code><\/a>, <a href=\"https:\/\/tqdm.github.io\/\"><code>tqdm<\/code><\/a>, <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\"><code>umap-learn<\/code><\/a>, <a href=\"https:\/\/hdbscan.readthedocs.io\/en\/latest\/\"><code>hdbscan<\/code><\/a>, <a href=\"https:\/\/scikit-learn.org\/stable\/\"><code>scikit-learn<\/code><\/a>, <a href=\"https:\/\/streamlit.io\/\"><code>streamlit<\/code><\/a><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\u26a0\ufe0f <strong>Gotchas for newcomers:<\/strong><\/p>\n<p><!-- 
\/wp:paragraph --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong><a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex AI<\/a> quotas.<\/strong> The embed step hits <a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/embeddings\"><code>gemini-embedding-001<\/code><\/a> in batches of 50. On a fresh <a href=\"https:\/\/cloud.google.com\/\">Google Cloud Platform (GCP)<\/a> project you may get rate-limited at ~60 requests\/min. The pipeline handles retries, but if you see <code>429<\/code> errors, check your <a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex AI<\/a> quota dashboard and request an increase before re-running.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong><a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> is not deterministic.<\/strong> Runs on the same data can produce different 2D layouts unless you pin <code>random_state<\/code>. The pipeline does pin it, but if you fork and forget, your layout and clusters will shift between runs. Local neighbourhood structure is usually similar from run to run, so the broad interpretation tends to hold, but exact coordinates and inter-point distances are not guaranteed to match.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong><a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a> lock files.<\/strong> If a previous run crashed mid-write, <a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a> may leave a <code>.lock<\/code> file that blocks the next run. Delete <code>*.lock<\/code> files in the <a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a> directory if the pipeline hangs on startup.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong><code>uv sync<\/code> vs <code>pip install<\/code>.<\/strong> The project uses <code>uv<\/code> for dependency management. If you install via <code>pip<\/code> instead, <code>hdbscan<\/code> and <code>umap-learn<\/code> can pull conflicting <code>numpy<\/code> versions. Stick with <code>uv sync<\/code>. 
<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>3. Data model<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} --><\/p>\n<figure class=\"wp-block-table has-medium-font-size\">\n<table>\n<thead>\n<tr>\n<th>Table \/ Entity<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>brands<\/code><\/td>\n<td>Master list of <strong>200+<\/strong> brands (internal) or <strong>20 brands<\/strong> (toy dataset) with metadata (industry, country)<\/td>\n<\/tr>\n<tr>\n<td><code>perception_signals<\/code><\/td>\n<td>One row per (brand \u00d7 sensor) \u2014 raw text summaries from each source<\/td>\n<\/tr>\n<tr>\n<td><code>embeddings<\/code><\/td>\n<td>Semantic vector per perception signal \u2014 <a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/embeddings\"><code>gemini-embedding-001<\/code><\/a>, <strong>768 dimensions<\/strong><\/td>\n<\/tr>\n<tr>\n<td><code>umap_projections<\/code><\/td>\n<td>2D coordinates per embedding after <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> reduction<\/td>\n<\/tr>\n<tr>\n<td><code>clusters<\/code><\/td>\n<td>Cluster ID, 3-word label (e.g. &#8220;Hope Innovation Compassion&#8221;), and member brands<\/td>\n<\/tr>\n<tr>\n<td><code>bav_attributes<\/code><\/td>\n<td>48 BAV imagery attribute scores per brand \u00d7 12 audience segments<\/td>\n<\/tr>\n<tr>\n<td><code>consistency_scores<\/code><\/td>\n<td>Omnichannel consistency % per brand (mean distance to centroid across sensors)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><em>The data model above describes what lives in <\/em><a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a><em> after the pipeline runs. 
Each table feeds a different part of the dashboard UI. The key thing to understand: a single brand has multiple rows across these tables (one per sensor, one per audience segment, etc.). The perception map plots one dot per row in <\/em><code>umap_projections<\/code><em>, not one dot per brand.<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Toy dataset format<\/strong> (CSV, used by the public GitHub default mode):<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} --><\/p>\n<figure class=\"wp-block-table has-medium-font-size\">\n<table>\n<thead>\n<tr>\n<th>Column<\/th>\n<th>Description<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>Brand<\/code><\/td>\n<td>Brand name (e.g. Aetherium, Zenith Dynamics)<\/td>\n<\/tr>\n<tr>\n<td><code>Industry<\/code><\/td>\n<td>Industry category (Technology, Automotive, Food &amp; Beverage, Retail, Healthcare, Finance, Entertainment)<\/td>\n<\/tr>\n<tr>\n<td><code>Platform<\/code><\/td>\n<td>Source sensor: TikTok (Brand Known), TikTok (Brand Unknown), Instagram (Brand Known), Instagram (Brand Unknown), Wikipedia, LLM, Survey<\/td>\n<\/tr>\n<tr>\n<td><code>Survey_Audience<\/code><\/td>\n<td>Demographic segment for Survey rows (e.g. Tech Early Adopters, Gen Z &#8211; Gamers); N\/A for non-survey rows<\/td>\n<\/tr>\n<tr>\n<td><code>Brand_Perception_Report<\/code><\/td>\n<td>Free-text perception summary<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>4. Pipeline \/ workflow<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><em>The table below is the quick-reference version. 
Commentary after it explains what&#8217;s actually happening at each stage and where things can go wrong.<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} --><\/p>\n<figure class=\"wp-block-table has-medium-font-size\">\n<table>\n<thead>\n<tr>\n<th>Phase<\/th>\n<th>Key numbers<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>1. Preprocess<\/strong><\/td>\n<td>Instant runtime, merges into 1 unified DataFrame<\/td>\n<\/tr>\n<tr>\n<td><strong>2. Ingest (Embed)<\/strong><\/td>\n<td><a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/embeddings\"><code>gemini-embedding-001<\/code><\/a>, 768 dims, batches of 50 via <a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex<\/a> APIs<\/td>\n<\/tr>\n<tr>\n<td><strong>3. Aggregate<\/strong><\/td>\n<td>Procrustes rotation on anchor brands; <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> <code>n_neighbors=min(15, len-1)<\/code>, <code>min_dist=0.1<\/code>, &lt;5 sec<\/td>\n<\/tr>\n<tr>\n<td><strong>4. Cluster<\/strong><\/td>\n<td><a href=\"https:\/\/hdbscan.readthedocs.io\/en\/latest\/\">HDBSCAN<\/a> (<code>min_cluster_size=3\/5<\/code>). Several minutes via <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> Batch labelling queue<\/td>\n<\/tr>\n<tr>\n<td><strong>5. Consistency<\/strong><\/td>\n<td><code>max(0.0, min(100.0, 100.0 - (mean_dist_to_centroid * 35.0)))<\/code><\/td>\n<\/tr>\n<tr>\n<td><strong>6. BAV join<\/strong><\/td>\n<td><a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.linalg.orthogonal_procrustes.html\">Procrustes alignment<\/a> via rotation matrix on overlapping anchor brands<\/td>\n<\/tr>\n<tr>\n<td><strong>7. 
Atlas UI<\/strong><\/td>\n<td><a href=\"https:\/\/streamlit.io\/\">Streamlit<\/a> (instant runtime)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":4,\"className\":\"is-style-text-subtitle\"} --><\/p>\n<h4 class=\"wp-block-heading is-style-text-subtitle\"><strong>Phase-by-phase commentary<\/strong><\/h4>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Phase 1 \u2014 Preprocess.<\/strong> Deceptively simple: merge CSVs, normalise brand names, filter junk. The hidden complexity is <strong>name matching<\/strong>. <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> uses official corporate names, social media uses colloquial names, and <a href=\"https:\/\/www.wikipedia.org\/\">Wikipedia<\/a> uses yet another variant. <code>preprocess.py<\/code> maintains a manual alias map for this. If you add a new brand and it doesn&#8217;t appear on the map, check the alias map first \u2014 it&#8217;s almost always a name mismatch.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Phase 2 \u2014 Ingest (embed).<\/strong> Each <code>Brand_Perception_Report<\/code> text gets turned into a 768-dimensional vector via <a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/embeddings\"><code>gemini-embedding-001<\/code><\/a>. The critical thing to understand: <em>these vectors encode writing style as much as meaning<\/em>. A <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> report that says &#8220;Helpful (45.2), Reliable (38.1)&#8221; and a <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> report that says &#8220;this brand gives cozy reliable vibes&#8221; will land in <strong>different<\/strong> regions of embedding space even though they describe similar perceptions. 
This is the root cause of the domain shift problem solved in Phase 3.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Phase 3 \u2014 Aggregate (the hard one).<\/strong> This is where most of the iteration happened. Three things occur in sequence:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list {\"ordered\":true} --><\/p>\n<ol class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Social aggregation:<\/strong> Multiple post-level embeddings per (Brand \u00d7 Platform) are averaged into a single vector, and <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> generates a summary report. This smoothing pulls social vectors toward a shared centroid.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Procrustes alignment:<\/strong> The <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vectors are rotated into the social embedding subspace using 202 shared anchor brands (see Section 6 for the full story).<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong><a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> projection:<\/strong> The combined, aligned vectors are reduced to 2D. Only the <code>'All Adults'<\/code> <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> slice is fitted alongside social platforms \u2014 this prevents the 12 <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> demographic segments from dominating the topology.<\/li>\n<p><!-- \/wp:list-item --><\/ol>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\ud83d\udd2c <strong>Why &#8220;balanced subset fit&#8221; matters.<\/strong> <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> has 12 audience segments per brand. Social has ~1\u20134 data points per brand. 
Without balancing, <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> sees up to 12\u00d7 more <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> points than social points and builds its neighbourhood graph around <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> structure, marginalising social data. The fix: fit <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> on the balanced subset (All Adults + social), then <em>transform<\/em> the remaining 11 <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> segments passively, without refitting. This was a non-obvious but critical design choice.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Phase 4 \u2014 Cluster.<\/strong> <a href=\"https:\/\/hdbscan.readthedocs.io\/en\/latest\/\">HDBSCAN<\/a> groups nearby points into perception themes, then <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> labels each cluster with exactly 3 words. <code>min_cluster_size<\/code> is set to 3 (toy dataset) or 5 (full dataset). The batch labelling step submits each cluster&#8217;s centroid plus the 5 reports most similar to it to <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a>. This can take several minutes because it goes through the <a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex AI<\/a> Batch queue \u2014 don&#8217;t assume it has hung.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Phase 5 \u2014 Consistency.<\/strong> A simple but effective metric: for each brand, compute the mean Euclidean distance from each sensor&#8217;s point to the brand&#8217;s centroid, then invert and scale. Brands where all sensors agree (Brand D, Brand E) score 99%+. Brands with platform-dependent perception (Brand G) score much lower. 
The <strong>* 35.0<\/strong> scaling factor was empirically tuned to spread scores across a useful range.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Phase 6 \u2014 <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> join.<\/strong> Brings in the raw 48-attribute <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> scores per demographic segment. These are the structured numbers (not the <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a>-generated prose) and power the &#8220;Survey Audience&#8221; filter in the dashboard.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Phase 7 \u2014 Atlas UI.<\/strong> <a href=\"https:\/\/streamlit.io\/\">Streamlit<\/a> renders everything from <a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a>. Instant startup because all computation was done in previous phases.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>5. 
Atlas interface (primary interactions)<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Focus brand selector<\/strong> \u2192 perception map + sidebar with cluster label and per-sensor summaries<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Survey audience filter (<a href=\"https:\/\/wppbav.com\/\">BAV<\/a>)<\/strong> \u2192 12 demographic segments<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Number of neighbours slider<\/strong> \u2192 controls how many perceptual neighbours are shown on the map<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Reference platform selector<\/strong> \u2192 changes cross-modal overlap anchor (<a href=\"https:\/\/www.wikipedia.org\/\">Wikipedia<\/a>, <a href=\"https:\/\/wppbav.com\/\">BAV<\/a>, etc.)<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Competitor set toggle<\/strong> \u2192 <code>Show unexpected neighbours<\/code> (out-of-industry brands)<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\ud83c\udfaf <strong>What to look for when using the Atlas.<\/strong> The most interesting insights come from <em>disagreements<\/em> between sensors. If a brand&#8217;s <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> dot and <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> dot are far apart, that&#8217;s a signal: the structured survey perception (what people <em>say<\/em> when asked directly) differs from the organic social perception (what people <em>actually talk about<\/em>). The Brand vs Content Effect tab adds another layer \u2014 when you hide the brand name from social content, does the perception shift? 
If so, the brand&#8217;s reputation is doing heavy lifting independent of the product itself.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>6. Domain-specific mechanics<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\">6.1 Why <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> is &#8220;ground truth&#8221;<\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><a href=\"https:\/\/wppbav.com\/\">BAV<\/a> is survey-based: 48 structured imagery attributes, 30+ years of longitudinal data, and 12 demographic segments. It captures deep-seated beliefs that are largely insulated from daily social flux. Social algorithms change daily, whereas survey-based trait grids collected over decades reflect structured cognitive associations that shift far more slowly than current hype cycles.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\ud83d\udcd6 <strong>For non-specialists:<\/strong> WPP&#8217;s <a href=\"https:\/\/wppbav.com\/\">Brand Asset Valuator\u00ae (BAV)<\/a> is one of the largest brand research databases in the world. It works by asking thousands of consumers to rate brands on 48 specific attributes \u2014 things like &#8220;Helpful,&#8221; &#8220;Innovative,&#8221; &#8220;Trustworthy&#8221; \u2014 scored on a numeric scale. Because the same questions are asked year after year across demographic segments, <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> gives you a stable, structured snapshot of how people <em>think<\/em> about a brand when prompted. Social media gives you what people <em>spontaneously say<\/em>. 
These are fundamentally different signals, and combining them is the core challenge of this project.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\">6.2 The <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> alignment problem &#8211; a full account<\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><em>This section documents the central technical challenge of the Atlas and the iterative process that solved it.<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":4,\"className\":\"is-style-default\"} --><\/p>\n<h4 class=\"wp-block-heading is-style-default\">The problem<\/h4>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>When the Atlas was first built, the <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> perception map split cleanly down the middle: all <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> dots on the left, all social dots on the right, regardless of whether they described the exact same brand. Brand A&#8217;s <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> point and Brand A&#8217;s <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> point would be in completely different regions of the map. This made the entire visualisation useless for cross-platform comparison.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The separation was <strong>not<\/strong> evidence that <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> survey data captures genuinely different brand perceptions from social media. 
It was a <strong>methodological artefact<\/strong> caused by two compounding issues in the pipeline.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":4} --><\/p>\n<h4 class=\"wp-block-heading\">Root cause: different text domains fed to the same embedding model<\/h4>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>All platforms are embedded with the same model (<a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/embeddings\"><code>gemini-embedding-001<\/code><\/a>), but the text being embedded is fundamentally different in style, vocabulary, and structure:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} --><\/p>\n<figure class=\"wp-block-table has-medium-font-size\">\n<table>\n<thead>\n<tr>\n<th>Platform<\/th>\n<th><code>Brand_Perception_Report<\/code> content<\/th>\n<th>Source<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><a href=\"https:\/\/wppbav.com\/\"><strong>BAV<\/strong><\/a><\/td>\n<td>LLM-generated narrative from 48 numerical imagery sensors, e.g. 
<em>&#8220;Helpful (45.2), Reliable (38.1)\u2026&#8221;<\/em> \u2192 Gemini prompt \u2192 prose paragraph<\/td>\n<td><code>analysis.py<\/code>, <code>preprocess.py<\/code><\/td>\n<\/tr>\n<tr>\n<td><strong><a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> \/ <a href=\"https:\/\/www.instagram.com\/\">Instagram<\/a><\/strong><\/td>\n<td>LLM-generated perception report from watching a single video\/post<\/td>\n<td><code>preprocess.py<\/code><\/td>\n<\/tr>\n<tr>\n<td><a href=\"https:\/\/www.wikipedia.org\/\"><strong>Wikipedia<\/strong><\/a><\/td>\n<td>LLM-generated perception from <a href=\"https:\/\/www.wikipedia.org\/\">Wikipedia<\/a> article text<\/td>\n<td><code>preprocess.py<\/code><\/td>\n<\/tr>\n<tr>\n<td><strong>LLM (<a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a>)<\/strong><\/td>\n<td>Direct LLM perception (<a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> asked &#8220;what do you think of brand X?&#8221;)<\/td>\n<td>Same as Wiki source file<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> text originates from a <strong>double LLM transformation<\/strong>: raw survey numbers \u2192 <code>generate_semantic_statement()<\/code> (structured string like <em>&#8220;Full <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> Imagery Profile (48 Sensors): Helpful (45.2), Reliable (38.1), \u2026&#8221;<\/em>) \u2192 <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> prompt \u2192 narrative paragraph. The social text comes from a <strong>single LLM step<\/strong> interpreting raw video\/post content directly.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>This means the embedding model sees completely different linguistic distributions for <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vs social. 
The <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> narratives share a common templated style (always referencing &#8220;imagery sensors,&#8221; &#8220;quantitative data,&#8221; survey language) while social narratives use informal, media-oriented language. <strong>Embedding models encode <em>how<\/em> something is said as much as <em>what<\/em> is said<\/strong>, so this systematic style difference pushes <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vectors into a distinct cluster regardless of actual brand perception agreement.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":4} --><\/p>\n<h4 class=\"wp-block-heading\">Compounding factor 1: aggregation asymmetry<\/h4>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>In <code>aggregate.py<\/code>, social data (<a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a>, <a href=\"https:\/\/www.instagram.com\/\">Instagram<\/a>) has multiple post-level embeddings that are <strong>mean-averaged<\/strong> per (Brand, Platform), and a <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> summary replaces the report text. This smoothing pulls social vectors toward a shared centroid. <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> \/ <a href=\"https:\/\/www.wikipedia.org\/\">Wikipedia<\/a> \/ <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> data (<code>df_research<\/code>) passes through as-is with <code>post_count = 1<\/code> \u2014 no averaging occurs.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>The result: social embeddings are inherently more &#8220;central&#8221; (mean-regression effect), while <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> embeddings retain their full individual variance. 
When <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> runs on this combined set, the social vectors cluster tighter and the un-averaged <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vectors spread out differently.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":4} --><\/p>\n<h4 class=\"wp-block-heading\">Compounding factor 2: UMAP sees the domain gap<\/h4>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> is run on the entire combined dataset with <code>n_neighbors=15<\/code>. Because <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> embeddings share a systematic style signature different from social embeddings, <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a>&#8216;s neighbourhood graph naturally groups them apart \u2014 it finds the <strong>text-style cluster<\/strong>, not a genuine <strong>perception cluster<\/strong>.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":4} --><\/p>\n<h4 class=\"wp-block-heading\">What we tried (in order)<\/h4>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\ud83d\udd2c <strong>Option A \u2014 normalise the text domain (first attempt).<\/strong> Rewrote the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> perception report generation prompt in <code>analysis.py<\/code> to produce output that mimics the style of a social\/<a href=\"https:\/\/www.wikipedia.org\/\">Wikipedia<\/a> perception report. Specifically: removed references to &#8220;<a href=\"https:\/\/wppbav.com\/\">BAV<\/a>,&#8221; &#8220;imagery sensors,&#8221; &#8220;quantitative data&#8221; from the prompt. Used the same persona\/format instructions as the social platform reports \u2014 describing <em>what the brand feels like<\/em> rather than referencing the data source. 
The prompt became: <em>&#8220;Based on the following consumer perception data, write a concise paragraph describing how this brand is perceived by consumers. Focus on: overall vibe, what people praise, what people criticise, and who the typical customer is. Write as if describing public perception \u2014 do not reference the data source or format.&#8221;<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Result:<\/strong> Reduced but <strong>did not eliminate<\/strong> the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a>\/social separation. The templated numerical origin still leaked through in subtle ways.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\u2705<strong>Option B \u2014 Procrustes alignment (the solution).<\/strong> Use <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.linalg.orthogonal_procrustes.html\"><code>scipy.linalg.orthogonal_procrustes<\/code><\/a> to align the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> embedding subspace to the social subspace before combining. This preserves within-platform structure while removing the cross-platform domain shift. 
<strong>This is what the pipeline uses today.<\/strong><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><em>Option C (embed a standardised perception schema across all platforms) remains a potential future improvement but requires significantly more work.<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\u274c <strong><s>Option D \u2014 Per-platform z-score normalisation (ruled out).<\/s><\/strong> Apply per-platform z-score normalisation to embedding vectors before <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a>, centring each platform&#8217;s distribution to zero-mean and unit-variance. This would remove the systematic offset but also <strong>mask any genuine platform-level differences<\/strong> \u2014 making it a workaround, not a proper fix. Discarded.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":4} --><\/p>\n<h4 class=\"wp-block-heading\">How Procrustes alignment works in the pipeline<\/h4>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\ud83d\udcd6 <strong>For non-specialists:<\/strong> Procrustes alignment is a mathematical technique from shape analysis, named after a figure in Greek mythology who stretched or cut people to fit his bed. In our context, nothing is stretched or cut: the technique rotates one cloud of data points to best overlap with another. Critically, the orthogonal variant used here applies only a <strong>rotation<\/strong> (spinning, possibly with a mirror flip) after <strong>centring<\/strong> both clouds, so it doesn&#8217;t distort the internal relationships between points within each cloud. 
So the relative positions of <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> brands among themselves are preserved, but the entire <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> cloud is repositioned to overlap with the social cloud.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Here&#8217;s exactly what the code in <code>aggregate.py<\/code> does:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list {\"ordered\":true} --><\/p>\n<ol class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Finds the anchors.<\/strong> Identifies every brand that exists in <em>both<\/em> the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> data and the social data (e.g., Brand A exists in both). In the production pipeline log, this found <strong>202 anchor brands<\/strong>.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Computes the transformation.<\/strong> For each anchor brand, computes the mean social vector across all its social platforms, centres both the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> anchor vectors and the social anchor vectors, and then uses <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.linalg.orthogonal_procrustes.html\"><code>scipy.linalg.orthogonal_procrustes<\/code><\/a> to find the optimal <strong>rotation matrix R<\/strong> that maps <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> \u2192 social.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Applies the rotation.<\/strong> Multiplies <em>all<\/em> <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vectors (even brands that didn&#8217;t have social data) by this rotation matrix, moving the entire <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> dataset into the social embedding subspace.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Logs quality.<\/strong> Reports the Frobenius residual so alignment 
quality is traceable.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Safeguard.<\/strong> If fewer than 10 shared brands exist, alignment is skipped with a warning \u2014 Procrustes is unreliable with too few anchors.<\/li>\n<p><!-- \/wp:list-item --><\/ol>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:heading {\"level\":4} --><\/p>\n<h4 class=\"wp-block-heading\">How this changes interpretation of the dashboard<\/h4>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>This alignment changes what you can conclude from the perception map:<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>True cross-platform comparisons.<\/strong> If a <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> dot and a <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> dot for &#8220;Brand B&#8221; sit right next to each other, it now genuinely implies that the core sentiment in the structured survey data closely matches the organic social conversations. Before Procrustes, proximity between <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> and social points was meaningless.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Distances have semantic meaning.<\/strong> With the structural domain shift removed, the remaining distance between two points chiefly reflects a difference in <em>meaning and perception<\/em>. 
If the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> point for a brand is far from its Instagram point, you can confidently analyse that gap as a genuine difference in audience perception or marketing strategy \u2014 not just an artefact of different text formatting.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Unified clustering.<\/strong> When <a href=\"https:\/\/hdbscan.readthedocs.io\/en\/latest\/\">HDBSCAN<\/a> runs over this aligned space, it can finally cluster <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> reports together with social reports. The LLM-generated theme labels now encompass insights drawn from both quantitative surveys and viral videos simultaneously.<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\">6.3 UMAP parameter sensitivity<\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><code>min_dist=0.1<\/code>, <code>n_components=2<\/code>, <code>metric='cosine'<\/code>. <strong>Balanced subset fit<\/strong>: only <code>'All Adults'<\/code> <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> slice is fitted alongside social platforms, preventing 12 <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> audiences from overpowering topology. <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> was chosen over t-SNE because it enables saving the <code>reducer<\/code> object \u2014 the pipeline strictly <em>fits<\/em> on one balanced subset and passively <em>transforms<\/em> newly injected demographics.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\u26a0\ufe0f <strong>Watch out:<\/strong> Changing <code>min_dist<\/code> has outsized effects on the map. Lower values (e.g., 0.01) create tighter, more separated clusters \u2014 visually dramatic but can split genuinely related brands. 
Higher values (e.g., 0.5) spread everything into a uniform blob. The current <code>0.1<\/code> was chosen as a balance after visual inspection across multiple brand sets. If you change it, re-check whether brands with known perceptual similarity (e.g., brands in the same industry) still land in the same neighbourhood.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\">6.4 Cluster labelling<\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Automated via LLM. <a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex AI<\/a> Batch submits centroid + top 5 reports (by cosine similarity) to <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> \u2192 <strong>exactly 3 words<\/strong>. The 3-word constraint forces abstraction \u2014 &#8220;Hope Innovation Compassion&#8221; rather than a paragraph. If labels feel wrong, the issue is almost always that the cluster itself is incoherent (check <a href=\"https:\/\/hdbscan.readthedocs.io\/en\/latest\/\">HDBSCAN<\/a>&#8216;s <code>min_cluster_size<\/code>), not that the LLM mislabelled it.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\">6.5 Omnichannel consistency<\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><code>100.0 - (mean Euclidean distance * 35.0)<\/code>, clamped 0\u2013100%. Tight overlaps hit 99%+ (Brand D, Brand E), dispersed shifts drop fast (Brand G). 
The <code>* 35.0<\/code> multiplier is an empirically tuned scaling factor \u2014 if you add new sensors or change the embedding model, you may need to recalibrate it so scores distribute meaningfully across 0\u2013100%.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading {\"level\":3} --><\/p>\n<h3 class=\"wp-block-heading\">6.6 Content vs brand effect methodology<\/h3>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>Tracks <code>shift_2d<\/code> (Euclidean magnitude), <code>cos_shift<\/code> (cosine diff between brand known\/brand unknown), and <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> baseline deltas (<code>bav_delta_known<\/code> vs <code>bav_delta_unk<\/code>). Parses exact LLM cluster words added\/lost due to brand awareness.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\ud83d\udca1 <strong>Why this matters.<\/strong> Social media perception reports are generated from video\/post content. When the brand name is visible, the LLM&#8217;s perception is coloured by everything it &#8220;knows&#8221; about that brand. When the brand name is hidden, the LLM can only react to what it actually sees in the content. The delta between these two tells you how much of a brand&#8217;s social perception is driven by <em>brand reputation<\/em> vs <em>actual content quality<\/em>. Large shifts indicate the brand name is doing heavy lifting.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>7. Module map<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><em>The codebase is intentionally small. Every module does one thing. 
If you&#8217;re debugging, start by identifying which phase failed (check the CLI output), then go straight to the corresponding file.<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:code --><\/p>\n<pre class=\"wp-block-code\"><code>brand_perception\/dashboard\/atlas_pipeline\/\n\u251c\u2500\u2500 main.py              (43 lines)    CLI entry point\n\u251c\u2500\u2500 dashboard_v1.py      (~1133 lines) Core Streamlit frontend\n\u2514\u2500\u2500 src\/pipeline\/\n    \u251c\u2500\u2500 preprocess.py    (130 lines)   Sanitizes, normalises, filters into LanceDB schemas\n    \u251c\u2500\u2500 ingest.py        (126 lines)   GenAI models \u2192 768-D embeddings\n    \u251c\u2500\u2500 aggregate.py     (303 lines)   Procrustes, LLM reports, UMAP layouts\n    \u2514\u2500\u2500 cluster.py       (244 lines)   HDBSCAN groups + batch cluster labels\n<\/code><\/pre>\n<p><!-- \/wp:code --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>Total:<\/strong> ~1,979 lines across 6 modules.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\ud83d\udcd6 <strong>Where the complexity lives.<\/strong> The line counts understate the complexity. <code>aggregate.py<\/code> at 303 lines is where 80% of the intellectual difficulty sits \u2014 it handles social aggregation, Procrustes alignment, and <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> projection. <code>dashboard_v1.py<\/code> at ~1,133 lines is the largest file but is mostly <a href=\"https:\/\/streamlit.io\/\">Streamlit<\/a> layout code. If you&#8217;re onboarding, read <code>aggregate.py<\/code> first; it&#8217;s where the science meets the engineering.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>8. 
Test coverage<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>~6 functional \/ integration tests. Runtime &lt;1 minute.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} --><\/p>\n<figure class=\"wp-block-table has-medium-font-size\">\n<table>\n<thead>\n<tr>\n<th>Test file<\/th>\n<th>Purpose<\/th>\n<th>Count<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><code>scripts\/test_bav_pipeline.py<\/code><\/td>\n<td>BAV baseline ingestion flow<\/td>\n<td>1<\/td>\n<\/tr>\n<tr>\n<td><code>dev\/test_apify*<\/code><\/td>\n<td>Social Scraper (TikTok\/Instagram)<\/td>\n<td>2<\/td>\n<\/tr>\n<tr>\n<td><code>brand_perception\/api\/test_agent.py<\/code><\/td>\n<td>Job Orchestrator backend queues<\/td>\n<td>2<\/td>\n<\/tr>\n<tr>\n<td><code>research\/test_scrape_jh.py<\/code><\/td>\n<td>Manual methodology mockups<\/td>\n<td>1<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>9. 
Design decisions<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><em>These are the answers to questions that came up during development where the wrong choice would have broken the system.<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"small\"} --><\/p>\n<figure class=\"wp-block-table has-small-font-size\">\n<table>\n<thead>\n<tr>\n<th>ID<\/th>\n<th>Decision<\/th>\n<th>Rationale<\/th>\n<th>What happens if you reverse it<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>DD-BPA-1<\/td>\n<td>UMAP over t-SNE<\/td>\n<td>Better distance proportionality; enables saving <code>reducer<\/code> to <em>fit<\/em> on balanced subset and <em>transform<\/em> new demographics<\/td>\n<td>t-SNE can&#8217;t transform new points \u2014 you&#8217;d have to re-run the entire projection every time a new BAV demographic segment is added, and distances between clusters become meaningless<\/td>\n<\/tr>\n<tr>\n<td>DD-BPA-2<\/td>\n<td>BAV as ground-truth anchor<\/td>\n<td>Survey-based trait grids over decades, shielded from daily social hype<\/td>\n<td>Using social as ground truth would anchor the map to volatile, algorithm-dependent signals \u2014 the map would shift with every TikTok trend cycle<\/td>\n<\/tr>\n<tr>\n<td>DD-BPA-3<\/td>\n<td>Semantic embeddings over keywords<\/td>\n<td>Captures meaning (&#8220;Luxury&#8221; \u2248 &#8220;Premium&#8221; \u2248 &#8220;Prestigious&#8221;)<\/td>\n<td>Keyword-based approaches treat &#8220;Luxury&#8221; and &#8220;Premium&#8221; as unrelated tokens \u2014 brands described with different vocabulary but identical perception would never cluster together<\/td>\n<\/tr>\n<tr>\n<td>DD-BPA-4<\/td>\n<td>Procrustes alignment<\/td>\n<td>Solves text heterogeneity (surveys vs social) via rotation on anchor brands. 
Prompt normalisation alone (Option A) reduced but did not eliminate the domain gap<\/td>\n<td>Without it, the map splits by text style (BAV left, social right) rather than by brand perception \u2014 see Section 6.2 for the full account<\/td>\n<\/tr>\n<tr>\n<td>DD-BPA-5<\/td>\n<td>Local-first public release (no API keys required)<\/td>\n<td>Lowers barrier to entry; toy dataset enables immediate exploration without infrastructure<\/td>\n<td>Requiring API keys upfront would prevent most people from ever trying the tool \u2014 the toy dataset lets someone see the full Atlas UI in under 5 minutes<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/figure>\n<p><!-- \/wp:table --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>10. Extending the system<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><em>The pipeline was designed to be extended. Each of these is a realistic next step, listed in increasing order of effort.<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:list {\"ordered\":true} --><\/p>\n<ol class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Run with custom data<\/strong> \u2014 Format your own dataset as CSV matching the toy dataset schema (Brand, Industry, Platform, Survey_Audience, Brand_Perception_Report), drop it into the data directory, and run default mode. This is the lowest-effort way to test the Atlas on a new domain.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Add a sensor<\/strong> \u2014 Collect as CSV\/Parquet, import in <code>preprocess.py<\/code> under <code>FINAL_COLUMN_ORDER<\/code> (<code>Super_Platform<\/code>, <code>Year<\/code>, <code>Brand<\/code>, <code>Raw_Text<\/code>), run <code>main.py<\/code>. 
The Procrustes alignment will automatically include the new sensor in its anchor calculation if the new sensor shares brands with existing sources.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Add a market<\/strong> \u2014 Reroll <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> datasets in <code>.\/paths<\/code>, override GCS env vars, rerun batches. Note that Procrustes alignment quality depends on having enough shared anchor brands between <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> and social data \u2014 if you enter a market where <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> coverage is thin, check the Frobenius residual in the logs.<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Temporal tracking<\/strong> \u2014 <a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a> already stores <code>Year<\/code> and <code>BAV_Study<\/code>; add slide toggle in <code>dashboard_v1.py<\/code>. This would let you see how a brand&#8217;s perception drifts over time across sensors \u2014 one of the most requested features.<\/li>\n<p><!-- \/wp:list-item --><\/ol>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\u26a0\ufe0f <strong>If you add a new sensor:<\/strong> remember that the Procrustes alignment currently rotates <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> into the social subspace specifically. If your new sensor has a similarly distinct text style (e.g., Reddit comments vs <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> captions), you may see a new domain gap. 
In that case, consider extending the alignment step to handle multiple source-target pairs, or grouping sensors into &#8220;formal&#8221; and &#8220;informal&#8221; categories for alignment.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\"><strong>11. Results (end-to-end validation)<\/strong><\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:list --><\/p>\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Internal coverage:<\/strong> 200+ brands, 4,000+ data points across 5 modalities, 12 demographic segments<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Public toy dataset:<\/strong> 20 brands, 681 perception reports across 7 industries, all 5 sensor types + brand known\/brand unknown variants<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Cross-industry insight validation:<\/strong><!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item --><\/p>\n<li><strong>Omnichannel consistency:<\/strong> Brand D, Brand E at 99%+; Brand G identified as multi-faceted<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Shared equity, different vibe:<\/strong> Brand H \u2194 Brand G (close on <a href=\"https:\/\/wppbav.com\/\">BAV<\/a>, far on socials)<\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Different equity, shared vibe:<\/strong> Brand K \u2194 Brand L (far on <a href=\"https:\/\/wppbav.com\/\">BAV<\/a>, converged on socials)<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/li>\n<p><!-- \/wp:list-item --><\/p>\n<p><!-- wp:list-item --><\/p>\n<li><strong>Validation metrics:<\/strong> Procrustes Residuals for subspace overlap + brand known\/brand unknown cosine similarity differentials<\/li>\n<p><!-- \/wp:list-item --><\/ul>\n<p><!-- \/wp:list --><\/p>\n<p><!-- wp:paragraph 
--><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p>\ud83d\udca1<strong>How to read these results.<\/strong> The &#8220;shared equity, different vibe&#8221; and &#8220;different equity, shared vibe&#8221; patterns are the most commercially interesting findings. They reveal cases where a brand&#8217;s formal positioning (<a href=\"https:\/\/wppbav.com\/\">BAV<\/a>) disagrees with its organic social presence \u2014 exactly the kind of insight that&#8217;s invisible to either data source alone. The Atlas&#8217;s value proposition is making these cross-modal disagreements visible and quantifiable.<\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:separator --><\/p>\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator --><\/p>\n<p><!-- wp:heading --><\/p>\n<h2 class=\"wp-block-heading\">\ud83d\udd17 Repositories<\/h2>\n<p><!-- \/wp:heading --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><strong>GitHub<\/strong>: <em>(GitHub repo coming soon)<\/em><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n<p><!-- wp:paragraph --><\/p>\n<p><!-- \/wp:paragraph --><\/p>\n","related_pods":[120],"content_quarter":"Q1 2026"},"research_categories":[],"raw_acf":{"content":"<!-- wp:paragraph -->\n<p><em>Brand leaders today are navigating without a map. Social media says one thing, surveys say another, and AI-generated content adds yet another layer, leaving strategists with no reliable way to know whether their brand\u2019s identity is consistent or not across channels. The Brand Perception Atlas solves this by bringing together five different sources of brand perception, TikTok, Instagram, Wikipedia, AI-generated summaries, and WPP\u2019s 30-year Brand Asset Valuator\u00ae (BAV) survey, into a single, visual map covering over 200 brands and 4,000+ data points. By placing all of these different signals on one comparable scale, it makes it possible for the first time to see how a brand looks across platforms side by side. 
Several notable patterns emerge from the analysis, revealing that some brands, e.g. Brand D, an industrial equipment manufacturer, maintain a unified identity everywhere. Others, e.g. Brand G, a global hospitality company, occupy distinct perceptual territories across platforms, which may reflect a deliberate strategy to engage different audiences. The study also identifies \"shared equity\" pairs, revealing brands in different industries that perform the same emotional job for consumers, despite looking different online.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>If you don't care about the technical details, read <a href=\"https:\/\/research.wpp.com\/blog\/brand-perception-atlas-mapping-the-modern-brand-from-social-signal-to-core-equity\">our blog post<\/a> instead. The GitHub repo is also coming soon.<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":1} -->\n<h1 class=\"wp-block-heading\">The Brand Perception Atlas - a technical deep dive<\/h1>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>The <em>Brand Perception Atlas<\/em> is an interactive decision-support tool that helps brand teams understand, compare, and explain brand perception across platforms. It combines embedding-space visualisation (UMAP) with interpretable clusters and cross-platform consistency scoring. The result is a tool that moves teams from \"interesting maps\" to recommendations that are better documented and easier to explain, because they link back to specific underlying perception signals and clearly show where sources agree or disagree. These can be used in brand reviews, competitor analysis, and campaign planning.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><em>Goal: a newcomer can reproduce the core outcomes in 2\u20133 days. 
For full setup and running instructions, refer to the GitHub README <em>(GitHub repo coming soon)<\/em> \u2014 this walkthrough provides the conceptual map and domain knowledge needed to understand what the code does and why.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\ud83e\udded <strong>A note on what made this hard.<\/strong> The <em>Brand Perception Atlas<\/em> looks deceptively simple \u2014 embed text, project it, cluster it, display it. In reality, the single hardest problem was getting <strong>five fundamentally different perception sources<\/strong> to coexist in a shared space where distances actually mean something. WPP <a href=\"https:\/\/wppbav.com\/\">Brand Asset Valuator\u00ae (BAV)<\/a> data comes from structured survey scores transformed through an LLM (<a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a>) into prose. Social data comes from raw video transcripts and post captions, also transformed through an LLM (<a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a>). Even though both pass through the same embedding model, the <em>linguistic fingerprint<\/em> of each source dominates the resulting vectors \u2014 the map would split cleanly into \"survey-sounding text\" vs \"social-sounding text\" rather than grouping brands by actual perception. Solving this required iterating through prompt normalisation, aggregation rebalancing, and ultimately <strong>Procrustes alignment<\/strong> \u2014 a technique borrowed from shape analysis that rotates one embedding subspace onto another using shared anchor brands. Section 6 tells this story in full.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>1. 
Architecture overview<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:image {\"id\":402,\"sizeSlug\":\"large\",\"linkDestination\":\"none\"} -->\n<figure class=\"wp-block-image size-large\"><img src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/03\/image-55-1024x559.png\" alt=\"\" class=\"wp-image-402\"\/><\/figure>\n<!-- \/wp:image -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Key constraint:<\/strong> All five perception sources must be projected into a <em>shared<\/em> embedding space so that distances are comparable across sensors. The pipeline uses <strong>Procrustes alignment<\/strong> to rotate <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vectors into the social subspace via overlapping anchor brands, then <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> projects everything into 2D. The system maps the <em>ideas<\/em> behind the words, not just the text \u2014 \"Luxury,\" \"Premium,\" and \"Prestigious\" land in the same neighbourhood.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\ud83d\udca1 <strong>Why this architecture isn't obvious.<\/strong> A naive approach would be: embed everything \u2192 <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> \u2192 cluster \u2192 done. The catch is that embedding models encode <em>how<\/em> something is said as much as <em>what<\/em> is said. A <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> narrative generated from \"Helpful (45.2), Reliable (38.1)\u2026\" reads nothing like a <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> perception report, even when both describe the same brand. 
Without the Procrustes step between \"Ingest\" and \"<a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a>,\" the map would split by <em>text style<\/em> rather than <em>brand perception<\/em>. The arrows in this diagram look linear, but getting Step 3 (Aggregate) right took more iteration than every other step combined.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>2. Setup &amp; running<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><em>Full install and deploy instructions are maintained in the README. Below is a summary for orientation.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Public (GitHub):<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} -->\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th>Mode<\/th><th>Description<\/th><th>API keys required?<\/th><\/tr><\/thead><tbody><tr><td><strong>Default<\/strong><\/td><td>Run locally with the provided toy dataset (20 brands, 681 reports) or your own dataset in the same CSV format<\/td><td>No<\/td><\/tr><tr><td><strong>Advanced<\/strong><\/td><td>Plug in your own API keys for LLM cluster labelling and embedding models<\/td><td>Yes<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p><strong>Key dependencies:<\/strong> Python 3.12, <a href=\"https:\/\/docs.lancedb.com\/quickstart\"><code>lancedb<\/code><\/a>, <a href=\"https:\/\/pandas.pydata.org\/\"><code>pandas<\/code><\/a>, <a href=\"https:\/\/numpy.org\/\"><code>numpy<\/code><\/a>, <a href=\"https:\/\/docs.cloud.google.com\/python\/docs\/reference\/aiplatform\/latest\"><code>google-cloud-aiplatform<\/code><\/a>, <a href=\"https:\/\/docs.cloud.google.com\/vertex-ai\/generative-ai\/docs\/sdks\/overview\"><code>google-genai<\/code><\/a>, <a 
href=\"https:\/\/tqdm.github.io\/\"><code>tqdm<\/code><\/a>, <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\"><code>umap-learn<\/code><\/a>, <a href=\"https:\/\/hdbscan.readthedocs.io\/en\/latest\/\"><code>hdbscan<\/code><\/a>, <a href=\"https:\/\/scikit-learn.org\/stable\/\"><code>scikit-learn<\/code><\/a>, <a href=\"https:\/\/streamlit.io\/\"><code>streamlit<\/code><\/a><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\u26a0\ufe0f <strong>Gotchas for newcomers:<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong><a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex AI<\/a> quotas.<\/strong> The embed step hits <a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/embeddings\"><code>gemini-embedding-001<\/code><\/a> in batches of 50. On a fresh <a href=\"https:\/\/cloud.google.com\/\">Google Cloud Platform (GCP)<\/a> project you may get rate-limited at ~60 requests\/min. The pipeline handles retries, but if you see <code>429<\/code> errors, check your <a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex AI<\/a> quota dashboard and request an increase before re-running.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong><a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> is not deterministic by default.<\/strong> Runs with the same data can produce slightly different 2D layouts unless you pin <code>random_state<\/code>. The pipeline does pin it, but if you fork and forget, your clusters will shift between runs. The broad neighbourhood structure is usually stable across runs, so the overall interpretation should not change, but exact coordinates and borderline cluster memberships can.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong><a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a> lock files.<\/strong> If a previous run crashed mid-write, <a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a> may leave a <code>.lock<\/code> file that blocks the next run. 
Delete <code>*.lock<\/code> files in the <a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a> directory if the pipeline hangs on startup.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong><code>uv sync<\/code> vs <code>pip install<\/code>.<\/strong> The project uses <code>uv<\/code> for dependency management. If you install via <code>pip<\/code> instead, <code>hdbscan<\/code> and <code>umap-learn<\/code> can pull conflicting <code>numpy<\/code> versions. Stick with <code>uv sync<\/code>.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>3. Data model<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} -->\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th>Table \/ Entity<\/th><th>Description<\/th><\/tr><\/thead><tbody><tr><td><code>brands<\/code><\/td><td>Master list of <strong>200+<\/strong> brands (internal) or <strong>20 brands<\/strong> (toy dataset) with metadata (industry, country)<\/td><\/tr><tr><td><code>perception_signals<\/code><\/td><td>One row per (brand \u00d7 sensor) \u2014 raw text summaries from each source<\/td><\/tr><tr><td><code>embeddings<\/code><\/td><td>Semantic vector per perception signal \u2014 <a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/embeddings\"><code>gemini-embedding-001<\/code><\/a>, <strong>768 dimensions<\/strong><\/td><\/tr><tr><td><code>umap_projections<\/code><\/td><td>2D coordinates per embedding after <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> reduction<\/td><\/tr><tr><td><code>clusters<\/code><\/td><td>Cluster ID, 3-word label (e.g. 
\"Hope Innovation Compassion\"), and member brands<\/td><\/tr><tr><td><code>bav_attributes<\/code><\/td><td>48 BAV imagery attribute scores per brand \u00d7 12 audience segments<\/td><\/tr><tr><td><code>consistency_scores<\/code><\/td><td>Omnichannel consistency % per brand (mean distance to centroid across sensors)<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><em>The table above describes what lives in <\/em><a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a><em> after the pipeline runs. Each table feeds a different part of the dashboard UI. The key thing to understand: a single brand has multiple rows across these tables (one per sensor, one per audience segment, etc.). The perception map plots one dot per row in <\/em><code>umap_projections<\/code><em>, not one dot per brand.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Toy dataset format<\/strong> (CSV, used by public GitHub default mode):<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} -->\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th>Column<\/th><th>Description<\/th><\/tr><\/thead><tbody><tr><td><code>Brand<\/code><\/td><td>Brand name (e.g. Aetherium, Zenith Dynamics)<\/td><\/tr><tr><td><code>Industry<\/code><\/td><td>Industry category (Technology, Automotive, Food &amp; Beverage, Retail, Healthcare, Finance, Entertainment)<\/td><\/tr><tr><td><code>Platform<\/code><\/td><td>Source sensor: TikTok (Brand Known), TikTok (Brand Unknown), Instagram (Brand Known), Instagram (Brand Unknown), Wikipedia, LLM, Survey<\/td><\/tr><tr><td><code>Survey_Audience<\/code><\/td><td>Demographic segment for Survey rows (e.g. 
Tech Early Adopters, Gen Z - Gamers); N\/A for non-survey<\/td><\/tr><tr><td><code>Brand_Perception_Report<\/code><\/td><td>Free-text perception summary<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>4. Pipeline \/ workflow<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><em>The table below is the quick-reference version. Commentary after it explains what's actually happening at each stage and where things can go wrong.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} -->\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th>Phase<\/th><th>Key numbers<\/th><\/tr><\/thead><tbody><tr><td><strong>1. Preprocess<\/strong><\/td><td>Instant runtime, merges into 1 unified DataFrame<\/td><\/tr><tr><td><strong>2. Ingest (Embed)<\/strong><\/td><td><a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/embeddings\"><code>gemini-embedding-001<\/code><\/a>, 768 dims, batches of 50 via <a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex<\/a> APIs<\/td><\/tr><tr><td><strong>3. Aggregate<\/strong><\/td><td>Procrustes rotation on anchor brands; <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> <code>n_neighbors=min(15, len-1)<\/code>, <code>min_dist=0.1<\/code>, &lt;5 sec<\/td><\/tr><tr><td><strong>4. Cluster<\/strong><\/td><td><a href=\"https:\/\/hdbscan.readthedocs.io\/en\/latest\/\">HDBSCAN<\/a> (<code>min_cluster_size=3\/5<\/code>). Several minutes via <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> Batch labelling queue<\/td><\/tr><tr><td><strong>5. Consistency<\/strong><\/td><td><code>max(0.0, min(100.0, 100.0 - (mean_dist_to_centroid * 35.0)))<\/code><\/td><\/tr><tr><td><strong>6. 
BAV join<\/strong><\/td><td><a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.linalg.orthogonal_procrustes.html\">Procrustes alignment<\/a> via rotation matrix on overlapping anchor brands<\/td><\/tr><tr><td><strong>7. Atlas UI<\/strong><\/td><td><a href=\"https:\/\/streamlit.io\/\">Streamlit<\/a> (instant runtime)<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":4,\"className\":\"is-style-text-subtitle\"} -->\n<h4 class=\"wp-block-heading is-style-text-subtitle\"><strong>Phase-by-phase commentary<\/strong><\/h4>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><strong>Phase 1 \u2014 Preprocess.<\/strong> Deceptively simple: merge CSVs, normalise brand names, filter junk. The hidden complexity is <strong>name matching<\/strong>. <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> uses official corporate names, social media uses colloquial names, and <a href=\"https:\/\/www.wikipedia.org\/\">Wikipedia<\/a> uses yet another variant. <code>preprocess.py<\/code> maintains a manual alias map for this. If you add a new brand and it doesn't appear on the map, check the alias map first \u2014 it's almost always a name mismatch.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Phase 2 \u2014 Ingest (embed).<\/strong> Each <code>Brand_Perception_Report<\/code> text gets turned into a 768-dimensional vector via <a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/embeddings\"><code>gemini-embedding-001<\/code><\/a>. The critical thing to understand: <em>these vectors encode writing style as much as meaning<\/em>. A <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> report that says \"Helpful (45.2), Reliable (38.1)\" and a <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> report that says \"this brand gives cozy reliable vibes\" will land in <strong>different<\/strong> regions of embedding space even though they describe similar perceptions. 
This is the root cause of the domain shift problem solved in Phase 3.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Phase 3 \u2014 Aggregate (the hard one).<\/strong> This is where most of the iteration happened. Three things occur in sequence:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list {\"ordered\":true} -->\n<ol class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Social aggregation:<\/strong> Multiple post-level embeddings per (Brand \u00d7 Platform) are mean-averaged into a single vector, and <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> generates a summary report. This smoothing pulls social vectors toward a shared centroid.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Procrustes alignment:<\/strong> The <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vectors are rotated into the social embedding subspace using 202 shared anchor brands (see Section 6.2 for the full story).<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong><a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> projection:<\/strong> The combined, aligned vectors are reduced to 2D. Only the <code>'All Adults'<\/code> <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> slice is fitted alongside social platforms \u2014 this prevents the 12 <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> demographic segments from dominating the topology.<\/li>\n<!-- \/wp:list-item --><\/ol>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\ud83d\udd2c <strong>Why \"balanced subset fit\" matters.<\/strong> <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> has 12 audience segments per brand. Social has ~1\u20134 data points per brand. 
Without balancing, <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> sees several times more <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> points than social points (up to 12\u00d7) and builds its neighbourhood graph around <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> structure, marginalising social data. The fix: fit <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> on the balanced subset (All Adults + social), then project the remaining 11 <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> segments with <em>transform<\/em>, without refitting. This was a non-obvious but critical design choice.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Phase 4 \u2014 Cluster.<\/strong> <a href=\"https:\/\/hdbscan.readthedocs.io\/en\/latest\/\">HDBSCAN<\/a> groups nearby points into perception themes, then <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> labels each cluster with exactly 3 words. <code>min_cluster_size<\/code> is set to 3 (toy dataset) or 5 (full dataset). The batch labelling step submits each cluster's centroid plus the 5 reports most similar to it to <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a>. This can take several minutes because it goes through the <a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex AI<\/a> Batch queue \u2014 don't assume it has hung.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Phase 5 \u2014 Consistency.<\/strong> A simple but effective metric: for each brand, compute the mean Euclidean distance from each sensor's point to the brand's centroid, then invert and scale. Brands where all sensors agree (Brand D, Brand E) score 99%+. Brands with platform-dependent perception (Brand G) score much lower. 
The <strong>* 35.0<\/strong> scaling factor was empirically tuned to spread scores across a useful range.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Phase 6 \u2014 <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> join.<\/strong> Brings in the raw 48-attribute <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> scores per demographic segment. These are the structured numbers (not the <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a>-generated prose) and power the \"Survey Audience\" filter in the dashboard.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Phase 7 \u2014 Atlas UI.<\/strong> <a href=\"https:\/\/streamlit.io\/\">Streamlit<\/a> renders everything from <a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a>. Instant startup because all computation was done in previous phases.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>5. 
Atlas interface (primary interactions)<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Focus brand selector<\/strong> \u2192 perception map + sidebar with cluster label and per-sensor summaries<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Survey audience filter (<a href=\"https:\/\/wppbav.com\/\">BAV<\/a>)<\/strong> \u2192 12 demographic segments<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Number of neighbours slider<\/strong> \u2192 controls how many perceptual neighbours are shown on the map<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Reference platform selector<\/strong> \u2192 changes the cross-modal overlap anchor (<a href=\"https:\/\/www.wikipedia.org\/\">Wikipedia<\/a>, <a href=\"https:\/\/wppbav.com\/\">BAV<\/a>, etc.)<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Competitor set toggle<\/strong> \u2192 <code>Show unexpected neighbours<\/code> (out-of-industry brands)<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\ud83c\udfaf <strong>What to look for when using the Atlas.<\/strong> The most interesting insights come from <em>disagreements<\/em> between sensors. If a brand's <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> dot and <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> dot are far apart, that's a signal: the structured survey perception (what people <em>say<\/em> when asked directly) differs from the organic social perception (what people <em>actually talk about<\/em>). The Brand vs Content Effect tab adds another layer \u2014 when you hide the brand name from social content, does the perception shift? 
If so, the brand's reputation is doing heavy lifting independent of the product itself.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>6. Domain-specific mechanics<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\">6.1 Why <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> is \"ground truth\"<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><a href=\"https:\/\/wppbav.com\/\">BAV<\/a> is survey-based, with 48 structured imagery attributes, 30+ years of longitudinal data, and 12 demographic segments. It captures deep-seated beliefs that are largely insulated from daily social flux: social algorithms change daily, whereas trait grids collected over decades reflect structured associations that move far more slowly than hype cycles.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\ud83d\udcd6 <strong>For non-specialists:<\/strong> The <a href=\"https:\/\/wppbav.com\/\">Brand Asset Valuator\u00ae (BAV)<\/a> is one of the largest brand research databases in the world, maintained by WPP. It works by asking thousands of consumers to rate brands on 48 specific attributes \u2014 things like \"Helpful,\" \"Innovative,\" \"Trustworthy\" \u2014 scored on a numeric scale. Because the same questions are asked year after year across demographic segments, <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> gives you a stable, structured snapshot of how people <em>think<\/em> about a brand when prompted. Social media gives you what people <em>spontaneously say<\/em>. 
These are fundamentally different signals, and combining them is the core challenge of this project.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\">6.2 The <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> alignment problem - a full account<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><em>This section documents the central technical challenge of the Atlas and the iterative process that solved it.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":4,\"className\":\"is-style-default\"} -->\n<h4 class=\"wp-block-heading is-style-default\">The problem<\/h4>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>When the Atlas was first built, the <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> perception map split cleanly down the middle: all <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> dots on the left, all social dots on the right, regardless of whether they described the exact same brand. Brand A's <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> point and Brand A's <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> point would be in completely different regions of the map. This made the entire visualisation useless for cross-platform comparison.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The separation was <strong>not<\/strong> evidence that <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> survey data captures genuinely different brand perceptions from social media. 
It was a <strong>methodological artefact<\/strong> with one root cause and two compounding factors in the pipeline.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":4} -->\n<h4 class=\"wp-block-heading\">Root cause: different text domains fed to the same embedding model<\/h4>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>All platforms are embedded with the same model (<a href=\"https:\/\/ai.google.dev\/gemini-api\/docs\/embeddings\"><code>gemini-embedding-001<\/code><\/a>), but the text being embedded is fundamentally different in style, vocabulary, and structure:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} -->\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th>Platform<\/th><th><code>Brand_Perception_Report<\/code> content<\/th><th>Source file<\/th><\/tr><\/thead><tbody><tr><td><a href=\"https:\/\/wppbav.com\/\"><strong>BAV<\/strong><\/a><\/td><td>LLM-generated narrative from 48 numerical imagery sensors, e.g. 
<em>\"Helpful (45.2), Reliable (38.1)\u2026\"<\/em> \u2192 Gemini prompt \u2192 prose paragraph<\/td><td><code>analysis.py<\/code>, <code>preprocess.py<\/code><\/td><\/tr><tr><td><strong><a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> \/ <a href=\"https:\/\/www.instagram.com\/\">Instagram<\/a><\/strong><\/td><td>LLM-generated perception report from watching a single video\/post<\/td><td><code>preprocess.py<\/code><\/td><\/tr><tr><td><a href=\"https:\/\/www.wikipedia.org\/\"><strong>Wikipedia<\/strong><\/a><\/td><td>LLM-generated perception from <a href=\"https:\/\/www.wikipedia.org\/\">Wikipedia<\/a> article text<\/td><td><code>preprocess.py<\/code><\/td><\/tr><tr><td><strong>LLM (<a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a>)<\/strong><\/td><td>Direct LLM perception (<a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> asked \"what do you think of brand X?\")<\/td><td>Same as Wiki source file<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> text originates from a <strong>double LLM transformation<\/strong>: raw survey numbers \u2192 <code>generate_semantic_statement()<\/code> (structured string like <em>\"Full <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> Imagery Profile (48 Sensors): Helpful (45.2), Reliable (38.1), \u2026\"<\/em>) \u2192 <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> prompt \u2192 narrative paragraph. The social text comes from a <strong>single LLM step<\/strong> interpreting raw video\/post content directly.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>This means the embedding model sees completely different linguistic distributions for <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vs social. 
The <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> narratives share a common templated style (always referencing \"imagery sensors,\" \"quantitative data,\" survey language) while social narratives use informal, media-oriented language. <strong>Embedding models encode <em>how<\/em> something is said as much as <em>what<\/em> is said<\/strong>, so this systematic style difference pushes <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vectors into a distinct cluster regardless of actual brand perception agreement.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":4} -->\n<h4 class=\"wp-block-heading\">Compounding factor 1: aggregation asymmetry<\/h4>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>In <code>aggregate.py<\/code>, social data (<a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a>, <a href=\"https:\/\/www.instagram.com\/\">Instagram<\/a>) has multiple post-level embeddings that are <strong>mean-averaged<\/strong> per (Brand, Platform), and a <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> summary replaces the report text. This smoothing pulls social vectors toward a shared centroid. <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> \/ <a href=\"https:\/\/www.wikipedia.org\/\">Wikipedia<\/a> \/ <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> data (<code>df_research<\/code>) passes through as-is with <code>post_count = 1<\/code> \u2014 no averaging occurs.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>The result: social embeddings are inherently more \"central\" (mean-regression effect), while <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> embeddings retain their full individual variance. 
When <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> runs on this combined set, the social vectors cluster tighter and the un-averaged <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vectors spread out differently.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":4} -->\n<h4 class=\"wp-block-heading\">Compounding factor 2: UMAP sees the domain gap<\/h4>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> is run on the entire combined dataset with <code>n_neighbors=15<\/code>. Because <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> embeddings share a systematic style signature different from social embeddings, <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a>'s neighbourhood graph naturally groups them apart \u2014 it finds the <strong>text-style cluster<\/strong>, not a genuine <strong>perception cluster<\/strong>.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":4} -->\n<h4 class=\"wp-block-heading\">What we tried (in order)<\/h4>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\ud83d\udd2c <strong>Option A \u2014 normalise the text domain (first attempt).<\/strong> Rewrote the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> perception report generation prompt in <code>analysis.py<\/code> to produce output that mimics the style of a social\/<a href=\"https:\/\/www.wikipedia.org\/\">Wikipedia<\/a> perception report. Specifically: removed references to \"<a href=\"https:\/\/wppbav.com\/\">BAV<\/a>,\" \"imagery sensors,\" \"quantitative data\" from the prompt. Used the same persona\/format instructions as the social platform reports \u2014 describing <em>what the brand feels like<\/em> rather than referencing the data source. The prompt became: <em>\"Based on the following consumer perception data, write a concise paragraph describing how this brand is perceived by consumers. 
Focus on: overall vibe, what people praise, what people criticise, and who the typical customer is. Write as if describing public perception \u2014 do not reference the data source or format.\"<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><strong>Result:<\/strong> Reduced but <strong>did not eliminate<\/strong> the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a>\/social separation. The templated numerical origin still leaked through in subtle ways.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\u2705 <strong>Option B \u2014 Procrustes alignment (the solution).<\/strong> Use <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.linalg.orthogonal_procrustes.html\"><code>scipy.linalg.orthogonal_procrustes<\/code><\/a> to align the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> embedding subspace to the social subspace before combining. This preserves within-platform structure while removing the cross-platform domain shift. <strong>This is what the pipeline uses today.<\/strong><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><em>Option C (embed a standardised perception schema across all platforms) remains a potential future improvement but requires significantly more work.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\u274c <strong><s>Option D \u2014 Per-platform z-score normalisation (ruled out).<\/s><\/strong> Apply per-platform z-score normalisation to embedding vectors before <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a>, centring each platform's distribution to zero mean and unit variance. This would remove the systematic offset but also <strong>mask any genuine platform-level differences<\/strong> \u2014 making it a workaround, not a proper fix. 
Discarded.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":4} -->\n<h4 class=\"wp-block-heading\">How Procrustes alignment works in the pipeline<\/h4>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\ud83d\udcd6 <strong>For non-specialists:<\/strong> Procrustes alignment is a mathematical technique from shape analysis, named after a figure in Greek mythology who stretched or cut people to fit his bed. In our context, it repositions one cloud of data points so that it best overlaps with another. Critically, the orthogonal variant used here (<code>scipy.linalg.orthogonal_procrustes<\/code>) finds only a <strong>rotation<\/strong> (spinning, possibly with a reflection), with no stretching, so it doesn't distort the internal relationships between points within each cloud. So the relative positions of <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> brands among themselves are preserved, but the entire <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> cloud is repositioned to overlap with the social cloud.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>Here's exactly what the code in <code>aggregate.py<\/code> does:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list {\"ordered\":true} -->\n<ol class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Finds the anchors.<\/strong> Identifies every brand that exists in <em>both<\/em> the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> data and the social data (e.g., Brand A exists in both). 
In the production pipeline log, this step found <strong>202 anchor brands<\/strong>.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Computes the transformation.<\/strong> For each anchor brand, computes the mean social vector across all its social platforms, centres both the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> anchor vectors and the social anchor vectors, and then uses <a href=\"https:\/\/docs.scipy.org\/doc\/scipy\/reference\/generated\/scipy.linalg.orthogonal_procrustes.html\"><code>scipy.linalg.orthogonal_procrustes<\/code><\/a> to find the optimal <strong>rotation matrix R<\/strong> that maps <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> \u2192 social.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Applies the rotation.<\/strong> Multiplies <em>all<\/em> <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> vectors (even brands that didn't have social data) by this rotation matrix, moving the entire <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> dataset into the social media spatial domain.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Logs quality.<\/strong> Reports the Frobenius residual so alignment quality is traceable.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Safeguard.<\/strong> If fewer than 10 shared brands exist, alignment is skipped with a warning \u2014 Procrustes is unreliable with too few anchors.<\/li>\n<!-- \/wp:list-item --><\/ol>\n<!-- \/wp:list -->\n\n<!-- wp:heading {\"level\":4} -->\n<h4 class=\"wp-block-heading\">How this changes interpretation of the dashboard<\/h4>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>This alignment substantially changes what you can conclude from the perception map:<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>True cross-platform comparisons.<\/strong> If a <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> dot and a <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> 
dot for \"Brand B\" sit right next to each other, it now genuinely implies that the core sentiment in the structured survey data closely matches the organic social conversations. Before Procrustes, proximity between <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> and social points was meaningless.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Distances have semantic meaning.<\/strong> By forcibly removing the structural domain shift, any remaining distance between two points is entirely due to a difference in <em>meaning and perception<\/em>. If the <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> point for a brand is far from its Instagram point, you can confidently analyse that gap as a genuine difference in audience perception or marketing strategy \u2014 not just an artefact of different text formatting.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Unified clustering.<\/strong> When <a href=\"https:\/\/hdbscan.readthedocs.io\/en\/latest\/\">HDBSCAN<\/a> runs over this aligned space, it can finally cluster <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> reports together with social reports. The LLM-generated theme labels now encompass insights drawn from both quantitative surveys and viral videos simultaneously.<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\">6.3 UMAP parameter sensitivity<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><code>min_dist=0.1<\/code>, <code>n_components=2<\/code>, <code>metric='cosine'<\/code>. <strong>Balanced subset fit<\/strong>: only <code>'All Adults'<\/code> <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> slice is fitted alongside social platforms, preventing 12 <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> audiences from overpowering topology. 
<a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> was chosen over t-SNE because it enables saving the <code>reducer<\/code> object \u2014 the pipeline strictly <em>fits<\/em> on one balanced subset and passively <em>transforms<\/em> newly injected demographics.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\u26a0\ufe0f <strong>Watch out:<\/strong> Changing <code>min_dist<\/code> has outsized effects on the map. Lower values (e.g., 0.01) create tighter, more separated clusters \u2014 visually dramatic but can split genuinely related brands. Higher values (e.g., 0.5) spread everything into a uniform blob. The current <code>0.1<\/code> was chosen as a balance after visual inspection across multiple brand sets. If you change it, re-check whether brands with known perceptual similarity (e.g., brands in the same industry) still land in the same neighbourhood.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\">6.4 Cluster labelling<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Automated via LLM. <a href=\"https:\/\/cloud.google.com\/vertex-ai\">Vertex AI<\/a> Batch submits centroid + top 5 reports (by cosine similarity) to <a href=\"https:\/\/cloud.google.com\/ai\/gemini\">Gemini<\/a> \u2192 <strong>exactly 3 words<\/strong>. The 3-word constraint forces abstraction \u2014 \"Hope Innovation Compassion\" rather than a paragraph. If labels feel wrong, the issue is almost always that the cluster itself is incoherent (check <a href=\"https:\/\/hdbscan.readthedocs.io\/en\/latest\/\">HDBSCAN<\/a>'s <code>min_cluster_size<\/code>), not that the LLM mislabelled it.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\">6.5 Omnichannel consistency<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><code>100.0 - (mean Euclidean distance * 35.0)<\/code>, clamped 0\u2013100%. 
Tight overlaps hit 99%+ (Brand D, Brand E), dispersed shifts drop fast (Brand G). The <code>* 35.0<\/code> multiplier is an empirically tuned scaling factor \u2014 if you add new sensors or change the embedding model, you may need to recalibrate it so scores distribute meaningfully across 0\u2013100%.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading {\"level\":3} -->\n<h3 class=\"wp-block-heading\">6.6 Content vs brand effect methodology<\/h3>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>Tracks <code>shift_2d<\/code> (Euclidean magnitude), <code>cos_shift<\/code> (cosine diff between brand known\/brand unknown), and <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> baseline deltas (<code>bav_delta_known<\/code> vs <code>bav_delta_unk<\/code>). Parses exact LLM cluster words added\/lost due to brand awareness.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\ud83d\udca1 <strong>Why this matters.<\/strong> Social media perception reports are generated from video\/post content. When the brand name is visible, the LLM's perception is coloured by everything it \"knows\" about that brand. When the brand name is hidden, the LLM can only react to what it actually sees in the content. The delta between these two tells you how much of a brand's social perception is driven by <em>brand reputation<\/em> vs <em>actual content quality<\/em>. Large shifts indicate the brand name is doing heavy lifting.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>7. Module map<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><em>The codebase is intentionally small. Every module does one thing. 
If you're debugging, start by identifying which phase failed (check the CLI output), then go straight to the corresponding file.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:code -->\n<pre class=\"wp-block-code\"><code>brand_perception\/dashboard\/atlas_pipeline\/\n\u251c\u2500\u2500 main.py              (43 lines)    CLI entry point\n\u251c\u2500\u2500 dashboard_v1.py      (~1133 lines) Core Streamlit frontend\n\u2514\u2500\u2500 src\/pipeline\/\n    \u251c\u2500\u2500 preprocess.py    (130 lines)   Sanitizes, normalises, filters into LanceDB schemas\n    \u251c\u2500\u2500 ingest.py        (126 lines)   GenAI models \u2192 768-D embeddings\n    \u251c\u2500\u2500 aggregate.py     (303 lines)   Procrustes, LLM reports, UMAP layouts\n    \u2514\u2500\u2500 cluster.py       (244 lines)   HDBSCAN groups + batch cluster labels\n<\/code><\/pre>\n<!-- \/wp:code -->\n\n<!-- wp:paragraph -->\n<p><strong>Total:<\/strong> ~1,979 lines across 6 modules.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\ud83d\udcd6 <strong>Where the complexity lives.<\/strong> The line counts understate the complexity. <code>aggregate.py<\/code> at 303 lines is where 80% of the intellectual difficulty sits \u2014 it handles social aggregation, Procrustes alignment, and <a href=\"https:\/\/umap-learn.readthedocs.io\/en\/latest\/\">UMAP<\/a> projection. <code>dashboard_v1.py<\/code> at ~1,133 lines is the largest file but is mostly <a href=\"https:\/\/streamlit.io\/\">Streamlit<\/a> layout code. If you're onboarding, read <code>aggregate.py<\/code> first; it's where the science meets the engineering.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>8. Test coverage<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p>~6 functional \/ integration tests. 
Runtime &lt;1 minute.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"medium\"} -->\n<figure class=\"wp-block-table has-medium-font-size\"><table><thead><tr><th>Test file<\/th><th>Purpose<\/th><th>Count<\/th><\/tr><\/thead><tbody><tr><td><code>scripts\/test_bav_pipeline.py<\/code><\/td><td>BAV baseline ingestion flow<\/td><td>1<\/td><\/tr><tr><td><code>dev\/test_apify*<\/code><\/td><td>Social Scraper (TikTok\/Instagram)<\/td><td>2<\/td><\/tr><tr><td><code>brand_perception\/api\/test_agent.py<\/code><\/td><td>Job Orchestrator backend queues<\/td><td>2<\/td><\/tr><tr><td><code>research\/test_scrape_jh.py<\/code><\/td><td>Manual methodology mockups<\/td><td>1<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>9. Design decisions<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><em>These are the answers to questions that came up during development where the wrong choice would have broken the system.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:table {\"hasFixedLayout\":false,\"fontSize\":\"small\"} -->\n<figure class=\"wp-block-table has-small-font-size\"><table><thead><tr><th>ID<\/th><th>Decision<\/th><th>Rationale<\/th><th>What happens if you reverse it<\/th><\/tr><\/thead><tbody><tr><td>DD-BPA-1<\/td><td>UMAP over t-SNE<\/td><td>Better distance proportionality; enables saving <code>reducer<\/code> to <em>fit<\/em> on balanced subset and <em>transform<\/em> new demographics<\/td><td>t-SNE can't transform new points \u2014 you'd have to re-run the entire projection every time a new BAV demographic segment is added, and distances between clusters become meaningless<\/td><\/tr><tr><td>DD-BPA-2<\/td><td>BAV as ground-truth anchor<\/td><td>Survey-based trait grids over decades, shielded from daily social 
hype<\/td><td>Using social as ground truth would anchor the map to volatile, algorithm-dependent signals \u2014 the map would shift with every TikTok trend cycle<\/td><\/tr><tr><td>DD-BPA-3<\/td><td>Semantic embeddings over keywords<\/td><td>Captures meaning (\"Luxury\" \u2248 \"Premium\" \u2248 \"Prestigious\")<\/td><td>Keyword-based approaches treat \"Luxury\" and \"Premium\" as unrelated tokens \u2014 brands described with different vocabulary but identical perception would never cluster together<\/td><\/tr><tr><td>DD-BPA-4<\/td><td>Procrustes alignment<\/td><td>Solves text heterogeneity (surveys vs social) via rotation on anchor brands. Prompt normalisation alone (Option A) reduced but did not eliminate the domain gap<\/td><td>Without it, the map splits by text style (BAV left, social right) rather than by brand perception \u2014 see Section 6.2 for the full account<\/td><\/tr><tr><td>DD-BPA-5<\/td><td>Local-first public release (no API keys required)<\/td><td>Lowers barrier to entry; toy dataset enables immediate exploration without infrastructure<\/td><td>Requiring API keys upfront would prevent most people from ever trying the tool \u2014 the toy dataset lets someone see the full Atlas UI in under 5 minutes<\/td><\/tr><\/tbody><\/table><\/figure>\n<!-- \/wp:table -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>10. Extending the system<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><em>The pipeline was designed to be extended. Each of these is a realistic next step, listed in order of effort.<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:list {\"ordered\":true} -->\n<ol class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Run with custom data<\/strong> \u2014 Format your own dataset as a CSV matching the toy dataset schema (Brand, Industry, Platform, Survey_Audience, Brand_Perception_Report), drop it into the data directory, and run the default mode. 
This is the lowest-effort way to test the Atlas on a new domain.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Add a sensor<\/strong> \u2014 Collect as CSV\/Parquet, import in <code>preprocess.py<\/code> under <code>FINAL_COLUMN_ORDER<\/code> (<code>Super_Platform<\/code>, <code>Year<\/code>, <code>Brand<\/code>, <code>Raw_Text<\/code>), run <code>main.py<\/code>. The Procrustes alignment will automatically include the new sensor in its anchor calculation if the new sensor shares brands with existing sources.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Add a market<\/strong> \u2014 Reroll <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> datasets in <code>.\/paths<\/code>, override GCS env vars, rerun batches. Note that Procrustes alignment quality depends on having enough shared anchor brands between <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> and social data \u2014 if you enter a market where <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> coverage is thin, check the Frobenius residual in the logs.<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Temporal tracking<\/strong> \u2014 <a href=\"https:\/\/docs.lancedb.com\/quickstart\">LanceDB<\/a> already stores <code>Year<\/code> and <code>BAV_Study<\/code>; add a slider toggle in <code>dashboard_v1.py<\/code>. This would let you see how a brand's perception drifts over time across sensors \u2014 one of the most requested features.<\/li>\n<!-- \/wp:list-item --><\/ol>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\u26a0\ufe0f <strong>If you add a new sensor:<\/strong> remember that the Procrustes alignment currently rotates <a href=\"https:\/\/wppbav.com\/\">BAV<\/a> into the social subspace specifically. If your new sensor has a similarly distinct text style (e.g., Reddit comments vs <a href=\"https:\/\/www.tiktok.com\/\">TikTok<\/a> captions), you may see a new domain gap. 
In that case, consider extending the alignment step to handle multiple source-target pairs, or grouping sensors into \"formal\" and \"informal\" categories for alignment.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\"><strong>11. Results (end-to-end validation)<\/strong><\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Internal coverage:<\/strong> 200+ brands, 4,000+ data points across 5 modalities, 12 demographic segments<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Public toy dataset:<\/strong> 20 brands, 681 perception reports across 7 industries, all 5 sensor types + brand known\/brand unknown variants<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Cross-industry insight validation:<\/strong><!-- wp:list -->\n<ul class=\"wp-block-list\"><!-- wp:list-item -->\n<li><strong>Omnichannel consistency:<\/strong> Brand D, Brand E at 99%+; Brand G identified as multi-faceted<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Shared equity, different vibe:<\/strong> Brand H \u2194 Brand G (close on <a href=\"https:\/\/wppbav.com\/\">BAV<\/a>, far on socials)<\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Different equity, shared vibe:<\/strong> Brand K \u2194 Brand L (far on <a href=\"https:\/\/wppbav.com\/\">BAV<\/a>, converged on socials)<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list --><\/li>\n<!-- \/wp:list-item -->\n\n<!-- wp:list-item -->\n<li><strong>Validation metrics:<\/strong> Procrustes Residuals for subspace overlap + brand known\/brand unknown cosine similarity differentials<\/li>\n<!-- \/wp:list-item --><\/ul>\n<!-- \/wp:list -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p>\ud83d\udca1<strong>How to read these results.<\/strong> The \"shared equity, different 
vibe\" and \"different equity, shared vibe\" patterns are the most commercially interesting findings. They reveal cases where a brand's formal positioning (<a href=\"https:\/\/wppbav.com\/\">BAV<\/a>) disagrees with its organic social presence \u2014 exactly the kind of insight that's invisible to either data source alone. The Atlas's value proposition is making these cross-modal disagreements visible and quantifiable.<\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:separator -->\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n<!-- \/wp:separator -->\n\n<!-- wp:heading -->\n<h2 class=\"wp-block-heading\">\ud83d\udd17 Repositories<\/h2>\n<!-- \/wp:heading -->\n\n<!-- wp:paragraph -->\n<p><strong>GitHub<\/strong>: <em>(GitHub repo coming soon)<\/em><\/p>\n<!-- \/wp:paragraph -->\n\n<!-- wp:paragraph -->\n<p><\/p>\n<!-- \/wp:paragraph -->","content_quarter":"Q1 2026","related_pods":["120"],"featured":"","legacy_perspective_source_id":""},"_links":{"self":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/research_feed\/1651","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/research_feed"}],"about":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/types\/research_feed"}],"author":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/users\/19"}],"acf:post":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/research_pods\/120"}],"wp:attachment":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1651"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1651"},{"taxonomy":"content_type","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcontent_types&post=1651"},{"taxonomy":"author","embeddable":true,"href":"http
s:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=1651"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}