Author: Jaclyn Harron

  • Brand Perception Atlas: Mapping the modern brand, from social signal to core equity

    Introduction

    A brand is what people repeatedly, collectively, and emotionally decide it is, not simply what the company declares.

    In 2026, those decisions are happening everywhere: on Instagram and TikTok, in reviews, in news cycles, in comment sections, and in the accumulated memory of long-term brand equity.

    For organisations attempting to understand brand perception, this environment presents a fundamental challenge. The volume of available data is unprecedented, yet the signals it produces are often inconsistent and contradictory. Traditional research tools such as surveys and focus groups remain essential for measuring brand equity, but they capture perception only at specific points in time and cannot fully reflect the fast-moving nature of digital conversation.

    At the same time, social media offers a continuous stream of public commentary, revealing how brands are discussed, interpreted, and compared in everyday discourse. However, these signals are noisy and difficult to interpret in isolation.

    The Brand Perception Atlas was built to turn that noise into a map, designed to integrate diverse perception signals into a unified analytical framework. By combining social media data, public knowledge sources, Large Language Model (LLM) summaries, and established brand equity research, the Atlas aims to provide a more comprehensive understanding of how brands are perceived across the digital ecosystem.

    The first iteration of this project analysed perception signals for more than 200 brands and over 4,000 source items across the included sensors, covering the US market in 2025–26. This enabled the construction of a visual representation of brand perception that reveals relationships, consistencies, and divergences across multiple sources.

    A short video walkthrough of an anonymised version of the Atlas can be viewed below ⬇️

    The Brand Perception Atlas

    The Brand Perception Atlas functions as a navigational system for brand strategy. Individual data points reveal little on their own, but when thousands of signals are mapped together, larger patterns come into view.

    The Brand Perception Atlas turns those scattered signals into a shared perceptual map, showing how brands cluster, where they compete, and which meanings they occupy in the public imagination.

    To achieve this, the Atlas synthesises perception signals from several sources:

    • Content from official brand accounts on platforms such as TikTok and Instagram
    • Public narratives reflected in sources like Wikipedia
    • AI-generated summaries, using Gemini, describing how brands are perceived in LLM-based discourse
    • Survey-based brand equity data from the WPP Brand Asset Valuator® (BAV), which anchors the analysis in long-term brand perception

    BAV holds a special place among brand-perception sensors. Developed by WPP, BAV is one of the world’s largest and longest-running brand equity studies, spanning more than three decades, thousands of brands, and multiple markets. Unlike social and digital signals, which infer perception from public behaviour, BAV measures it directly by asking consumers what they believe. That makes it the Atlas’s anchor: not a snapshot of what people are saying today, but a benchmark for what they have come to believe over time.

    BAV captures this through 48 standardised imagery attributes, from functional traits such as “Reliable” and “High Quality” to more emotional cues such as “Charming”, “Daring”, and “Friendly”. In the Atlas, those attributes provide a high-resolution view of brand meaning, making it possible to see exactly which dimensions of perception define a brand’s underlying equity.

    Table 1 presents samples from all Atlas sensors for two well-known brands: Brand A (a major US retailer) and Brand B (a global travel platform). The table illustrates how the same brand can be perceived very differently across sensors, highlighting the need for a tool like the Brand Perception Atlas, which brings these diverse perspectives together into a single, contextualised view.

    Sample brand perception reports
    | Source | Brand A | Brand B |
    | --- | --- | --- |
    | Survey (BAV) (long-term equity) | “Consumers perceive this brand as a highly accessible and dependable choice, offering excellent value for money. It consistently earns praise for its reliable, high-quality, and original offerings.” | “Perceived as a highly original, daring, and progressive leader in the travel space. It is seen as a ‘cool’ and ‘friendly’ brand that offers unique, high-quality experiences.” |
    | Wikipedia (public narrative) | “Positioned as offering upscale products at below-average costs, appealing to a younger, more educated, and higher-income demographic.” | “A global travel platform. However, its narrative is often complicated.” |
    | Gemini LLM (digital discourse) | “Perceived as a clean, organised, and pleasant one-stop-shop that blends everyday necessities with trendy, affordable finds.” | “Widely viewed as a pioneer of authentic travel, praised for design and convenience.” |
    | Instagram (brand official account) | “An accessible and enjoyable retail destination… The playful and organised presentation of shopping reinforces a positive, discovery-driven customer experience.” | “A visually stunning showcase of bucket-list stays and architectural marvels. The brand projects an aspirational yet community-focused vibe.” |
    | TikTok (brand official account) | “Widely seen as a trendy, accessible retailer offering stylish, curated products and collaborations.” | “Highly energetic and trend-focused. The vibe is one of discovery and adventure, making global travel feel personal and attainable.” |

    Table 1: Sample brand perception reports, for Brand A and Brand B, from the Brand Perception Atlas dataset

    Core brand identity

    When a focus brand is selected, the Atlas reveals its core brand identity: a breakdown of the top perceptual themes by channel. This provides an at-a-glance view of what each sensor is saying about the brand, and where those signals align or diverge.

    For example, selecting Brand C (a global technology company) surfaces a clear identity profile. On BAV, the brand is perceived as Reliable, Innovative, and Intelligent. On Wikipedia, the dominant themes shift to Innovative, Dominant, and Controversial. Each sensor contributes a different angle, but together they reveal the full perceptual picture.

    Figure 1: Screenshot of the selected brand’s core brand identity

    Mapping brand perception

    At the centre of the Atlas is the Perception Map, a visual representation of how brands relate to one another in terms of public perception.

    Each perception signal is converted into a numerical representation of semantic meaning, known as an embedding. A helpful way to think about semantic meaning is as the “essence” of a word rather than the word itself: although “Luxury” and “High-end” are lexically different, they convey very similar meanings. The Atlas uses Gemini semantic embeddings to capture these relationships and measure the similarity between terms such as “Luxury,” “Premium,” “Prestigious,” and “High-End.” Because their meanings are closely related, the system places brands characterised by these words in the same neighbourhood on the map.
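
    As a toy illustration of how embedding proximity works (the vectors below are invented and three-dimensional; real Gemini embeddings have hundreds of dimensions):

    ```python
    import numpy as np

    def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
        # 1.0 = same direction (same meaning); near 0 = unrelated meanings.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Hypothetical toy embeddings, invented purely for illustration.
    luxury   = np.array([0.92, 0.31, 0.10])
    high_end = np.array([0.89, 0.35, 0.12])
    budget   = np.array([0.05, 0.20, 0.97])

    print(cosine_similarity(luxury, high_end))  # ~1.0: same neighbourhood on the map
    print(cosine_similarity(luxury, budget))    # much lower: a different territory
    ```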

    The resulting map resembles a landscape of brand meaning, allowing brand leaders to identify clusters of brands that share common associations and spot outliers that occupy distinctive perceptual positions.

    Omnichannel consistency

    One of the most useful insights derived from the Atlas is a metric that we refer to as omnichannel consistency. This measure evaluates how closely aligned a brand’s perception is across different information sources. If the signals derived from social media, surveys, and public narratives cluster tightly together, the brand is communicating a consistent identity. Conversely, if these signals are widely dispersed, the brand projects different identities across different channels.

    Analysis of the dataset identified several brands with exceptionally strong consistency across channels, including Brand D (an industrial equipment manufacturer), Brand E (a heavy equipment manufacturer), and Brand F (a health insurance provider), each showing more than 99% omnichannel consistency. Brand D is a clear example: its core perception of rugged reliability remains stable whether measured through a 30-year longitudinal BAV survey or reflected in viral TikTok videos. All Brand D signals appear tightly clustered on the perception map across every sensor.

    The brand’s digital content bridges historical heritage and modern utility. This alignment is not merely aesthetic but is deeply rooted in a consistent cross-platform narrative.

    • On BAV, the brand is perceived as a “formidable and dependable leader,” successfully blending a “rugged” and “traditional” foundation with a “distinctively cool and stylish” upper-class appeal. It is consistently rated as a “best-in-class” choice that is “worth more,” signalling its status as a premium investment.
    • On Gemini, the narrative reinforces this by projecting the “vibe of a legacy American icon built on reliability.” The discourse centres on the “durability and longevity” of the equipment, where the brand’s well-known slogan continues to underpin a reputation for high performance.
    • On social sensors, the brand’s perception is characterised by “practical innovation”. Reports highlight a “strong brand affinity” built through “real-life customer stories centred around dedication and essential work.” While the content celebrates the “versatile and dependable” nature of the machinery for tasks like snow removal, it also reflects a modern tension: the high cost of entry and proprietary technology, which mirrors the BAV finding that the brand is perceived as a “significant investment.”

    At the other end of the spectrum, Brand G (a global hospitality company) occupies distinct perceptual territories across platforms, with its social media presence emphasising aspirational luxury and curated experiences, while its long-term equity centres on reliability and traditional prestige. As shown in Figure 2, the Atlas surfaces this as a clear pattern, one that may reflect a deliberate strategy to engage different audiences through different channels.

    • The BAV sensor (bottom right in Figure 2) positions the brand within a territory of reliability, superior quality, and prestigious appeal, with BAV reports describing it as “a beacon of established excellence and prestige… commanding a perception of leadership and trustworthiness firmly rooted in tradition.”
    • The social sensors (Instagram and TikTok, upper right corner in Figure 2) position the brand in a complementary territory defined by aspirational luxury and curated escapism. In these digital spaces, Brand G’s presence is human-centric and inclusive, with official reports describing an organisation that “champions individuals” and demonstrates “deep cultural understanding.”

    Figure 2: Screenshot of the Brand Perception Atlas showing the dispersed points of Brand G

    The variance in the data speaks to a brand successfully managing a legacy reputation while aggressively chasing a modern, inclusive digital identity (see Table 2).

    | Sensor | Key perceptual snippet | Core narrative |
    | --- | --- | --- |
    | BAV | “This brand is primarily perceived as highly reliable… a beacon of established excellence and prestige. It commands a perception of leadership and trustworthiness firmly rooted in tradition.” | The anchor: focuses on dependability, classic excellence, and proven heritage. |
    | Instagram | “Brand G… projects a vibe of inclusivity, reliability, and cultural understanding. Praised for its global reach, diverse workforce, and initiatives focused on social responsibility.” | The bridge: humanises the giant; focuses on “people first” and cultural connectivity. |
    | TikTok | “Cultivates an aspirational and exclusive vibe… promoting global travel and luxury experiences… purveyor of curated escapism and wellness.” | The future: targets the frequent traveller seeking sophisticated, unique retreats and indulgence. |

    Table 2: Sample brand perception text, for Brand G, from the Brand Perception Atlas dataset

    Interestingly, multiple travel-related brands exhibited high dispersion across the semantic perception space. The high volume of online discussion surrounding travel experiences, ranging from positive stories to customer complaints, may contribute to a more fragmented perception environment for brands in this sector.

    Omnichannel consistency is not inherently good or bad. Some brands benefit from a tightly aligned identity across platforms, while others thrive by expressing different facets of themselves in different contexts. In categories such as entertainment, fashion, and travel, more fragmented perception may reflect adaptability and cultural relevance rather than weakness.

    For this reason, the consistency metric is diagnostic, not prescriptive. It shows where a brand sits on the spectrum between a unified and multi-faceted perception, helping leaders assess whether that pattern aligns with their strategy.

    Shared equity, different vibe (close on BAV, far on socials)

    The Atlas also reveals unexpected relationships between brands in completely different industries. One of the clearest patterns appears when two brands share a similar equity foundation but project very different identities on social media.

    A good example is Brand H (an industrial conglomerate) and Brand G. At first glance, they are not intuitive neighbours. Brand H is associated with science, engineering, and industrial innovation, while Brand G is associated with hospitality, travel, and aspirational leisure. On social media, they occupy very different cultural spaces, and in the Atlas’s social sensors they sit far apart.

    | Source | Brand H | Brand G |
    | --- | --- | --- |
    | BAV | “Projects a vibe of rugged, energetic reliability combined with a visionary, original spirit… praised for its trustworthiness and distinctively high quality.” | “Perceived as superior quality and high-performing… described as unique, stylish, and authentically simple. Appeals to a sophisticated, aspirational lifestyle.” |
    | Gemini | “Deeply divided: seen as a legacy American innovator of household staples, but its reputation is tarnished by high-profile legal and environmental controversies.” | “A vast corporate giant in hospitality. Reputation has a dual identity: a provider of aspirational luxury vs. an impersonal entity with inconsistent service.” |
    | Instagram | “Science and technology powerhouse, praised for its problem-solving capabilities. Emphasises STEM education and its role in enabling future technologies.” | “Cultivates a multifaceted image as a global provider known for community, sustainability, and inclusivity. Projects a vibe of reliability and cultural understanding.” |

    Table 3: Sample brand perception reports, for Brand H and Brand G, from the Brand Perception Atlas dataset

    Content such as Brand H’s Instagram posts around a youth science initiative creates a perception centred on innovation, education, and community. The brand appears inspiring, responsible, and forward-looking. This is very different from Brand G, whose social presence is shaped by luxury, travel, and experience.

    However, the BAV sensor tells a different story. At the level of deeper brand equity, Brand H and Brand G emerge as close neighbours because both are anchored by reliability and leadership. In the consumer mind, Brand H functions as an innovation backbone, while Brand G functions as a service backbone. Their social expressions differ, but their underlying equity plays a similar emotional role: both are seen as dependable institutions.

    A similar pattern appears with Brand I (a membership retail chain) and Brand J (a US airline), as shown in Table 4. These brands belong to very different categories, yet on the BAV sensor they appear close together within a shared “consumer champion” territory. Both brands are anchored by associations such as friendliness and reliability. Brand I is strongly linked to simplicity, while Brand J is associated with value. At a foundational level, both occupy a similar emotional space: trusted brands that provide essential services without the friction consumers often expect from their industries.

    | Source | Brand I | Brand J |
    | --- | --- | --- |
    | BAV | “Predominantly perceived as high-value, reliable, and authentic… offers helpful and intelligent solutions while demonstrating a commitment to equality.” | “Largely perceived as fun, cool, and friendly… valued for its distinctiveness and high quality, contributing to a perception of being trendy and energetic.” |
    | Wikipedia | “A highly successful, global membership-only club known for its value and its strong private-label brand.” | “A hybrid low-cost carrier that disrupted the airline industry with premium amenities. Recently faced scrutiny over alliances and operational reliability.” |
    | Gemini | “Widely perceived as a members-only ‘treasure hunt,’ fostering a cult-like loyalty. Praised for unbeatable value, but criticised for a chaotic in-store experience.” | “A trendy, modern airline struggling to live up to its reputation. While the in-flight experience is praised, operational reliability is a significant pain point.” |
    | Instagram | “Beloved, value-driven club offering a unique shopping experience. Transformed mundane shopping into a leisure activity and shareable ‘haul’ content.” | “Projects a customer-centric and approachable vibe. Highlights above-average amenities and playful brand interactions compared to typical budget carriers.” |

    Table 4: Sample brand perception reports, for Brand I and Brand J, from the Brand Perception Atlas dataset

    This pattern is diagnostically useful, as it shows that brands can share the same equity backbone while expressing themselves very differently across platforms. Brand I doubles down on functional value, while Brand J leans into aspiration and lifestyle.

    Despite operating in different categories and producing very different content, these brands serve the same emotional role for consumers.

    Different equity, shared vibe (far on BAV, close on socials)

    The Atlas also reveals brands that follow the opposite pattern: brands from very different industries that share little underlying equity, yet converge into a similar “vibe” on social media. In these cases, the social layer acts as a cultural blender, pulling very different brands into the same perceptual neighbourhood.

    A strong example is Brand K (a packaged food company) and Brand L (a food and beverage manufacturer). At the level of long-term brand equity, these brands occupy distinct positions. Brand K is perceived as a “highly reliable, high-performance leader” whose vibe blends “traditional prestige” with a “glamorous and daring appeal.” Brand L, by contrast, is seen as an “original and unique leader” that fuses a “traditional foundation with a trendy, cool, and dynamic aesthetic,” praised for its “rugged appeal and fun character.”

    According to the BAV sensor, while both are established food brands, they occupy meaningfully different perceptual territories: Brand K anchored in prestige and reliability, Brand L in originality and rugged charm.

    Yet on social media, the distinction between them fades. On Instagram and TikTok, both brands converge into a shared neighbourhood of nostalgic comfort and family-friendly Americana.

    According to these social sensors, Brand K “cultivates a vibe of reliable convenience and family-friendly nostalgia,” engaging audiences with creative recipe ideas rooted in American food culture. Brand L similarly “evokes a strong sense of nostalgic comfort and reliable quality,” projecting tradition, community, and corporate social responsibility. On social platforms, both brands occupy the same emotional space: trusted pantry staples working to stay relevant with modern, health-conscious consumers.

    | Source | Brand K | Brand L |
    | --- | --- | --- |
    | BAV | “Perceived as a highly reliable, high-performance leader offering exceptional quality. Its vibe is one of traditional prestige blended with a glamorous and daring appeal, reflecting an intelligent and customer-caring image.” | “Widely perceived as an original and unique leader, skilfully blending a traditional foundation with a trendy, cool, and dynamic aesthetic. Praised for its high quality, rugged appeal, and fun character.” |
    | Wikipedia | “A long-standing American multinational food manufacturer with a diverse portfolio. Focused on expanding health-conscious offerings and sustainability, though it has faced scrutiny over legal terms and health claims.” | “A long-established and highly diversified American food and beverage manufacturer. Known for an aggressive acquisition strategy that has transformed it into a Fortune 500 company with a broad portfolio.” |
    | Gemini LLM | “Widely perceived as a dependable, nostalgic American staple increasingly viewed through a lens of corporate scrutiny. Discussions often focus on high sugar content.” | “Quintessential American heritage brand evoking nostalgia and comfort. While loyalty is built on consistent taste, the brand’s reputation has been challenged.” |
    | Instagram | “Cultivates a vibe of reliable convenience and family-friendly nostalgia. Engages a wide audience with creative recipe ideas while facing criticism regarding the nutritional content of its processed foods.” | “Projects a vibe of a reputable company with an emphasis on tradition and community. Praised for its deep-rooted history and significant corporate social responsibility initiatives, including sustainability advancements.” |
    | TikTok | “Presents a comfortable, family-friendly image rooted in American food culture. Struggles to fully resonate with health-conscious consumers due to a perception of limited innovation in its core processed staples.” | “Evokes a strong sense of nostalgic comfort and reliable quality. Working to engage younger audiences by promoting modern, sustainable practices and creative usage of its familiar pantry staples.” |

    Table 5: Sample brand perception reports, for Brand K and Brand L, from the Brand Perception Atlas dataset

    In this digital layer, Brand K’s recipe-driven content and Brand L’s heritage-focused storytelling occupy the same neighbourhood. They are no longer distinguished by their different equity profiles; instead, they are unified as nostalgic American pantry brands that use family-friendly content and tradition to build emotional resonance with their audiences.

    Conclusion

    Understanding brand perception has always been a central challenge in marketing and brand strategy. In the digital era, the challenge is no longer a lack of information, but an excess of it. Organisations now face a paradox: more data than ever, yet less clarity about what it actually means.

    In a world where social media amplifies attention without always reshaping perception, the real strategic advantage lies in understanding the gap between visibility and meaning. The Brand Perception Atlas makes that gap visible. It shows where brands cluster, where they drift, and where surface-level conversation either reinforces or obscures deeper brand equity. In doing so, it helps brand leaders understand not just what people are saying today, but how those conversations connect to the deeper beliefs that shape brand meaning over time.

    The real challenge is not simply tracking what a brand did yesterday. It is understanding what that brand means, what territory it occupies in people’s minds, and how difficult that territory is to shift.

    Ready to explore the specifics? Read our full technical deep dive into the Brand Perception Atlas Pod for a closer look at our methodology.

    Disclaimer: This content was created with AI assistance. All research and conclusions are the work of WPP Research.

  • Brand Perception Atlas Pod

    Brand leaders today are navigating without a map. Social media says one thing, surveys say another, and AI-generated content adds yet another layer, leaving strategists with no reliable way to know whether their brand’s identity is consistent across channels. The Brand Perception Atlas solves this by bringing together five different sources of brand perception (TikTok, Instagram, Wikipedia, AI-generated summaries, and WPP’s 30-year Brand Asset Valuator® (BAV) survey) into a single, visual map covering over 200 brands and 4,000+ data points. By placing all of these different signals on one comparable scale, it makes it possible for the first time to see how a brand looks across platforms side by side. Several notable patterns emerge from the analysis. Some brands, e.g. Brand D, an industrial equipment manufacturer, maintain a unified identity everywhere, while others, e.g. Brand G, a global hospitality company, occupy distinct perceptual territories across platforms, which may reflect a deliberate strategy to engage different audiences. The study also identifies “shared equity” pairs: brands in different industries that perform the same emotional job for consumers, despite looking different online.

    If you don’t care about the technical details, read our blog post instead. The GitHub repo is also coming soon.

    The Brand Perception Atlas – a technical deep dive

    The Brand Perception Atlas is an interactive decision-support tool that helps brand teams understand, compare, and explain brand perception across platforms. It combines embedding-space visualisation (UMAP), interpretable clusters, and cross-platform consistency scoring. The result is a tool that moves teams from “interesting maps” to recommendations that are better documented and easier to explain, because they link back to specific underlying perception signals and clearly show where sources agree or disagree. These can be used in brand reviews, competitor analysis, and campaign planning.

    Goal: a newcomer can reproduce the core outcomes in 2–3 days. For full setup and running instructions, refer to the GitHub README (GitHub repo coming soon) — this walkthrough provides the conceptual map and domain knowledge needed to understand what the code does and why.

    🧭 A note on what made this hard. The Brand Perception Atlas looks deceptively simple — embed text, project it, cluster it, display it. In reality, the single hardest problem was getting five fundamentally different perception sources to coexist in a shared space where distances actually mean something. WPP Brand Asset Valuator® (BAV) data comes from structured survey scores transformed through an LLM (Gemini) into prose. Social data comes from raw video transcripts and post captions, also transformed through an LLM (Gemini). Even though both pass through the same embedding model, the linguistic fingerprint of each source dominates the resulting vectors — the map would split cleanly into “survey-sounding text” vs “social-sounding text” rather than grouping brands by actual perception. Solving this required iterating through prompt normalisation, aggregation rebalancing, and ultimately Procrustes alignment — a technique borrowed from shape analysis that rotates one embedding subspace onto another using shared anchor brands. Section 6 tells this story in full.

    1. Architecture overview

    Key constraint: All five perception sources must be projected into a shared embedding space so that distances are comparable across sensors. The pipeline uses Procrustes alignment to rotate BAV vectors into the social subspace via overlapping anchor brands, then UMAP projects everything into 2D. The system maps the ideas behind the words, not just the text — “Luxury,” “Premium,” and “Prestigious” land in the same neighbourhood.

    💡 Why this architecture isn’t obvious. A naive approach would be: embed everything → UMAP → cluster → done. The catch is that embedding models encode how something is said as much as what is said. A BAV narrative generated from “Helpful (45.2), Reliable (38.1)…” reads nothing like a TikTok perception report, even when both describe the same brand. Without the Procrustes step between “Ingest” and “UMAP,” the map would split by text style rather than brand perception. The arrows in this diagram look linear, but getting Step 3 (Aggregate) right took more iteration than every other step combined.

    2. Setup & running

    Full install and deploy instructions are maintained in the README. Below is a summary for orientation.

    Public (GitHub):

    | Mode | Description | API keys required? |
    | --- | --- | --- |
    | Default | Run locally with the provided toy dataset (20 brands, 681 reports) or your own dataset in the same CSV format | No |
    | Advanced | Plug in your own API keys for LLM clustering and embedding models | Yes |

    Key dependencies: Python 3.12, lancedb, pandas, numpy, google-cloud-aiplatform, google-genai, tqdm, umap-learn, hdbscan, scikit-learn, streamlit

    ⚠️ Gotchas for newcomers:

    • Vertex AI quotas. The embed step hits gemini-embedding-001 in batches of 50. On a fresh Google Cloud Platform (GCP) project you may get rate-limited at ~60 requests/min. The pipeline handles retries, but if you see 429 errors, check your Vertex AI quota dashboard and request an increase before re-running.
    • UMAP is not deterministic. Runs with the same data can produce slightly different 2D layouts unless you pin random_state. The pipeline does pin it, but if you fork and forget, your clusters will shift between runs. However, relative distances between points do not change, so interpretation will remain the same.
    • LanceDB lock files. If a previous run crashed mid-write, LanceDB may leave a .lock file that blocks the next run. Delete *.lock files in the LanceDB directory if the pipeline hangs on startup.
    • uv sync vs pip install. The project uses uv for dependency management. If you install via pip instead, hdbscan and umap-learn can pull conflicting numpy versions. Stick with uv sync.

    3. Data model

    | Table / Entity | Description |
    | --- | --- |
    | brands | Master list of 200+ brands (internal) or 20 brands (toy dataset) with metadata (industry, country) |
    | perception_signals | One row per (brand × sensor) — raw text summaries from each source |
    | embeddings | Semantic vector per perception signal — gemini-embedding-001, 768 dimensions |
    | umap_projections | 2D coordinates per embedding after UMAP reduction |
    | clusters | Cluster ID, 3-word label (e.g. “Hope Innovation Compassion”), and member brands |
    | bav_attributes | 48 BAV imagery attribute scores per brand × 12 audience segments |
    | consistency_scores | Omnichannel consistency % per brand (mean distance to centroid across sensors) |

    The data model above describes what lives in LanceDB after the pipeline runs. Each table feeds a different part of the dashboard UI. The key thing to understand: a single brand has multiple rows across these tables (one per sensor, one per audience segment, etc.). The perception map plots one dot per row in umap_projections, not one dot per brand.

    Toy dataset format (CSV, used by public GitHub default mode):

    | Column | Description |
    | --- | --- |
    | Brand | Brand name (e.g. Aetherium, Zenith Dynamics) |
    | Industry | Industry category (Technology, Automotive, Food & Beverage, Retail, Healthcare, Finance, Entertainment) |
    | Platform | Source sensor: TikTok (Brand Known), TikTok (Brand Unknown), Instagram (Brand Known), Instagram (Brand Unknown), Wikipedia, LLM, Survey |
    | Survey_Audience | Demographic segment for Survey rows (e.g. Tech Early Adopters, Gen Z – Gamers); N/A for non-survey |
    | Brand_Perception_Report | Free-text perception summary |
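
    For orientation, a hypothetical way to produce a file in this schema (the brand names come from the toy dataset above; the report text is invented for illustration):

    ```python
    import pandas as pd

    # Hypothetical rows in the toy-dataset schema; report text is invented.
    rows = [
        {"Brand": "Aetherium", "Industry": "Technology",
         "Platform": "TikTok (Brand Known)", "Survey_Audience": "N/A",
         "Brand_Perception_Report": "Seen as a sleek, futuristic gadget brand "
                                    "with a playful creator community."},
        {"Brand": "Aetherium", "Industry": "Technology",
         "Platform": "Survey", "Survey_Audience": "Tech Early Adopters",
         "Brand_Perception_Report": "Perceived as innovative and intelligent, "
                                    "though pricey for mainstream buyers."},
    ]
    pd.DataFrame(rows).to_csv("my_brands.csv", index=False)  # drop into the data directory
    ```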

    4. Pipeline / workflow

    The table below is the quick-reference version. Commentary after it explains what’s actually happening at each stage and where things can go wrong.

    | Phase | Key numbers |
    | --- | --- |
    | 1. Preprocess | Instant runtime; merges into 1 unified DataFrame |
    | 2. Ingest (Embed) | gemini-embedding-001, 768 dims, batches of 50 via Vertex APIs |
    | 3. Aggregate | Procrustes rotation on anchor brands; UMAP n_neighbors=min(15, len-1), min_dist=0.1; <5 sec |
    | 4. Cluster | HDBSCAN (min_cluster_size=3/5); several minutes via Gemini Batch labelling queue |
    | 5. Consistency | max(0.0, min(100.0, 100.0 - (mean_dist_to_centroid * 35.0))) |
    | 6. BAV join | Procrustes alignment via rotation matrix on overlapping anchor brands |
    | 7. Atlas UI | Streamlit (instant runtime) |

    Phase-by-phase commentary

    Phase 1 — Preprocess. Deceptively simple: merge CSVs, normalise brand names, filter junk. The hidden complexity is name matching. BAV uses official corporate names, social media uses colloquial names, and Wikipedia uses yet another variant. preprocess.py maintains a manual alias map for this. If you add a new brand and it doesn’t appear on the map, check the alias map first — it’s almost always a name mismatch.

    Phase 2 — Ingest (embed). Each Brand_Perception_Report text gets turned into a 768-dimensional vector via gemini-embedding-001. The critical thing to understand: these vectors encode writing style as much as meaning. A BAV report that says “Helpful (45.2), Reliable (38.1)” and a TikTok report that says “this brand gives cozy reliable vibes” will land in different regions of embedding space even though they describe similar perceptions. This is the root cause of the domain shift problem solved in Phase 3.
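
    A minimal sketch of what this step might look like with the google-genai SDK. The real batching and retry logic lives in ingest.py; the client configuration here (project, location) is a placeholder assumption, not the project’s code:

    ```python
    from google import genai
    from google.genai import types

    # Placeholder project/location; assumes Vertex AI credentials are configured.
    client = genai.Client(vertexai=True, project="my-gcp-project", location="us-central1")

    def embed_reports(texts: list[str], batch_size: int = 50) -> list[list[float]]:
        """Embed perception reports in batches of 50, as the pipeline table states."""
        vectors: list[list[float]] = []
        for i in range(0, len(texts), batch_size):
            response = client.models.embed_content(
                model="gemini-embedding-001",
                contents=texts[i : i + batch_size],
                config=types.EmbedContentConfig(output_dimensionality=768),
            )
            vectors.extend(e.values for e in response.embeddings)
        return vectors
    ```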

    Phase 3 — Aggregate (the hard one). This is where most of the iteration happened. Three things occur in sequence:

    1. Social aggregation: Multiple post-level embeddings per (Brand × Platform) are mean-averaged into a single vector, and Gemini generates a summary report. This smoothing pulls social vectors toward a shared centroid.
    2. Procrustes alignment: The BAV vectors are rotated into the social embedding subspace using 202 shared anchor brands (see Section 6 for the full story).
    3. UMAP projection: The combined, aligned vectors are reduced to 2D. Only the 'All Adults' BAV slice is fitted alongside social platforms — this prevents the 12 BAV demographic segments from dominating the topology.

    🔬 Why “balanced subset fit” matters. BAV has 12 audience segments per brand. Social has ~1–4 data points per brand. Without balancing, UMAP sees 12× more BAV points and builds its neighbourhood graph around BAV structure, marginalising social data. The fix: fit UMAP on the balanced subset (All Adults + social), then transform the remaining 11 BAV segments passively. This was a non-obvious but critical design choice.
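
    A sketch of the balanced-subset fit under the parameters stated in this post; the arrays are random stand-ins for the aligned embeddings:

    ```python
    import numpy as np
    import umap

    # Stand-ins: 'All Adults' BAV + social vectors vs the other 11 segments.
    balanced = np.random.rand(120, 768)
    remaining_bav_segments = np.random.rand(1100, 768)

    reducer = umap.UMAP(
        n_neighbors=min(15, len(balanced) - 1),  # as in the pipeline table
        min_dist=0.1,
        n_components=2,
        metric="cosine",
        random_state=42,  # pinned so layouts are reproducible (see the gotchas above)
    )
    coords_balanced = reducer.fit_transform(balanced)            # builds the topology
    coords_segments = reducer.transform(remaining_bav_segments)  # passive transform, no refit
    ```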

    Phase 4 — Cluster. HDBSCAN groups nearby points into perception themes, then Gemini labels each cluster with exactly 3 words. min_cluster_size is set to 3 (toy dataset) or 5 (full dataset). The batch labelling step submits each cluster’s centroid + top 5 most similar reports to Gemini. This can take several minutes because it goes through the Vertex AI Batch queue — don’t assume it hung.
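
    The clustering call itself is small. A sketch under the stated settings, with stand-in coordinates in place of the real UMAP output:

    ```python
    import numpy as np
    import hdbscan

    coords_2d = np.random.rand(200, 2)  # stand-in for the UMAP projection
    clusterer = hdbscan.HDBSCAN(min_cluster_size=5)  # 3 for the toy dataset
    labels = clusterer.fit_predict(coords_2d)  # label -1 marks noise points that join no theme
    ```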

    Phase 5 — Consistency. A simple but effective metric: for each brand, compute the mean Euclidean distance from each sensor’s point to the brand’s centroid, then invert and scale. Brands where all sensors agree (Brand D, Brand E) score 99%+. Brands with platform-dependent perception (Brand G) score much lower. The * 35.0 scaling factor was empirically tuned to spread scores across a useful range.
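
    A direct transcription of that formula into code; sensor_points stands in for a brand’s per-sensor map coordinates:

    ```python
    import numpy as np

    def omnichannel_consistency(sensor_points: np.ndarray) -> float:
        """Score cross-sensor agreement per the pipeline-table formula:
        100 - (mean Euclidean distance to the brand centroid * 35), clamped to 0-100."""
        centroid = sensor_points.mean(axis=0)
        mean_dist = np.linalg.norm(sensor_points - centroid, axis=1).mean()
        return max(0.0, min(100.0, 100.0 - mean_dist * 35.0))

    # Tightly clustered sensors score near 100%; dispersed ones drop fast.
    print(omnichannel_consistency(np.array([[0.0, 0.0], [0.01, 0.01], [0.02, 0.0]])))
    ```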

    Phase 6 — BAV join. Brings in the raw 48-attribute BAV scores per demographic segment. These are the structured numbers (not the Gemini-generated prose) and power the “Survey Audience” filter in the dashboard.

    Phase 7 — Atlas UI. Streamlit renders everything from LanceDB. Instant startup because all computation was done in previous phases.

    5. Atlas interface (primary interactions)

    • Focus brand selector → perception map + sidebar with cluster label, per-sensor summaries
    • Survey audience filter (BAV) → 12 demographic segments
    • Number of neighbours slider → controls perceptual neighbours on the map
    • Reference platform selector → changes cross-modal overlap anchor (Wikipedia, BAV, etc.)
    • Competitor set toggle → show unexpected neighbours (out-of-industry brands)

    🎯 What to look for when using the Atlas. The most interesting insights come from disagreements between sensors. If a brand’s BAV dot and TikTok dot are far apart, that’s a signal: the structured survey perception (what people say when asked directly) differs from the organic social perception (what people actually talk about). The Brand vs Content Effect tab adds another layer — when you hide the brand name from social content, does the perception shift? If so, the brand’s reputation is doing heavy lifting independent of the product itself.

    6. Domain-specific mechanics

    6.1 Why BAV is “ground truth”

    Survey-based, 48 structured imagery attributes, 30+ years of longitudinal data, 12 demographic segments. BAV captures deep-seated beliefs shielded from daily social flux: social algorithms change daily, while trait grids collected over decades reflect stable cognitive associations rather than the current hype cycle.

    📖 For non-specialists: BAV (Brand Asset Valuator®) is one of the largest brand research databases in the world, maintained by WPP. It works by asking thousands of consumers to rate brands on 48 specific attributes — things like “Helpful,” “Innovative,” “Trustworthy” — scored on a numeric scale. Because the same questions are asked year after year across demographic segments, BAV gives you a stable, structured snapshot of how people think about a brand when prompted. Social media gives you what people spontaneously say. These are fundamentally different signals, and combining them is the core challenge of this project.

    6.2 The BAV alignment problem – a full account

    This section documents the central technical challenge of the Atlas and the iterative process that solved it.

    The problem

    When the Atlas was first built, the UMAP perception map split cleanly down the middle: all BAV dots on the left, all social dots on the right, regardless of whether they described the exact same brand. Brand A’s BAV point and Brand A’s TikTok point would be in completely different regions of the map. This made the entire visualisation useless for cross-platform comparison.

    The separation was not evidence that BAV survey data captures genuinely different brand perceptions from social media. It was a methodological artefact caused by two compounding issues in the pipeline.

    Root cause: different text domains fed to the same embedding model

    All platforms are embedded with the same model (gemini-embedding-001), but the text being embedded is fundamentally different in style, vocabulary, and structure:

    | Platform | Brand_Perception_Report content | Source |
    | --- | --- | --- |
    | BAV | LLM-generated narrative from 48 numerical imagery sensors, e.g. “Helpful (45.2), Reliable (38.1)…” → Gemini prompt → prose paragraph | analysis.py, preprocess.py |
    | TikTok / Instagram | LLM-generated perception report from watching a single video/post | preprocess.py |
    | Wikipedia | LLM-generated perception from Wikipedia article text | preprocess.py |
    | LLM (Gemini) | Direct LLM perception (Gemini asked “what do you think of brand X?”) | Same as Wiki source file |

    The BAV text originates from a double LLM transformation: raw survey numbers → generate_semantic_statement() (structured string like “Full BAV Imagery Profile (48 Sensors): Helpful (45.2), Reliable (38.1), …”) → Gemini prompt → narrative paragraph. The social text comes from a single LLM step interpreting raw video/post content directly.

    This means the embedding model sees completely different linguistic distributions for BAV vs social. The BAV narratives share a common templated style (always referencing “imagery sensors,” “quantitative data,” survey language) while social narratives use informal, media-oriented language. Embedding models encode how something is said as much as what is said, so this systematic style difference pushes BAV vectors into a distinct cluster regardless of actual brand perception agreement.
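
    A guess at the shape of that first transformation. The real generate_semantic_statement() lives in analysis.py, so treat this as an illustration of the quoted output format only:

    ```python
    def generate_semantic_statement(attribute_scores: dict[str, float]) -> str:
        """Flatten BAV imagery scores into the templated string that is then
        sent to Gemini for narrative generation (format as quoted above)."""
        scores = ", ".join(f"{name} ({value:.1f})" for name, value in attribute_scores.items())
        return f"Full BAV Imagery Profile (48 Sensors): {scores}"

    print(generate_semantic_statement({"Helpful": 45.2, "Reliable": 38.1}))
    # -> Full BAV Imagery Profile (48 Sensors): Helpful (45.2), Reliable (38.1)
    ```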

    Compounding factor 1: aggregation asymmetry

    In aggregate.py, social data (TikTok, Instagram) has multiple post-level embeddings that are mean-averaged per (Brand, Platform), and a Gemini summary replaces the report text. This smoothing pulls social vectors toward a shared centroid. BAV / Wikipedia / Gemini data (df_research) passes through as-is with post_count = 1 — no averaging occurs.

    The result: social embeddings are inherently more “central” (mean-regression effect), while BAV embeddings retain their full individual variance. When UMAP runs on this combined set, the social vectors cluster tighter and the un-averaged BAV vectors spread out differently.

    Compounding factor 2: UMAP sees the domain gap

    UMAP is run on the entire combined dataset with n_neighbors=15. Because BAV embeddings share a systematic style signature different from social embeddings, UMAP's neighbourhood graph naturally groups them apart — it finds the text-style cluster, not a genuine perception cluster.

    What we tried (in order)

    🔬 Option A — normalise the text domain (first attempt). Rewrote the BAV perception report generation prompt in analysis.py to produce output that mimics the style of a social/Wikipedia perception report. Specifically: removed references to “BAV,” “imagery sensors,” “quantitative data” from the prompt. Used the same persona/format instructions as the social platform reports — describing what the brand feels like rather than referencing the data source. The prompt became: “Based on the following consumer perception data, write a concise paragraph describing how this brand is perceived by consumers. Focus on: overall vibe, what people praise, what people criticise, and who the typical customer is. Write as if describing public perception — do not reference the data source or format.”

    Result: Reduced but did not eliminate the BAV/social separation. The templated numerical origin still leaked through in subtle ways.

    Option B — Procrustes alignment (the solution). Use scipy.linalg.orthogonal_procrustes to align the BAV embedding subspace to the social subspace before combining. This preserves within-platform structure while removing the cross-platform domain shift. This is what the pipeline uses today.

    Option C (embed a standardised perception schema across all platforms) remains a potential future improvement but requires significantly more work.

    Option D — Per-platform z-score normalisation (ruled out). Apply per-platform z-score normalisation to embedding vectors before UMAP, centring each platform’s distribution to zero-mean and unit-variance. This would remove the systematic offset but also mask any genuine platform-level differences — making it a workaround, not a proper fix. Discarded.

    How Procrustes alignment works in the pipeline

    📖 For non-specialists: Procrustes alignment is a mathematical technique from shape analysis, named after a figure in Greek mythology who stretched or cut people to fit his bed. In our context, it “stretches” one cloud of data points to best overlap with another. Critically, it only uses rotation (spinning) and scaling — it doesn’t distort the internal relationships between points within each cloud. So the relative positions of BAV brands among themselves are preserved, but the entire BAV cloud is repositioned to overlap with the social cloud.

    Here’s exactly what the code in aggregate.py does; a simplified sketch follows the list:

    1. Finds the anchors. Identifies every brand that exists in both the BAV data and the social data (e.g., Brand A exists in both). In the production pipeline log, this found 202 anchor brands.
    2. Computes the transformation. For each anchor brand, computes the mean social vector across all its social platforms, centres both the BAV anchor vectors and the social anchor vectors, and then uses scipy.linalg.orthogonal_procrustes to find the optimal rotation matrix R that maps BAV → social.
    3. Applies the rotation. Multiplies all BAV vectors (even brands that didn’t have social data) by this rotation matrix, moving the entire BAV dataset into the social media spatial domain.
    4. Logs quality. Reports the Frobenius residual so alignment quality is traceable.
    5. Safeguard. If fewer than 10 shared brands exist, alignment is skipped with a warning — Procrustes is unreliable with too few anchors.
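
    A condensed sketch of those five steps (the real logic lives in aggregate.py; array shapes and the exact centring scheme are assumptions):

    ```python
    import numpy as np
    from scipy.linalg import orthogonal_procrustes

    def align_bav_to_social(bav_anchors, social_anchors, bav_all):
        # bav_anchors / social_anchors: (n_anchors, 768) vectors for shared brands,
        # where each social anchor is that brand's mean vector across social platforms.
        # bav_all: every BAV vector, including brands with no social data.
        if len(bav_anchors) < 10:  # safeguard: Procrustes is unreliable with few anchors
            print("WARN: fewer than 10 shared brands; skipping alignment")
            return bav_all
        bav_mean = bav_anchors.mean(axis=0)
        social_mean = social_anchors.mean(axis=0)
        # Optimal rotation R mapping centred BAV anchors onto centred social anchors.
        R, _ = orthogonal_procrustes(bav_anchors - bav_mean, social_anchors - social_mean)
        # Frobenius residual: how well the anchor clouds overlap after rotation.
        residual = np.linalg.norm((bav_anchors - bav_mean) @ R - (social_anchors - social_mean))
        print(f"Procrustes residual: {residual:.3f}")
        # Rotate the entire BAV set (not just anchors) into the social subspace.
        return (bav_all - bav_mean) @ R + social_mean
    ```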

    How this changes interpretation of the dashboard

    This alignment profoundly upgrades what you can conclude from the perception map:

    • True cross-platform comparisons. If a BAV dot and a TikTok dot for “Brand B” sit right next to each other, it now genuinely implies that the core sentiment in the structured survey data closely matches the organic social conversations. Before Procrustes, proximity between BAV and social points was meaningless.
    • Distances have semantic meaning. By forcibly removing the structural domain shift, any remaining distance between two points is entirely due to a difference in meaning and perception. If the BAV point for a brand is far from its Instagram point, you can confidently analyse that gap as a genuine difference in audience perception or marketing strategy — not just an artefact of different text formatting.
    • Unified clustering. When HDBSCAN runs over this aligned space, it can finally cluster BAV reports together with social reports. The LLM-generated theme labels now encompass insights drawn from both quantitative surveys and viral videos simultaneously.

    6.3 UMAP parameter sensitivity

    min_dist=0.1, n_components=2, metric='cosine'. Balanced subset fit: only 'All Adults' BAV slice is fitted alongside social platforms, preventing 12 BAV audiences from overpowering topology. UMAP was chosen over t-SNE because it enables saving the reducer object — the pipeline strictly fits on one balanced subset and passively transforms newly injected demographics.

    ⚠️ Watch out: Changing min_dist has outsized effects on the map. Lower values (e.g., 0.01) create tighter, more separated clusters — visually dramatic but can split genuinely related brands. Higher values (e.g., 0.5) spread everything into a uniform blob. The current 0.1 was chosen as a balance after visual inspection across multiple brand sets. If you change it, re-check whether brands with known perceptual similarity (e.g., brands in the same industry) still land in the same neighbourhood.
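
    One way to sanity-check a min_dist change before committing to it, using stand-in vectors and saving each layout for side-by-side inspection:

    ```python
    import numpy as np
    import umap

    vectors = np.random.rand(120, 768)  # stand-in for the aligned embeddings
    for min_dist in (0.01, 0.1, 0.5):
        layout = umap.UMAP(n_neighbors=15, min_dist=min_dist, n_components=2,
                           metric="cosine", random_state=42).fit_transform(vectors)
        np.save(f"layout_min_dist_{min_dist}.npy", layout)  # compare cluster separation visually
    ```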

    6.4 Cluster labelling

    Automated via LLM. Vertex AI Batch submits each cluster’s centroid + top 5 reports (by cosine similarity) to Gemini, which returns exactly 3 words. The 3-word constraint forces abstraction — “Hope Innovation Compassion” rather than a paragraph. If labels feel wrong, the issue is almost always that the cluster itself is incoherent (check HDBSCAN's min_cluster_size), not that the LLM mislabelled it.

    6.5 Omnichannel consistency

    100.0 - (mean Euclidean distance * 35.0), clamped 0–100%. Tight overlaps hit 99%+ (Brand D, Brand E), dispersed shifts drop fast (Brand G). The * 35.0 multiplier is an empirically tuned scaling factor — if you add new sensors or change the embedding model, you may need to recalibrate it so scores distribute meaningfully across 0–100%.

    6.6 Content vs brand effect methodology

    Tracks shift_2d (Euclidean magnitude), cos_shift (cosine diff between brand known/brand unknown), and BAV baseline deltas (bav_delta_known vs bav_delta_unk). Parses exact LLM cluster words added/lost due to brand awareness.

    💡 Why this matters. Social media perception reports are generated from video/post content. When the brand name is visible, the LLM’s perception is coloured by everything it “knows” about that brand. When the brand name is hidden, the LLM can only react to what it actually sees in the content. The delta between these two tells you how much of a brand’s social perception is driven by brand reputation vs actual content quality. Large shifts indicate the brand name is doing heavy lifting.
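
    A sketch of the two shift metrics named above. The variable names mirror the post, but defining cos_shift as one minus the cosine similarity is an assumption:

    ```python
    import numpy as np

    def brand_effect(known_vec, unknown_vec, known_xy, unknown_xy):
        # known_*: embedding / 2D map point when the brand name is visible;
        # unknown_*: the same content with the brand name hidden.
        shift_2d = float(np.linalg.norm(np.asarray(known_xy) - np.asarray(unknown_xy)))
        cos = float(np.dot(known_vec, unknown_vec)
                    / (np.linalg.norm(known_vec) * np.linalg.norm(unknown_vec)))
        cos_shift = 1.0 - cos  # 0 = identical perception; larger = reputation doing more work
        return shift_2d, cos_shift
    ```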

    7. Module map

    The codebase is intentionally small. Every module does one thing. If you’re debugging, start by identifying which phase failed (check the CLI output), then go straight to the corresponding file.

    brand_perception/dashboard/atlas_pipeline/
    ├── main.py              (43 lines)    CLI entry point
    ├── dashboard_v1.py      (~1133 lines) Core Streamlit frontend
    └── src/pipeline/
        ├── preprocess.py    (130 lines)   Sanitizes, normalises, filters into LanceDB schemas
        ├── ingest.py        (126 lines)   GenAI models → 768-D embeddings
        ├── aggregate.py     (303 lines)   Procrustes, LLM reports, UMAP layouts
        └── cluster.py       (244 lines)   HDBSCAN groups + batch cluster labels
    

    Total: ~1,979 lines across 6 modules.

    📖 Where the complexity lives. The line counts understate the complexity. aggregate.py at 303 lines is where 80% of the intellectual difficulty sits — it handles social aggregation, Procrustes alignment, and UMAP projection. dashboard_v1.py at ~1,133 lines is the largest file but is mostly Streamlit layout code. If you’re onboarding, read aggregate.py first; it’s where the science meets the engineering.

    8. Test coverage

    ~6 functional / integration tests. Runtime <1 minute.

    | Test file | Purpose | Count |
    | --- | --- | --- |
    | scripts/test_bav_pipeline.py | BAV baseline ingestion flow | 1 |
    | dev/test_apify* | Social Scraper (TikTok/Instagram) | 2 |
    | brand_perception/api/test_agent.py | Job Orchestrator backend queues | 2 |
    | research/test_scrape_jh.py | Manual methodology mockups | 1 |

    9. Design decisions

    These are the answers to questions that came up during development where the wrong choice would have broken the system.

    | ID | Decision | Rationale | What happens if you reverse it |
    | --- | --- | --- | --- |
    | DD-BPA-1 | UMAP over t-SNE | Better distance proportionality; enables saving the reducer to fit on a balanced subset and transform new demographics | t-SNE can’t transform new points — you’d have to re-run the entire projection every time a new BAV demographic segment is added, and distances between clusters become meaningless |
    | DD-BPA-2 | BAV as ground-truth anchor | Survey-based trait grids over decades, shielded from daily social hype | Using social as ground truth would anchor the map to volatile, algorithm-dependent signals — the map would shift with every TikTok trend cycle |
    | DD-BPA-3 | Semantic embeddings over keywords | Captures meaning (“Luxury” ≈ “Premium” ≈ “Prestigious”) | Keyword-based approaches treat “Luxury” and “Premium” as unrelated tokens — brands described with different vocabulary but identical perception would never cluster together |
    | DD-BPA-4 | Procrustes alignment | Solves text heterogeneity (surveys vs social) via rotation on anchor brands. Prompt normalisation alone (Option A) reduced but did not eliminate the domain gap | Without it, the map splits by text style (BAV left, social right) rather than by brand perception — see Section 6.2 for the full account |
    | DD-BPA-5 | Local-first public release (no API keys required) | Lowers barrier to entry; toy dataset enables immediate exploration without infrastructure | Requiring API keys upfront would prevent most people from ever trying the tool — the toy dataset lets someone see the full Atlas UI in under 5 minutes |

    10. Extending the system

    The pipeline was designed to be extended; each of the following is a realistic next step, listed in order of effort.

    1. Run with custom data — Format your own dataset as CSV matching the toy dataset schema (Brand, Industry, Platform, Survey_Audience, Brand_Perception_Report), drop into data directory, run default mode. This is the zero-effort way to test the Atlas on a new domain.
    2. Add a sensor — Collect as CSV/Parquet, import in preprocess.py under FINAL_COLUMN_ORDER (Super_Platform, Year, Brand, Raw_Text), run main.py. The Procrustes alignment will automatically include the new sensor in its anchor calculation if the new sensor shares brands with existing sources.
    3. Add a market — Reroll BAV datasets in ./paths, override GCS env vars, rerun batches. Note that Procrustes alignment quality depends on having enough shared anchor brands between BAV and social data — if you enter a market where BAV coverage is thin, check the Frobenius residual in the logs.
    4. Temporal tracking — LanceDB already stores Year and BAV_Study; add a slider toggle in dashboard_v1.py. This would let you see how a brand’s perception drifts over time across sensors — one of the most requested features.

    ⚠️ If you add a new sensor: remember that the Procrustes alignment currently rotates BAV into the social subspace specifically. If your new sensor has a similarly distinct text style (e.g., Reddit comments vs TikTok captions), you may see a new domain gap. In that case, consider extending the alignment step to handle multiple source-target pairs, or grouping sensors into “formal” and “informal” categories for alignment.

    11. Results (end-to-end validation)

    • Internal coverage: 200+ brands, 4,000+ data points across 5 modalities, 12 demographic segments
    • Public toy dataset: 20 brands, 681 perception reports across 7 industries, all 5 sensor types + brand known/brand unknown variants
    • Cross-industry insight validation:
      • Omnichannel consistency: Brand D, Brand E at 99%+; Brand G identified as multi-faceted
      • Shared equity, different vibe: Brand H ↔ Brand G (close on BAV, far on socials)
      • Different equity, shared vibe: Brand K ↔ Brand L (far on BAV, converged on socials)
    • Validation metrics: Procrustes Residuals for subspace overlap + brand known/brand unknown cosine similarity differentials

    💡 How to read these results. The “shared equity, different vibe” and “different equity, shared vibe” patterns are the most commercially interesting findings. They reveal cases where a brand’s formal positioning (BAV) disagrees with its organic social presence — exactly the kind of insight that’s invisible to either data source alone. The Atlas’s value proposition is making these cross-modal disagreements visible and quantifiable.


    🔗 Repositories

    GitHub: (GitHub repo coming soon)