Inside the MemPalace: Does the structure earn its keep?

MemPalace is an open-source, local-first AI memory system that went viral after actress Milla Jovovich released it on GitHub and shared it on her personal accounts (Milla’s Insta reel). We wanted to see what’s behind the hype.

In typical chats, an AI forgets anything that doesn’t fit in its context window. A common workaround is to summarize past conversations and index the summaries in a flat vector database (flat meaning there’s no hierarchy). The problem is that summarization results in loss of information. The specific names, numbers and offhand preferences might get eliminated, so when you later reference one of those details the system has nothing concrete to retrieve so you have to re-explain. To combat this MemPalace stores the exact conversation and every project file verbatim, so the original wording is always available to semantic search.

MemPalace ships with a built-in MCP server exposing 29 tools. Perhaps the most important tools are mempalace_search, that performs a search across the palace or in specific wings/rooms and mempalace_status that returns an overview of the entire palace.

The Palace hierarchy

The framework uses the Palace hierarchy, a metaphor from the ancient Greek method of loci. As an example imagine a freelance designer with two clients, a bakery called “Sweet Rise” and a fitness app called “FitLoop”. The Palace is the entire file system (all your stored memories).

Wings sit at the top level and represent major entities – a person, a project, or a domain:

Wing Sweet Rise – everything related to the bakery project
Wing FitLoop – everything related to the fitness app

Rooms live inside a wing and correspond to specific topics:

Wing Sweet Rise – Rooms: brand identity, packaging, website
Wing Fitloop – Rooms: onboarding flow, brand-identity, push-notifications

Halls are conceptual categories that describe how memories relate: facts, events, discoveries, preferences or advice:

Facts hall – storing things like “the primary brand color is #F4A261”
Preferences hall – “The client prefers Arial over Serif fonts”.
Decisions hall – “We decided to go for a CAD drawn logo on March 19”.

Tunnels (cross-wing connections) – i.e. both Sweet Rise and FitLoop have a brand-identity Room, so MemPalace would create a tunnel connecting the two so you could answer questions like “What are the major learnings regarding brand identity from all my projects”.

Drawers hold the original verbatim text chunks and serve as the primary retrieval unit. For example a Drawer inside the “Preferences” Hall of the Sweet Rise Wing would contain “Client said: ‘We absolutely hate Serif as a font. Don’t use it anywhere!’.”

Closets sit above drawers as an optional summary layer that points back to the underlying verbatim content. A closet of the above Drawer in the Sweet Rise Wing would say “Sweet rise has clearly stated it prefers non-corporate fonts.”.

How we tested it

We tested MemPalace on a public benchmark called RAGBench, specifically its HotPotQA test split. This benchmark contains a set of questions and the exact passages that contain the answer.

To make the dataset compatible with MemPalace, we saved it in the form of documents. Then we loaded the documents into MemPalace, asked every test question and checked how often the correct sentences showed up in MemPalace’s top 5 results.

To see whether MemPalace was actually doing something useful, we compared it against a simple vector-search baseline: a standard “find the most similar text by meaning” search (using a popular off the shelf embedding model called all-MiniLM-L6-v2) running over the exact same documents. Neither system used a language model at retrieval time, so the comparison is purely about how well each one finds the right information and how fast.

The results

The results suggest that MemPalace’s retrieval accuracy is probably overstated and that its hierarchy doesn’t seem to help, at least with the default settings. The project claims 96.6% Recall@5, but on RAGBench we measured only 83.8%. The plain vector search baseline scored 84.8% on the same data, beating MemPalace by 1% without any hierarchical structure at all.

In other words, the simplest thing you could possibly build (take every document, embed it with a small open source model and look up the closest matches) outperformed a system whose entire pitch is that hierarchy and structure make retrieval better. If a flat baseline wins on a standard benchmark, then the structure is either not pulling its weight or only pulling it on the specific instances or types of data MemPalace was developed against.

One important caveat is that we used MemPalace out of the box, with default settings and no per-dataset tuning. It is plausible that with a different chunking strategy, a stronger embedding model or hand tuned mining and retrieval parameters, the numbers would improve, possibly substantially.

But that is also true of the baseline and the point of an out of the box test is to see what a user actually gets when they pick up the project and try it. On that test, MemPalace did not beat a few line vector search script and the published 96.6% number did not hold up.

Author

Nikos Gkikizas

Nikos is a Staff Data Scientist at Satalia. He started off as a mining engineer and later specialized in explosives and blasting engineering. Through signal processing he was introduced to data science, with which he gradually fell in love with. Over the last decade he’s been working in Data Science, having developed various simulation, predictive and optimization solutions.

The Palace hierarchy

How we tested it

The results

Author

Comments

Leave a Reply Cancel reply

More posts

Agree. Transact. Verify.

Learning through experience: teaching the viral Hermes agent to automate our work

A 50-line python function outperformed every frontier LLM – With 100% accuracy

MiroFish: Is swarm intelligence worth the cloud bill?