Author: Nikos Gkikizas

  • Learning through experience: teaching the viral Hermes agent to automate our work

    Hermes

    What is Hermes?

    Hermes Agent is an open-source, self-hosted AI agent released under an MIT license by Nous Research, the lab behind the Hermes model family. Its main pitch is a built-in learning loop. Instead of resetting to zero every session, Hermes runs a post-execution review after each successful task, distills the steps that worked into a reusable, Markdown-defined “skill” and refines those skills the next time it hits a similar problem. It also keeps persistent memory across sessions, so it gradually builds a model of your projects and how you like things done, effectively learning through experience.

    Unlike a copilot tethered to an IDE, Hermes is meant to live on a server and run unattended – a $5 VPS, a GPU box or serverless infra that costs almost nothing when idle. It talks to hundreds of LLMs through the OpenAI-compatible interface, can communicate via Telegram, Discord, Slack, WhatsApp, Signal, email and a CLI, supports natural-language cron for scheduled jobs, and can spin up subagents to parallelize work. It also ships with 40+ built-in skills out of the box.

    What made it especially attractive to us is its ability to write and improve its own playbook. This ability made us think: could it learn how to do our job and fully automate the work we do at the Quick Reactions pod?

    Our pod’s mission is to pick up state-of-the-art AI tools, experiment with them, and write an honest, evidence-backed assessment.

    What makes our work tricky is the fact that every new tool we evaluate is different. We need to study it, figure out how to set it up, run it, and evaluate the results.

    Our 6-step evaluation protocol

    We begin by encoding our workflow into a strict, 6-step protocol that the agent can follow for every evaluation:

    1. Workspace initialization – spin up a clean, isolated project environment so each evaluation starts from a known state.
    2. Baseline replication – run the tool’s own “getting started” examples first, to confirm the headline claims reproduce before we push further.
    3. Rigorous verification – design and run additional autonomous tests that probe the tool under conditions its authors didn’t pick, rather than extrapolating from the happy path.
    4. Data synthesis & metrics control – measure the things that matter (recall, latency, accuracy) against ground truth data.
    5. Adversarial peer review – hand the findings to a separate agent powered by a powerful LLM (Claude Opus 4.8), to receive feedback and iterate.
    6. Finalization & delivery – format the assessment doc and ship it to the right channel (to our internal Notion knowledge base in our case).

    The idea was simple: if these are the steps a human pod member walks through, can an agent walk through them unattended – and would the writeup at the end be any good?

    The Skill system: how Hermes improves itself

    The first time we ran Hermes on a real task, we walked it through the 6-step protocol by hand. Instead of just following along, Hermes wrote each step down as its own skill – a short Markdown file it can pull up at the start of any future run. So the protocol stopped being something we had to repeatedly provide as input; it became something the agent already knows.

    Hermes used this knowledge to build a small set of skills covering everything we do: how to set up a clean workspace, how to test a new tool properly, and how to draft the write-up and run it past the reviewers. On top of those sits one master skill that holds the whole run together – it treats the 6 steps as a checklist and won’t let Hermes jump ahead before the previous step is actually done.
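
    To make the gating behaviour of that master skill concrete, here is a minimal sketch in Python. Hermes skills are Markdown files, so this is only an illustration of the checklist logic, not Hermes’ actual implementation; the class name, method names and step identifiers are all hypothetical.

```python
from dataclasses import dataclass, field

# The six protocol steps, in the order listed in the protocol above.
# The identifiers are illustrative, not names used by Hermes.
PROTOCOL_STEPS = [
    "workspace_initialization",
    "baseline_replication",
    "rigorous_verification",
    "data_synthesis_and_metrics",
    "adversarial_peer_review",
    "finalization_and_delivery",
]


@dataclass
class ProtocolChecklist:
    """Hypothetical gate: a step can only be completed once all earlier steps are done."""

    completed: list = field(default_factory=list)

    def next_step(self):
        """Return the next pending step, or None when the run is finished."""
        if len(self.completed) == len(PROTOCOL_STEPS):
            return None
        return PROTOCOL_STEPS[len(self.completed)]

    def complete(self, step):
        """Mark `step` as done; refuse if it is not the next step in order."""
        expected = self.next_step()
        if step != expected:
            raise ValueError(f"cannot complete {step!r}: next step is {expected!r}")
        self.completed.append(step)


checklist = ProtocolChecklist()
checklist.complete("workspace_initialization")
try:
    checklist.complete("adversarial_peer_review")  # jumping ahead is rejected
except ValueError as err:
    print(err)
```

    The point of the design is that ordering is enforced by state rather than by the agent’s good intentions: skipping ahead is an error, not a judgment call.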

    Impressively, when the LLM peer reviewer flagged something during an experiment (e.g. an unfair baseline, a missing caveat, a poorly designed experiment), Hermes would learn from the feedback and address the issue. It would also record its new learnings in the related skill files, so the next experiment started from a slightly stronger playbook.

    That’s the self-improvement loop the Hermes pitch promises, and we actually watched it happen. The more experiments we run, the better the skills get, and the less we need to babysit Hermes for the next one. The underlying LLM that powers Hermes (Gemini 3.1 Pro) isn’t getting smarter – its playbook is, and Hermes is the one rewriting it.

    Putting Hermes to the test

    We gave Hermes a single, real assignment: take a brand-new open-source tool called Turbovec – which claims to store huge amounts of data in a tiny amount of memory and search it faster than the popular alternative – and find out whether those claims actually hold up.

    We handed the agent the tool, its documentation, a bare cloud machine to work on, and nothing else: no starter code, no template, no outline. Hermes had to decide what and how to test, run the experiments, and write the whole thing up on its own.

    We reviewed Hermes’ output in the exact same way we would review a human colleague’s work, via four simple questions:

    • Did it manage to set up the tool and run experiments? Yes! It wasn’t all smooth sailing. Hermes’ first pass used the wrong settings. One of the integrations that Turbovec advertised also didn’t work on the first try – a common challenge we face in the Quick Reactions Pod. However, rather than getting blocked by these stumbles, the agent noticed them, fixed them and left a clear trail of what went wrong and how it was corrected – exactly the kind of thing a rushed human reviewer might quietly skip over.
    • Did it design a fair test? Yes! It first reproduced the tool authors’ own results, then set up an even-handed comparison against the leading alternative tool (FAISS) as a baseline. It was also careful to optimize the baseline tool’s configuration (rather than using a deliberately weak one), ensuring that the contest wasn’t rigged in Turbovec’s favor.
    • Did it get the numbers right? Yes, with a bit of extra AI help. For instance, its first attempt used a small data sample and then just assumed the results would scale up neatly. The adversarial peer-review step (step 5 in our 6-step protocol) caught that this assumption was unsafe. Hermes accepted the criticism and re-ran the full-size test. The adversarial reviewer turned out to be right – using the small sample would have significantly skewed the results.
    • Was the writeup appropriate? Yes, with a bit of extra AI help. Hermes’ original draft omitted some critical details and also inflated some of the findings. Thankfully, the LLM reviewer’s feedback in step 5 ensured that claims got toned down to what the data actually supported.

    So is it worth it?

    Very promising. Left alone with a new GitHub repo and a blank machine, Hermes handled the mechanical, time-consuming work on its own: it downloaded the repo, installed everything, and ran some initial tests to make sure everything runs.

    Even though it did stumble during the actual experimental design and assessment of the tool (Turbovec), the introduction of our second Reviewer agent was enough to address these issues and deliver, with no human in the loop.

    Obviously this is just a single – albeit very promising – piece of evidence. We will keep pushing the limits of Hermes in the context of our pod’s work, with the intent to automate and scale up our assessment efforts as much as possible.

    Another angle we plan to explore is cost minimization. This first experiment showed the effectiveness of the iterative, dual-agent architecture:

    • An affordable Gemini-powered Hermes to actually do the heavy lifting (open-ended, token-heavy)
    • A more expensive Opus-powered Reviewer to review the report after each iteration and provide feedback (single-shot, token-lean)

    A key question – and one that we keep facing in WPP Research – is: what is the cheapest LLM brain that we could use for each agent while maintaining output quality?

  • A 50-line python function outperformed every frontier LLM – With 100% accuracy

    The experimental setup

    We created a simple framework for testing an LLM’s reasoning capacity in a multi-step scenario. It comprised an engine that creates consistent logical rules. For example, a rule could be: If the given number is divisible by 14, add 231 to it and pass it to rule 100. Otherwise, create a new number by adding the digits of the given number and pass it to rule 31. We also created a deterministic way of parsing and iterating through rules, using basic python programming.

    To understand what a run looks like, suppose that the random ruleset we created for a single trial consists of the following 5 rules:

    • Rule 1. If the given number is greater than 61, get the absolute value and pass it to rule 2. Otherwise get the absolute value and pass it to rule 1
    • Rule 2. If the given number is greater than 339, add 354 and pass it to rule 3. Otherwise your new value is the sum of digits ignoring sign and pass it to rule 4
    • Rule 3. If the given number is divisible by 274, subtract 274 and pass it to rule 5. Otherwise get the absolute value and pass it to rule 5
    • Rule 4. If the given number is greater than 431, multiply by 199 and pass it to rule 2. Otherwise your new value is the sum of digits ignoring sign and pass it to rule 2
    • Rule 5. If the given number is divisible by 110, get the absolute value and pass it to rule 1. Otherwise add 487 and pass it to rule 4

    Now suppose your initial value is 500 and you start from rule 2, for a total of 2 iterations.

    For the first iteration, rule 2 says that, if your value is greater than 339 (which is true), you must add 354 (result 854) and pass it to rule 3.

    For the second iteration, rule 3 says that, if the given number (854) is divisible by 274, then you need to subtract 274 and pass it to rule 5. Otherwise (which is our case since 854 is not divisible by 274), get the absolute value (854) and pass it to rule 5.

    Finally, we end up with a final value of 854.
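
    The walkthrough above can be reproduced with a few lines of Python. This is a minimal sketch, not the function used in the experiment: it assumes the rules have already been parsed into tuples, and the tuple layout and the names `RULES`, `OPS`, `CONDITIONS` and `run` are our own illustrative choices.

```python
def digit_sum(n):
    """Sum of the decimal digits of n, ignoring sign."""
    return sum(int(d) for d in str(abs(n)))


# Operations a rule can apply; the second argument is unused by abs and digit_sum.
OPS = {
    "abs": lambda v, _: abs(v),
    "add": lambda v, k: v + k,
    "sub": lambda v, k: v - k,
    "mul": lambda v, k: v * k,
    "digit_sum": lambda v, _: digit_sum(v),
}

# Conditions a rule can test.
CONDITIONS = {
    "gt": lambda v, k: v > k,
    "divisible": lambda v, k: v % k == 0,
}

# rule id -> (condition, threshold, branch if true, branch if false),
# where each branch is (operation, operand, next rule id).
RULES = {
    1: ("gt", 61, ("abs", None, 2), ("abs", None, 1)),
    2: ("gt", 339, ("add", 354, 3), ("digit_sum", None, 4)),
    3: ("divisible", 274, ("sub", 274, 5), ("abs", None, 5)),
    4: ("gt", 431, ("mul", 199, 2), ("digit_sum", None, 2)),
    5: ("divisible", 110, ("abs", None, 1), ("add", 487, 4)),
}


def run(rules, value, start_rule, iterations):
    """Apply `iterations` rules in sequence, starting at `start_rule`."""
    rule_id = start_rule
    for _ in range(iterations):
        cond, threshold, if_true, if_false = rules[rule_id]
        branch = if_true if CONDITIONS[cond](value, threshold) else if_false
        op, operand, next_rule = branch
        value = OPS[op](value, operand)
        rule_id = next_rule
    return value


print(run(RULES, 500, 2, 2))  # 854, matching the walkthrough above
```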

    Our full experimental design was as follows:

    • Generate N logical rules.
    • Pick a random rule to serve as the starting rule.
    • Sample a random number of iterations to perform, from 10 to 100. The task ends when all iterations are complete, at which point the current numerical value is reported.

    We repeated the above experiments for various values of N (randomly sampled between 10 and 10000).
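
    A trial generator along these lines might look as follows. This is a sketch under assumptions: the ranges for N and for the iteration count come from the text, but the set of conditions and operations, the operand range (1 to 500) and the dictionary layout are hypothetical choices made for illustration.

```python
import random

# Illustrative vocabularies; the real engine may support more variants.
CONDITIONS = ("greater_than", "divisible_by")
OPERATIONS = ("add", "subtract", "multiply", "absolute_value", "digit_sum")


def random_branch(rng, n_rules):
    """One outcome of a rule: an operation, its operand, and the rule to jump to."""
    return {
        "op": rng.choice(OPERATIONS),
        "operand": rng.randint(1, 500),  # assumed range; unused by absolute_value and digit_sum
        "next_rule": rng.randint(1, n_rules),
    }


def generate_trial(rng):
    """Sample one trial: N rules, a starting rule and an iteration count."""
    n_rules = rng.randint(10, 10_000)
    rules = {
        rule_id: {
            "condition": rng.choice(CONDITIONS),
            "threshold": rng.randint(1, 500),  # assumed range; >= 1 avoids division by zero
            "if_true": random_branch(rng, n_rules),
            "if_false": random_branch(rng, n_rules),
        }
        for rule_id in range(1, n_rules + 1)
    }
    return {
        "rules": rules,
        "start_rule": rng.randint(1, n_rules),
        "iterations": rng.randint(10, 100),
    }


trial = generate_trial(random.Random(0))
print(len(trial["rules"]), trial["start_rule"], trial["iterations"])
```

    Passing in a seeded `random.Random` keeps every trial reproducible, which matters when the same ruleset has to be scored against several models.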

    We then deterministically calculated the correct result and compared it to the response given by two LLM-based agents:

    • Normal Agent: No access to tools. The entire set of rules was provided to the agent during the first interaction, to hold in its context window.
    • Tool Agent: Given access to two tools: one for deterministically fetching a rule at a specific index (e.g. “go to rule 56”) and one giving it the ability to write and execute python code snippets.

    We used different LLMs as the brains for the above agents: opus 4.6 and sonnet 4.6 from Anthropic, gemini 2.5 pro and gemini 2.5 flash from Google, and deepseek-v4-pro from DeepSeek AI.

    62.7% accuracy – and that was the good arm

    The Tool Agent was on average, as expected, more accurate than the Normal one, with an accuracy of 62.7% versus 52.9%, respectively.

    The per-model breakdown is revealing. For the Normal agent, deepseek-v4-pro is a clear leader with an accuracy of 86.7%, higher even than the Tool agent with the same model as the LLM brain.

    Experimental results per model and experiment mode

    All the other models perform better when placed inside the Tool Agent. The largest gains are observed by gemini-2.5-flash, whose accuracy jumps from 26.7% to 66.7% (from Normal to Tool-based). The gains are much less noticeable in the rest of the models.

    In the agentic mode, both the total number of rules and the maximum number of iterations seem to be negatively correlated with the probability of the model producing a correct result (p-values of 0.05 and 0.003, respectively). More specifically, for every 100 additional rules in the trial, the probability of the LLM providing a correct result is reduced by ~7%, while for every additional iteration the probability is reduced by ~2%.

    Poor “reasoning” choices

    Perhaps the most interesting part of the experiment was diving into the reasoning logs of models in the agentic setup. There you can notice some strange reasoning patterns and some questionable choices, when it comes to tool calling.

    For example, here are some python evaluations that opus 4.6 executed. Some are pretty reasonable, like:

    • checking divisibility of large integers: {"eval":"39310614 // 2517"} or
    • safely summing the digits of a number: {"eval":"sum(int(d) for d in str(abs(1790)))"}

    Others are a bit weird, but you could still accept them as an overly safe practice, like:

    • subtracting a positive integer from 0: {"eval":"0 - 4478"} or
    • checking the maximum digit of a two digit number: {"eval":"max(int(d) for d in str(13))"}

    Unfortunately, many are nonsensical, like:

    • checking the result of dividing zero by any number: {"eval":"0 // 9699"}
    • checking the absolute value of a non-negative, single-digit integer: {"eval":"abs(0)"} and {"eval":"abs(1)"}
    • double-checking 0 added to any number: {"eval":"0 + 1790"}
    • getting the sum of digits of 0: {"eval":"sum(int(d) for d in str(abs(0)))"}

    Such nonsensical choices are of course not unique to opus-4.6. Here are some similar ones from the rest of the models:

    • gemini 2.5 pro: {"eval": "0 * 5227"}, {"eval": "0 * 8451"}, {"eval": "sum(int(d) for d in str(0))"}, {"eval": "abs(0)"}, {"eval": "0 * 439"}
    • sonnet 4.6: {"eval":"max(int(d) for d in str(abs(2)))"}, {"eval":"0 == 6204"},{"eval":"1 < 6866"}, {"eval":"1 == 9248"}, {"eval":"sum(1 for d in str(0) if int(d) % 2 == 0)"}
    • gemini 2.5 flash: {"eval": "min(int(digit) for digit in str(2))"}, {"eval": "0 * 451"}
    • deepseek-v4-pro: {"eval": "int(max(str(0)))"}, {"eval":"int(max(str(0)))"}, {"eval":"0+815"}, {"eval": "int(min(str(abs(3))))"}

    The logs are literally swamped with such choices, which are not a result of a prompt like “always use the tools to check your math”. On the contrary, the directive in the prompt was to call the python tool only “if you want to evaluate a short, one-line expression in python”.

    Why do LLMs ace everything except anything new?

    Understanding why LLMs fail at simple but novel tasks is very difficult, but the failure itself is consistent with what the literature suggests. The ARC-AGI-3 benchmark reveals that the highest-performing commercial LLMs succeed less than 1% of the time on novel tasks that the average human can easily complete.

    LLM-based systems are extremely efficient in semantically retrieving information, in a revolutionary way. That is why many people feel empowered when they first get their hands on tools like Claude Code or Codex. Using them, it’s now trivial to create a simple web page or a small app.

    However, the reason why this happens is likely more related to information retrieval than to genuine, innovative (out-of-distribution) “thinking”, despite what the news headlines suggest. In other words, whenever Claude, Codex or Antigravity prototypes a nice, working website, it’s highly likely that the code it produced, or most of it, already existed in a similar form in its training set.

    That becomes obvious after claims like the innovative kernel exploit Mythos uncovered, which turned out to be an exact copy of a Kerberos CVE written in 2007. In other words, the fact that something appears on the 15th page of Google Search, which makes it practically undiscoverable for the average researcher, doesn’t mean that it’s not useful for an LLM in generating a “novel” solution. Rediscovery due to inaccessible old sources is a well-studied phenomenon in science and LLMs can actually help reduce that.

    Don’t give your credit card to something that computes abs(0)

    First of all, providing unsupervised access to LLM-based (agentic) systems can be very dangerous. It’s not the wisest thing to grant full access to your laptop or credit card to something that needs to double-check the absolute value of 0 or the max digit of 1. The dangers of using LLMs in critical applications can be seen in various articles that demonstrate what can go wrong. Guardrails must be used to defend against the possibility of losing 2.5 years’ worth of customer data or permanently deleting your production database.

    Secondly, our results reinforce the fact that in many cases, there’s no need for an “agentic” solution. In our example, creating a python function that parses the rules and executes them took just a few minutes. The function has an average runtime of 100ms, whereas the average LLM solution took anywhere from 25s to more than a minute. More importantly, the traditional system had a 100% success rate vs the average 62.7% of the “agentic” mode. As for the cost, the average session used about 30k tokens, which at $5 per million tokens comes to about 15 cents per query, versus effectively nothing for the python function – so the agentic route is both far more expensive and far more error-prone.
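
    The cost and latency comparison is simple arithmetic, spelled out here for anyone who wants to plug in their own numbers. The inputs are the figures quoted above; the 60-second value is only a lower bound for “more than a minute”, and the variable names are ours.

```python
TOKENS_PER_SESSION = 30_000       # average session size quoted above
PRICE_PER_MILLION_TOKENS = 5.00   # USD, as quoted above

cost_per_query = TOKENS_PER_SESSION / 1_000_000 * PRICE_PER_MILLION_TOKENS
print(f"LLM cost per query: ${cost_per_query:.2f}")  # $0.15

FUNCTION_RUNTIME_S = 0.1          # 100 ms for the plain function
LLM_RUNTIME_RANGE_S = (25, 60)    # 60 s is a lower bound for the slow end

# How many times slower the LLM route is than the function, at each end of the range.
slowdown = [t / FUNCTION_RUNTIME_S for t in LLM_RUNTIME_RANGE_S]
print(f"LLM slowdown: {slowdown[0]:.0f}x to at least {slowdown[1]:.0f}x")
```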

  • MiroFish: Is swarm intelligence worth the cloud bill?

    What is MiroFish

    MiroFish is an open-source multi-agent simulation framework released on GitHub by BaiFu. The pitch is that a swarm of LLM-driven agents, each with its own persona, will outperform a single one-shot LLM call on open-ended questions where there is no obviously correct chain of reasoning. Agents are seeded from a free-text brief, given a generated domain ontology and made to debate across many rounds before the system synthesizes a final answer. We wanted to see whether the swarm justifies its complexity.

    The system ships as a Docker stack: a Python backend, a web frontend and a Zep Cloud integration for persistent agent memory. It talks to LLMs through the OpenAI-compatible chat-completions interface, which means almost any provider can be plugged in by changing two environment variables (we used Google’s Gemini).

    A screen capture of one of the experiments showing the entity taxonomy the tool created.

    A run has two stages. First, the user uploads a free-text seed and presses Start Engine. MiroFish parses the seed, asks the LLM to generate a domain ontology (entities, relationships, attributes etc.) and derives the number of agents from how many entities the ontology contains, so the swarm is sized to the problem rather than chosen by the user. Second, the user sets the number of debate rounds and runs the simulation: agents talk to each other under the ontology, reading from and writing to Zep. In the end, the system synthesizes a single answer in the requested format. The intended use cases are questions where many perspectives plausibly disagree but a single answer is required, like market and policy forecasting, strategic planning or qualitative research synthesis.

    Our test setup

    We wanted a clean, time-boxed evaluation with an objective ground truth, so we picked same-day S&P 500 prediction. Each morning before the 09:30 ET open we asked MiroFish two questions:

    • Q1: Will the S&P 500 close higher or lower than yesterday’s close?
    • Q2: Which five S&P 500 names will be the day’s top percentage gainers?

    Both questions are settled by the closing print six and a half hours later. Both are hard, and both let us compare the swarm against a single-shot LLM call on the same inputs. To keep the comparison clean, we ran a control arm in parallel: identical seed, identical prompt, sent to a single gemini-2.5-flash chat-completions call with no swarm, no memory, no tools.

    MiroFish does not browse the web or pull in data on its own – the seed is the only input the agents have to work with, so the creators say it should be a comprehensive, free-text brief covering everything relevant to the question you want the swarm to answer. Each morning we had to compile a ~4,000-character seed.txt summarizing the pre-open state of the world:

    • Prior session closes (S&P 500, Nasdaq, Dow, Russell 2000, VIX, 10Y yield, WTI and Brent crude, gold, Bitcoin)
    • The key drivers behind yesterday’s moves, according to the news
    • The macro overhang (US–Iran war, Trump–Xi summit, hot April CPI and PPI prints)
    • Today’s economic calendar
    • Overnight Asian and European action
    • US futures
    • Sentiment indicators (CNN Fear & Greed, AAII, Robinhood prediction-market pricing on ES strikes)
    • Notable single-name news

    The prompt asked the swarm to converge on exactly two lines:

    • DIRECTION: <UP or DOWN> against the previous close
    • TOP 5: <T1>, <T2>, <T3>, <T4>, <T5> of S&P 500 tickers expected to be the day’s largest percentage gainers, unordered.
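
    A strict two-line format like this is easy to check mechanically, which makes it obvious when an arm drifts into free-form commentary. Below is a minimal parsing sketch; the function name, regular expressions and error messages are our own and are not part of MiroFish.

```python
import re

# One pattern per required line; MULTILINE lets ^ and $ match at line boundaries.
DIRECTION_RE = re.compile(r"^DIRECTION:\s*(UP|DOWN)\b", re.MULTILINE)
TOP5_RE = re.compile(r"^TOP 5:\s*(.+)$", re.MULTILINE)


def parse_prediction(text):
    """Return (direction, tickers); raise ValueError if the answer is off-format."""
    direction = DIRECTION_RE.search(text)
    top5 = TOP5_RE.search(text)
    if direction is None or top5 is None:
        raise ValueError("answer does not contain both DIRECTION and TOP 5 lines")
    tickers = [t.strip().upper() for t in top5.group(1).split(",") if t.strip()]
    if len(tickers) != 5:
        raise ValueError(f"expected 5 tickers, got {len(tickers)}")
    return direction.group(1), tickers


print(parse_prediction("DIRECTION: DOWN\nTOP 5: NVDA, AMZN, META, GOOGL, MSFT"))
```

    Raising on a missing line, instead of returning a partial result, means an answer that replaces the ticker list with prose is recorded as a failure rather than silently scored on direction alone.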

    Our initial intention was to run the experiment for a few consecutive trading days in May 2026, giving us some independent direction calls and sets of five tickers from each arm. However, things did not go exactly as planned.

    Setting Up MiroFish

    Getting MiroFish to produce a single usable prediction turned out to be a multi-day debugging exercise. The published Docker image is stale – it ships an older build with a Chinese-only interface and no English option, so we had to rebuild it from source before we could even read the UI. Once we got past that, the system crashed on startup every time we uploaded our seed: MiroFish was built and tested against a specific Chinese LLM provider (Alibaba’s Qwen) and its LLM client uses a hardcoded max-tokens parameter that is too low for the ontology response Gemini produces. The output gets truncated mid-JSON, which naturally fails parsing. Fixing that required either bumping the token limit in the backend code or switching to a simpler, faster, low-reasoning model. We opted for the latter, using gemini-2.5-flash-lite, a rather old but good lightweight model.

    With the engine finally running, we discovered that you cannot choose how many agents participate in the simulation – the system decides for you based on how many entities it extracts from your seed text. We wanted 100 agents; we got around 25. The only way to get more is to stuff the seed with more names, which means you cannot separate “give the swarm more context” from “make the swarm bigger.” On top of that, the free tier of Zep Cloud – the memory service the agents depend on – ran out of quota after just three runs, killing the simulation mid-run with no way to recover. Zep is a hard dependency with no option to swap it out or run without it, which makes the framework’s viability entirely contingent on a third-party SaaS quota.

    The most telling limitation was what happened to our actual predictions. We asked two questions: market direction (up or down) and a list of five S&P 500 names most likely to be the day’s top percentage gainers. MiroFish answered the first and ignored the second – the final report replaced our requested ticker list with vague sector commentary like “defensives are likely to outperform.”

    The plain Gemini control arm, given the exact same seed and prompt in a single call with no simulation, answered both questions cleanly every time.

    When both arms did produce a direction call, they agreed – suggesting the swarm added no information the underlying model didn’t already have on its own.

    Finally, MiroFish has no ability to look up live information: agents reason only from the uploaded seed and whatever the LLM remembers from training, so we had to hand-compile all market data ourselves each morning. The “prediction” is only as current as the seed you write.

    Prediction quality – Thursday 14 May 2026

    The only trading day where we got a clean end-to-end run was May 14. Both arms received an identical seed compiled before the 09:30 ET open, summarizing where stocks finished the day before (the S&P 500 closed at 7,444.25 on May 13), the higher-than-expected inflation data released earlier that week, rising oil prices and the trend of investors shifting money into safer, more defensive sectors.

    A snapshot for the generated report for May 14th.

    Q1 – Direction. Both the MiroFish swarm and the plain Gemini control arm predicted DOWN. The S&P 500 closed at 7,501.24, up 0.77% – both were wrong.

    Q2 – Top 5 tickers. Contrary to the pattern we saw in earlier runs, the swarm did produce a ticker list this time: NVDA, AMZN, META, GOOGL, MSFT – five mega-cap names that read more like a list of the largest S&P 500 constituents by market cap than an attempt at predicting the day’s biggest movers. The control arm picked WMT, BABA, DE, AMAT, CAVA – a more varied selection but equally untethered from the actual outcome. The day’s real top five gainers were CSCO (+13.4%), JBHT (+7.1%), APP (+7.0%), TTWO (+6.8%) and F (+6.7%). Neither arm placed a single name in the actual top five.

    To put the picks in context, we ranked where each predicted ticker actually finished relative to the rest of the S&P 500 on the day (1.0 = best performer, 0.0 = worst). The swarm’s picks landed at the 96th, 71st, 47th, 29th and 16th percentiles; the control arm’s at the 67th, 62nd, 17th and 16th (BABA and CAVA are not S&P 500 constituents, so they could not be scored at all – the control arm hallucinated two of its five picks). Both sets are scattered across the distribution with no concentration near the top – exactly what you would expect from random selection, not informed prediction.
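
    For reference, a rank score of this kind can be computed as below. This is a sketch of one reasonable definition (position in the sorted return list, scaled to the 0 to 1 range), not necessarily the exact formula we used; the tickers and returns in the example are made up for illustration.

```python
def percentile_ranks(returns, picks):
    """Map each pick to a rank in [0, 1] (1.0 = best, 0.0 = worst).

    `returns` maps ticker -> daily return for every index constituent.
    Picks that are not constituents get None, since they cannot be scored.
    Assumes at least two constituents; ties are broken arbitrarily.
    """
    ordered = sorted(returns, key=returns.get)  # worst performer first
    position = {ticker: i for i, ticker in enumerate(ordered)}
    last = len(ordered) - 1
    return {t: (position[t] / last if t in position else None) for t in picks}


# Hypothetical five-stock universe, not real market data.
toy_returns = {"AAA": 0.05, "BBB": -0.02, "CCC": 0.01, "DDD": 0.00, "EEE": -0.04}
print(percentile_ranks(toy_returns, ["AAA", "DDD", "ZZZ"]))
# {'AAA': 1.0, 'DDD': 0.5, 'ZZZ': None}
```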

    One day is not a verdict on anything. But as a tool evaluation it told us what we needed to know: after days of debugging, the framework produced a single directional prediction that was (a) wrong, (b) identical to what a plain API call returned and (c) completely off on the ticker question. The swarm added complexity without adding information.

    So is it worth it?

    MiroFish’s tagline is “Predict Anything.” That is an ambitious claim and our experience suggests it gets ahead of where the framework actually is. The idea behind it – many LLM-driven agents debating a question from different angles before converging on an answer – is genuinely interesting and there may well be problem domains where that kind of structured disagreement surfaces insights a single model call would miss: scenario planning, policy deliberation, qualitative research synthesis. But the implementation is not ready to deliver on the premise. The setup is fragile, the dependency on Zep’s free tier makes sustained experimentation impractical and the system offers little control over core parameters like agent count. When we did get a complete run, the swarm’s prediction matched what a single API call to the same model produced – same direction call, same lack of accuracy on tickers – suggesting that the multi-agent overhead added no new signal in our test. One trading day is far too small a sample to draw sweeping conclusions and a fairer test would use a domain where diverse perspectives matter more than quantitative precision. As for the cloud bill in our title: because we ended up on gemini-2.5-flash-lite, an older and lightweight model, the entire experiment cost us around $1. Still, for anyone considering MiroFish today, the gap between the tagline and the out-of-the-box experience is wide enough to warrant caution.

  • Inside the MemPalace: Does the structure earn its keep?


    MemPalace is an open-source, local-first AI memory system that went viral after actress Milla Jovovich released it on GitHub and shared it on her personal accounts (Milla’s Insta reel). We wanted to see what’s behind the hype.

    In typical chats, an AI forgets anything that doesn’t fit in its context window. A common workaround is to summarize past conversations and index the summaries in a flat vector database (flat meaning there’s no hierarchy). The problem is that summarization loses information. Specific names, numbers and offhand preferences might get dropped, so when you later reference one of those details, the system has nothing concrete to retrieve and you have to re-explain. To combat this, MemPalace stores the exact conversation and every project file verbatim, so the original wording is always available to semantic search.

    MemPalace ships with a built-in MCP server exposing 29 tools. Perhaps the most important are mempalace_search, which performs a search across the palace or in specific wings/rooms, and mempalace_status, which returns an overview of the entire palace.


    The Palace hierarchy

    The framework uses the Palace hierarchy, a metaphor from the ancient Greek method of loci. As an example, imagine a freelance designer with two clients: a bakery called “Sweet Rise” and a fitness app called “FitLoop”. The Palace is the entire file system (all your stored memories).

    Wings sit at the top level and represent major entities – a person, a project, or a domain:

    • Wing Sweet Rise – everything related to the bakery project
    • Wing FitLoop – everything related to the fitness app

    Rooms live inside a wing and correspond to specific topics:

    • Wing Sweet Rise – Rooms: brand-identity, packaging, website
    • Wing FitLoop – Rooms: onboarding-flow, brand-identity, push-notifications

    Halls are conceptual categories that describe how memories relate: facts, events, discoveries, preferences or advice:

    • Facts hall – storing things like “the primary brand color is #F4A261”
    • Preferences hall – “The client prefers Arial over Serif fonts”
    • Decisions hall – “We decided to go for a CAD-drawn logo on March 19”

    Tunnels are cross-wing connections. For example, both Sweet Rise and FitLoop have a brand-identity Room, so MemPalace would create a tunnel connecting the two, letting you answer questions like “What are the major learnings regarding brand identity from all my projects?”

    Drawers hold the original verbatim text chunks and serve as the primary retrieval unit. For example, a Drawer inside the “Preferences” Hall of the Sweet Rise Wing would contain “Client said: ‘We absolutely hate Serif as a font. Don’t use it anywhere!’”

    Closets sit above drawers as an optional summary layer that points back to the underlying verbatim content. A closet for the above Drawer in the Sweet Rise Wing would say “Sweet Rise has clearly stated it prefers non-corporate fonts.”
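
    To make the vocabulary concrete, here is the designer example as a small in-memory model. This is purely illustrative: the class names, fields and the `find_tunnels` helper are hypothetical and say nothing about how MemPalace actually stores or links its data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Drawer:
    text: str                     # verbatim chunk, the primary retrieval unit
    hall: str                     # category of the memory, e.g. "facts" or "preferences"
    closet: Optional[str] = None  # optional summary pointing back at the text


@dataclass
class Room:
    name: str
    drawers: list = field(default_factory=list)


@dataclass
class Wing:
    name: str
    rooms: dict = field(default_factory=dict)  # room name -> Room


def find_tunnels(wings):
    """Return room names shared by more than one wing, with the wings they connect."""
    seen = {}
    for wing in wings:
        for room_name in wing.rooms:
            seen.setdefault(room_name, []).append(wing.name)
    return {room: names for room, names in seen.items() if len(names) > 1}


sweet_rise = Wing("Sweet Rise", {n: Room(n) for n in ("brand-identity", "packaging", "website")})
fitloop = Wing("FitLoop", {n: Room(n) for n in ("onboarding-flow", "brand-identity", "push-notifications")})
sweet_rise.rooms["brand-identity"].drawers.append(
    Drawer(
        text="Client said: 'We absolutely hate Serif as a font. Don't use it anywhere!'",
        hall="preferences",
        closet="Sweet Rise has clearly stated it prefers non-corporate fonts.",
    )
)

print(find_tunnels([sweet_rise, fitloop]))  # {'brand-identity': ['Sweet Rise', 'FitLoop']}
```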


    How we tested it

    We tested MemPalace on a public benchmark called RAGBench, specifically its HotPotQA test split. This benchmark contains a set of questions and the exact passages that contain the answer.

    To make the dataset compatible with MemPalace, we saved it in the form of documents. Then we loaded the documents into MemPalace, asked every test question and checked how often the correct sentences showed up in MemPalace’s top 5 results.
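
    The metric itself is straightforward. The sketch below shows one common definition of Recall@k: for each question, the fraction of ground-truth passages that appear in the top k results, averaged over questions. The function name, the input layout and the toy passage IDs are assumptions for illustration, not taken from the MemPalace or RAGBench code.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Mean fraction of ground-truth passages found in the top-k results.

    `retrieved` maps question id -> ranked list of passage ids returned by the system.
    `relevant` maps question id -> list of ground-truth passage ids.
    """
    scores = []
    for question_id, gold in relevant.items():
        if not gold:
            continue  # skip questions without ground truth
        top_k = set(retrieved.get(question_id, [])[:k])
        scores.append(len(top_k & set(gold)) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: q1 finds one of its two gold passages in the top 5, q2 finds its only one.
toy_retrieved = {
    "q1": ["p1", "p7", "p3", "p9", "p2", "p4"],
    "q2": ["p5", "p6", "p8", "p10", "p11"],
}
toy_relevant = {"q1": ["p1", "p4"], "q2": ["p5"]}
print(recall_at_k(toy_retrieved, toy_relevant))  # 0.75
```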

    To see whether MemPalace was actually doing something useful, we compared it against a simple vector-search baseline: a standard “find the most similar text by meaning” search (using a popular off-the-shelf embedding model called all-MiniLM-L6-v2) running over the exact same documents. Neither system used a language model at retrieval time, so the comparison is purely about how well each one finds the right information and how fast.


    The results

    The results suggest that MemPalace’s retrieval accuracy is probably overstated and that its hierarchy doesn’t seem to help, at least with the default settings. The project claims 96.6% Recall@5, but on RAGBench we measured only 83.8%. The plain vector search baseline scored 84.8% on the same data, beating MemPalace by 1 percentage point without any hierarchical structure at all.

    In other words, the simplest thing you could possibly build (take every document, embed it with a small open source model and look up the closest matches) outperformed a system whose entire pitch is that hierarchy and structure make retrieval better. If a flat baseline wins on a standard benchmark, then the structure is either not pulling its weight or only pulling it on the specific instances or types of data MemPalace was developed against.

    One important caveat is that we used MemPalace out of the box, with default settings and no per-dataset tuning. It is plausible that with a different chunking strategy, a stronger embedding model or hand tuned mining and retrieval parameters, the numbers would improve, possibly substantially.

    But that is also true of the baseline, and the point of an out-of-the-box test is to see what a user actually gets when they pick up the project and try it. On that test, MemPalace did not beat a few-line vector search script and the published 96.6% number did not hold up.

  • Quick Reactions Pod

    We monitor AI news – new models, libraries and tools – test them rapidly, and publish a short reaction on our blog.

    How we work 

    No deep dives – the goal is speed and signal, not exhaustive analysis. We get the thing up and running with minimal setup, test it against the main use cases it’s designed for, and write a short blog post capturing first impressions, what worked, what didn’t, and whether it’s worth a closer look.

    Our most recent work

    Learning through experience: teaching the viral Hermes agent to automate our work

    A 50-line python function outperformed every frontier LLM – With 100% accuracy

    MiroFish: Is swarm intelligence worth the cloud bill?

    Autoresearch: a closer look at the agent that runs its own experiments

    Inside the MemPalace: Does the structure earn its keep?

    OpenClaw for messaging: a closer look at WhatsApp and Telegram