{"id":1541,"date":"2026-05-19T14:52:36","date_gmt":"2026-05-19T14:52:36","guid":{"rendered":"https:\/\/cms.research.wpp.com\/?p=1541"},"modified":"2026-05-20T09:14:10","modified_gmt":"2026-05-20T09:14:10","slug":"mirofish-is-swarm-intelligence-worth-the-cloud-bill","status":"publish","type":"post","link":"https:\/\/cms.research.wpp.com\/?p=1541","title":{"rendered":"MiroFish: Is swarm intelligence worth the cloud bill?"},"content":{"rendered":"\n<h3 class=\"wp-block-heading has-large-font-size\">What is MiroFish<\/h3>\n\n\n\n<p class=\"has-medium-font-size\"><a href=\"https:\/\/github.com\/666ghj\/MiroFish\"><strong>MiroFish<\/strong><\/a> is an open-source multi-agent simulation framework released on GitHub by <a href=\"https:\/\/github.com\/666ghj\/MiroFish\">BaiFu<\/a>. The pitch is that a swarm of LLM-driven agents, each with its own persona, will outperform a single one-shot LLM call on open-ended questions where there is no obviously correct chain of reasoning. Agents are seeded from a free-text brief, given a generated domain ontology and made to debate across many rounds before the system synthesizes a final answer. We wanted to see whether the swarm justifies its complexity.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">The system ships as a Docker stack: a Python backend, a web frontend and a Zep Cloud integration for persistent agent memory. 
It talks to LLMs through the OpenAI-compatible chat-completions interface, which means almost any provider can be plugged in by changing two environment variables (we used Google&#8217;s Gemini).<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"551\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/image-3-1024x551.png\" alt=\"\" class=\"wp-image-1542\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/image-3-1024x551.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/image-3-300x162.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/image-3-768x413.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/image-3-1536x827.png 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/image-3-2048x1103.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">A screen capture of one of the experiments showing the entity taxonomy the tool created.<\/figcaption><\/figure>\n\n\n\n<p class=\"has-medium-font-size\">A run has two stages. First, the user uploads a free-text seed and presses <code>Start Engine<\/code>. MiroFish parses the seed, asks the LLM to generate a domain ontology (entities, relationships, attributes and so on) and derives the number of agents from how many entities the ontology contains, so the swarm is sized to the problem rather than chosen by the user. Second, the user sets the number of debate rounds and runs the simulation: agents talk to each other under the ontology, reading from and writing to Zep. At the end, the system synthesizes a single answer in the requested format. 
The intended use cases are questions where many perspectives plausibly disagree but a single answer is required, like market and policy forecasting, strategic planning or qualitative research synthesis.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Our test setup<\/h3>\n\n\n\n<p class=\"has-medium-font-size\">We wanted a clean, time-boxed evaluation with an objective ground truth, so we picked <strong>same-day S&amp;P 500 prediction<\/strong>. Each morning before the 09:30 ET open we asked MiroFish two questions:<\/p>\n\n\n\n<ul class=\"wp-block-list has-medium-font-size\">\n<li>Q1: will the S&amp;P 500 close higher or lower than yesterday&#8217;s close?<\/li>\n\n\n\n<li>Q2: which five S&amp;P 500 names will be the day&#8217;s top percentage gainers?<\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">Both questions are settled by the closing print six and a half hours later. Both are hard, and both let us compare the swarm against a single-shot LLM call on the same inputs. To keep the comparison clean, we ran a control arm in parallel: identical seed, identical prompt, sent to a single gemini-2.5-flash chat-completions call with no swarm, no memory, no tools.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">MiroFish does not browse the web or pull in data on its own &#8211; the seed is the only input the agents have to work with, so the creators say it should be a comprehensive, free-text brief covering everything relevant to the question you want the swarm to answer. 
Each morning we had to compile a ~4,000-character seed.txt summarizing the pre-open state of the world:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">Prior session closes (S&amp;P 500, Nasdaq, Dow, Russell 2000, VIX, 10Y yield, WTI and Brent crude, gold, Bitcoin)<\/li>\n\n\n\n<li class=\"has-medium-font-size\">The key drivers behind yesterday&#8217;s moves, according to the news<\/li>\n\n\n\n<li class=\"has-medium-font-size\">The macro overhang (US\u2013Iran war, Trump\u2013Xi summit, hot April CPI and PPI prints)<\/li>\n\n\n\n<li class=\"has-medium-font-size\">Today&#8217;s economic calendar<\/li>\n\n\n\n<li class=\"has-medium-font-size\">Overnight Asian and European action<\/li>\n\n\n\n<li class=\"has-medium-font-size\">US futures<\/li>\n\n\n\n<li class=\"has-medium-font-size\">Sentiment indicators (CNN Fear &amp; Greed, AAII, Robinhood prediction-market pricing on ES strikes)<\/li>\n\n\n\n<li class=\"has-medium-font-size\">Notable single-name news<\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">The prompt asked the swarm to converge on exactly two lines:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">DIRECTION: &lt;UP or DOWN&gt; against the previous close<\/li>\n\n\n\n<li class=\"has-medium-font-size\">TOP 5: &lt;T1&gt;, &lt;T2&gt;, &lt;T3&gt;, &lt;T4&gt;, &lt;T5&gt; of S&amp;P 500 tickers expected to be the day&#8217;s largest percentage gainers, unordered.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">Our initial intention was to run the experiment for a few consecutive trading days in May 2026, giving us some independent direction calls and sets of five tickers from each arm. However, things did not go exactly as planned.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Setting Up MiroFish<\/h3>\n\n\n\n<p class=\"has-medium-font-size\">Getting MiroFish to produce a single usable prediction turned out to be a multi-day debugging exercise. 
The published Docker image is stale &#8211; it ships an older build with a Chinese-only interface and no English option, so we had to rebuild it from source before we could even read the UI. Once we got past that, the system crashed every time we uploaded our seed: MiroFish was built and tested against a specific Chinese LLM provider (Alibaba&#8217;s Qwen) and its LLM client uses a hardcoded max-tokens parameter that is too low for the ontology response Gemini produces. The output gets truncated mid-JSON, which naturally fails parsing. Fixing that required either bumping the token limit in the backend code or switching to a simpler, faster, lower-reasoning model. We opted for the latter, using an older but capable lightweight model, <code>gemini-2.5-flash-lite<\/code>.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">With the engine finally running, we discovered that <strong>you cannot choose how many agents participate in the simulation<\/strong> &#8211; the system decides for you based on how many entities it extracts from your seed text. We wanted 100 agents; we got around 25. The only way to get more is to stuff the seed with more names, which means you cannot separate &#8220;give the swarm more context&#8221; from &#8220;make the swarm bigger.&#8221; On top of that, the free tier of Zep Cloud &#8211; the memory service the agents depend on &#8211; ran out of quota after just three runs, killing the simulation mid-run with no way to recover. Zep is a hard dependency with no option to swap it out or run without it, which makes the framework&#8217;s viability entirely contingent on a third-party SaaS quota.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">The most telling limitation was what happened to our actual predictions. We asked two questions: market direction (up or down) and a list of five S&amp;P 500 names most likely to be the day&#8217;s top percentage gainers. 
MiroFish answered the first and <strong>ignored the second<\/strong> &#8211; the final report replaced our requested ticker list with vague sector commentary like &#8220;defensives are likely to outperform.&#8221;<\/p>\n\n\n\n<p class=\"has-medium-font-size\">The plain Gemini control arm, given the exact same seed and prompt in a single call with no simulation, answered both questions cleanly every time.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">When both arms did produce a direction call, they agreed &#8211; suggesting the swarm added no information the underlying model didn&#8217;t already have on its own.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Finally, MiroFish has no ability to look up live information: agents reason only from the uploaded seed and whatever the LLM remembers from training, so we had to hand-compile all market data ourselves each morning. The &#8220;prediction&#8221; is only as current as the seed you write.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Prediction quality &#8211; Thursday 14 May 2026<\/h3>\n\n\n\n<p class=\"has-medium-font-size\">The only trading day where we got a clean end-to-end run was May 14. 
Both arms received an identical seed compiled before the 09:30 ET open, summarizing where stocks finished the day before (the S&amp;P 500 closed at 7,444.25 on May 13), the higher-than-expected inflation data released earlier that week, rising oil prices and the trend of investors shifting money into safer, more defensive sectors.<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"579\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/31A5A13D-683E-46C9-8FBC-C8C4B9C39DB3-1024x579.png\" alt=\"\" class=\"wp-image-1546\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/31A5A13D-683E-46C9-8FBC-C8C4B9C39DB3-1024x579.png 1024w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/31A5A13D-683E-46C9-8FBC-C8C4B9C39DB3-300x170.png 300w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/31A5A13D-683E-46C9-8FBC-C8C4B9C39DB3-768x434.png 768w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/31A5A13D-683E-46C9-8FBC-C8C4B9C39DB3-1536x869.png 1536w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/31A5A13D-683E-46C9-8FBC-C8C4B9C39DB3-2048x1158.png 2048w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption class=\"wp-element-caption\">A snapshot of the generated report for May 14th.<\/figcaption><\/figure>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Q1 &#8211; Direction.<\/strong> Both the MiroFish swarm and the plain Gemini control arm predicted <strong>DOWN<\/strong>. 
The S&amp;P 500 closed at <strong>7,501.24<\/strong>, up 0.77% &#8211; both were wrong.<\/p>\n\n\n\n<p class=\"has-medium-font-size\"><strong>Q2 &#8211; Top 5 tickers.<\/strong> Contrary to the pattern we saw in earlier runs, the swarm did produce a ticker list this time: <strong>NVDA, AMZN, META, GOOGL, MSFT<\/strong> &#8211; five mega-cap names that read more like a list of the largest S&amp;P 500 constituents by market cap than an attempt at predicting the day&#8217;s biggest movers. The control arm picked <strong>WMT, BABA, DE, AMAT, CAVA<\/strong> &#8211; a more varied selection but equally untethered from the actual outcome. The day&#8217;s real top five gainers were <strong>CSCO (+13.4%), JBHT (+7.1%), APP (+7.0%), TTWO (+6.8%) and F (+6.7%)<\/strong>. Neither arm placed a single name in the actual top five.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">To put the picks in context, we ranked where each predicted ticker actually finished relative to the rest of the S&amp;P 500 on the day, expressed as a percentile (100th = best performer, 0th = worst). The swarm&#8217;s picks landed at the 96th, 71st, 47th, 29th and 16th percentiles; the control arm&#8217;s at the 67th, 62nd, 17th and 16th (BABA and CAVA are not S&amp;P 500 constituents, so they could not be scored at all &#8211; two of the control arm&#8217;s five picks fell outside the index it was asked about). Both sets are scattered across the distribution with no concentration near the top &#8211; consistent with what you would expect from random selection, not informed prediction.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">One day is not a verdict on anything. But as a tool evaluation it told us what we needed to know: after days of debugging, the framework produced a single directional prediction that was (a) wrong, (b) identical to what a plain API call returned and (c) completely off on the ticker question. 
The swarm added complexity without adding information.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">So is it worth it?<\/h3>\n\n\n\n<p class=\"has-medium-font-size\">MiroFish&#8217;s tagline is &#8220;<strong>Predict Anything<\/strong>.&#8221; That is an ambitious claim and our experience suggests it gets ahead of where the framework actually is. The idea behind it &#8211; many LLM-driven agents debating a question from different angles before converging on an answer &#8211; is genuinely interesting and there may well be problem domains where that kind of structured disagreement surfaces insights a single model call would miss: scenario planning, policy deliberation, qualitative research synthesis. But the implementation is not ready to deliver on the premise. The setup is fragile, the dependency on Zep&#8217;s free tier makes sustained experimentation impractical and the system offers little control over core parameters like agent count. When we did get a complete run, the swarm&#8217;s prediction matched what a single API call to the same model produced &#8211; same direction call, same lack of accuracy on tickers &#8211; suggesting that the multi-agent overhead added no new signal in our test. One trading day is far too small a sample to draw sweeping conclusions and a fairer test would use a domain where diverse perspectives matter more than quantitative precision. As for the cloud bill in our title: because we ended up on <code>gemini-2.5-flash-lite<\/code>, an older and lightweight model, the entire experiment cost us around $1. Still, for anyone considering MiroFish today, the gap between the tagline and the out-of-the-box experience is wide enough to warrant caution.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We tested MiroFish, an open-source multi-agent swarm framework that promises to outpredict single LLM calls by having dozens of AI agents debate each other. We pointed it at same-day S&#038;P 500 forecasting with an objective ground truth. 
After days of debugging a stale Docker image, hardcoded token limits and a third-party memory service that ran out of free quota in three runs, the swarm produced one usable prediction: it called the market direction wrong, picked tickers no better than random and at no point diverged from what a plain single-shot Gemini call returned for free in seconds. The core idea is interesting, but the implementation is not there yet.<\/p>\n","protected":false},"author":15,"featured_media":1551,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"tags":[],"ppma_author":[{"id":15,"display_name":"Nikos Gkikizas","first_name":"Nikos","last_name":"Gkikizas","nickname":"nikos.gkikizas","user_nicename":"nikos-gkikizas","user_email":"nikos.gkikizas@satalia.com","biographical_info":"Nikos is a Staff Data Scientist at Satalia. He started off as a mining engineer and later specialized in explosives and blasting engineering. Through signal processing he was introduced to data science, with which he gradually fell in love. Over the last decade he's been working in Data Science, having developed various simulation, predictive and optimization solutions.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/signal-2026-04-01-111218.jpeg","job_title":"Staff Data Scientist","is_lead":false,"display_as_researcher":true,"order_priority":null}],"class_list":["post-1541","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry"],"acf":{"related_pods":[1362],"featured":false},"authors":[{"term_id":31,"user_id":15,"is_guest":0,"slug":"nikos-gkikizas","display_name":"Nikos Gkikizas","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/signal-2026-04-01-111218.jpeg","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":"","biographical_info":"Nikos is a Staff Data Scientist at Satalia. 
He started off as a mining engineer and later specialized in explosives and blasting engineering. Through signal processing he was introduced to data science, with which he gradually fell in love. Over the last decade he's been working in Data Science, having developed various simulation, predictive and optimization solutions."}],"featured_image_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/image-13.png","featured_image_sizes":{"thumbnail":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/image-13-150x150.png","medium":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/image-13-300x300.png","large":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/image-13.png","full":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/image-13.png"},"_links":{"self":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/1541","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1541"}],"version-history":[{"count":10,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/1541\/revisions"}],"predecessor-version":[{"id":1557,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/1541\/revisions\/1557"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/media\/1551"}],"wp:attachment":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1541"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?res
t_route=%2Fwp%2Fv2%2Ftags&post=1541"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=1541"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}