{"id":1558,"date":"2026-05-27T08:42:38","date_gmt":"2026-05-27T08:42:38","guid":{"rendered":"https:\/\/cms.research.wpp.com\/?p=1558"},"modified":"2026-05-27T08:42:39","modified_gmt":"2026-05-27T08:42:39","slug":"a-50-line-python-function-outperformed-every-frontier-llm-with-100-accuracy","status":"publish","type":"post","link":"https:\/\/cms.research.wpp.com\/?p=1558","title":{"rendered":"A 50-line python function outperformed every frontier LLM &#8211; With 100% accuracy"},"content":{"rendered":"\n<p class=\"has-medium-font-size\"><strong>The experimental setup<\/strong><\/p>\n\n\n\n<p class=\"has-medium-font-size\">We created a simple framework for testing an LLM&#8217;s reasoning capacity in a multi-step scenario. It comprised an engine that creates consistent logical rules. For example, a rule could be: <code>If the given number is divisible by 14, add 231 to it and pass it to rule 100. Otherwise, create a new number by adding the digits of the given number and pass it to rule 31<\/code>. We also created a deterministic way of parsing and iterating through rules, using basic python programming.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">To understand what a run looks like, suppose that the random ruleset we created for a single trial consists of the following 5 rules:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\"><code>Rule 1. If the given number is greater than 61, get the absolute value and pass it to rule 2. Otherwise get the absolute value and pass it to rule 1<\/code><\/li>\n\n\n\n<li class=\"has-medium-font-size\"><code>Rule 2. If the given number is greater than 339, add 354 and pass it to rule 3. Otherwise your new value is the sum of digits ignoring sign and pass it to rule 4<\/code><\/li>\n\n\n\n<li class=\"has-medium-font-size\"><code>Rule 3. If the given number is divisible by 274, subtract 274 and pass it to rule 5. 
Otherwise get the absolute value and pass it to rule 5<\/code><\/li>\n\n\n\n<li class=\"has-medium-font-size\"><code>Rule 4. If the given number is greater than 431, multiply by 199 and pass it to rule 2. Otherwise your new value is the sum of digits ignoring sign and pass it to rule 2<\/code><\/li>\n\n\n\n<li class=\"has-medium-font-size\"><code>Rule 5. If the given number is divisible by 110, get the absolute value and pass it to rule 1. Otherwise add 487 and pass it to rule 4<\/code><\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">Now suppose your initial value is 500 and you start from rule 2, for a total of 2 iterations.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">For the first iteration, rule 2 says that, if your value is greater than 339 (which is true), you must add 354 (result 854) and pass it to rule 3.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">For the second iteration, rule 3 says that, if the given number (854) is divisible by 274, then you need to subtract 274 and pass it to rule 5. Otherwise (which is our case since 854 is not divisible by 274), get the absolute value (854) and pass it to rule 5.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Finally, we end up with a final value of 854.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Our full experimental design was as follows:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">Generate N logical rules<\/li>\n\n\n\n<li class=\"has-medium-font-size\">Pick a random rule to serve as the starting rule.<\/li>\n\n\n\n<li class=\"has-medium-font-size\">Sample a random number of iterations to perform, from 10 to 100. 
The task ends when all iterations are complete, at which point the current numerical value is reported.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">We repeated the above experiment for various values of N (randomly sampled between 10 and 10000).<\/p>\n\n\n\n<p class=\"has-medium-font-size\">We then deterministically calculated the correct result and compared it to the response given by two LLM-based agents:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">Normal Agent: No access to tools. The entire set of rules was provided to the agent during the first interaction, to hold in its context window.<\/li>\n\n\n\n<li class=\"has-medium-font-size\">Tool Agent: Given access to two tools: one for deterministically fetching a rule at a specific index (e.g. \u201cgo to rule 56\u201d) and one giving it the ability to write and execute Python code snippets.<\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">We used different LLMs as the brains for the above agents: <code>opus 4.6<\/code> and <code>sonnet 4.6<\/code> from Anthropic, <code>gemini 2.5 pro<\/code> and <code>gemini 2.5 flash<\/code> from Google, and <code>deepseek-v4-pro<\/code> from DeepSeek AI.<\/p>\n\n\n\n<h3 class=\"wp-block-heading has-medium-font-size\">62.7% accuracy &#8211; and that was the good arm<\/h3>\n\n\n\n<p class=\"has-medium-font-size\">As expected, the Tool Agent was on average more accurate than the Normal Agent, with an accuracy of 62.7% versus 52.9%.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">The per-model breakdown is revealing. 
For the Normal Agent, <code>deepseek-v4-pro<\/code> is a clear leader with an accuracy of 86.7%, higher even than the Tool Agent with the same model as the LLM brain.<\/p>\n\n\n\n<figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"660\" height=\"499\" src=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/results.png\" alt=\"\" class=\"wp-image-1559\" srcset=\"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/results.png 660w, https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/results-300x227.png 300w\" sizes=\"auto, (max-width: 660px) 100vw, 660px\" \/><figcaption class=\"wp-element-caption\">Experimental results per model and experiment mode<\/figcaption><\/figure>\n\n\n\n<p class=\"has-medium-font-size\">All the other models perform better when placed inside the Tool Agent. The largest gain is seen for <code>gemini-2.5-flash<\/code>, whose accuracy jumps from 26.7% to 66.7% (from Normal to Tool-based). The gains are much less noticeable for the rest of the models.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">In the agentic mode, both the total number of rules and the maximum number of iterations seem to be negatively correlated with the probability of the model producing a correct result (p-values of 0.05 and 0.003, respectively). More specifically, every 100 additional rules in a trial reduce the probability of the LLM providing a correct result by ~7%, while each additional iteration in the maximum iteration count reduces it by ~2%.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Poor &#8220;reasoning&#8221; choices<\/strong><\/h3>\n\n\n\n<p class=\"has-medium-font-size\">Perhaps the most interesting part of the experiment was diving into the reasoning logs of models in the agentic setup. 
There you can see some strange reasoning patterns and some questionable choices when it comes to tool calling.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">For example, here are some Python evaluations that <code>opus 4.6<\/code> executed. Some are pretty reasonable, like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">checking divisibility of large integers: <code>{\"eval\":\"39310614 \/\/ 2517\"}<\/code> or<\/li>\n\n\n\n<li class=\"has-medium-font-size\">safely summing the digits of a number: <code>{\"eval\":\"sum(int(d) for d in str(abs(1790)))\"}<\/code><\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">Others are a bit weird, but you could still accept them as an overly safe practice, like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">subtracting a positive integer from 0: <code>{\"eval\":\"0 - 4478\"}<\/code> or<\/li>\n\n\n\n<li class=\"has-medium-font-size\">checking the maximum digit of a two-digit number: <code>{\"eval\":\"max(int(d) for d in str(13))\"}<\/code><\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">Unfortunately, many are nonsensical, like:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\">checking the result of dividing zero by any number: <code>{\"eval\":\"0 \/\/ 9699\"}<\/code><\/li>\n\n\n\n<li class=\"has-medium-font-size\">checking the absolute value of a non-negative, single-digit integer: <code>{\"eval\":\"abs(0)\"}<\/code> and <code>{\"eval\":\"abs(1)\"}<\/code><\/li>\n\n\n\n<li class=\"has-medium-font-size\">double-checking 0 added to any number: <code>{\"eval\":\"0 + 1790\"}<\/code><\/li>\n\n\n\n<li class=\"has-medium-font-size\">getting the sum of digits of 0: <code>{\"eval\":\"sum(int(d) for d in str(abs(0)))\"}<\/code><\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">Such nonsensical choices are of course not unique to <code>opus 4.6<\/code>. 
Here are some similar ones from the rest of the models:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"has-medium-font-size\"><code>gemini 2.5 pro<\/code>: <code>{\"eval\": \"0 * 5227\"}<\/code>, <code>{\"eval\": \"0 * 8451\"}<\/code>, <code>{\"eval\": \"sum(int(d) for d in str(0))\"}<\/code>, <code>{\"eval\": \"abs(0)\"}<\/code>, <code>{\"eval\": \"0 * 439\"}<\/code><\/li>\n\n\n\n<li class=\"has-medium-font-size\"><code>sonnet 4.6<\/code>: <code>{\"eval\":\"max(int(d) for d in str(abs(2)))\"}<\/code>, <code>{\"eval\":\"0 == 6204\"}<\/code>, <code>{\"eval\":\"1 &lt; 6866\"}<\/code>, <code>{\"eval\":\"1 == 9248\"}<\/code>, <code>{\"eval\":\"sum(1 for d in str(0) if int(d) % 2 == 0)\"}<\/code><\/li>\n\n\n\n<li class=\"has-medium-font-size\"><code>gemini 2.5 flash<\/code>: <code>{\"eval\": \"min(int(digit) for digit in str(2))\"}<\/code>, <code>{\"eval\": \"0 * 451\"}<\/code><\/li>\n\n\n\n<li class=\"has-medium-font-size\"><code>deepseek-v4-pro<\/code>: <code>{\"eval\": \"int(max(str(0)))\"}<\/code>, <code>{\"eval\":\"int(max(str(0)))\"}<\/code>, <code>{\"eval\":\"0+815\"}<\/code>, <code>{\"eval\": \"int(min(str(abs(3))))\"}<\/code><\/li>\n<\/ul>\n\n\n\n<p class=\"has-medium-font-size\">The logs are swamped with such choices, and they are not the result of a prompt like &#8220;always use the tools to check your math&#8221;. On the contrary, the directive in the prompt was to call the Python tool only &#8220;if you want to evaluate a short, one-line expression in python&#8221;.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Why do LLMs ace everything except anything new?<\/strong><\/h3>\n\n\n\n<p class=\"has-medium-font-size\">It is very difficult to pin down why LLMs fail at simple but novel tasks, but the failure itself is consistent with what the literature suggests. 
The ARC-AGI-3 benchmark reveals that the <a href=\"https:\/\/arcprize.org\/blog\/arc-agi-3-gpt-5-5-opus-4-7-analysis\">success rate of the highest performing commercial LLMs in completing novel tasks that the average human can easily complete is less than 1%<\/a>.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">LLM-based systems are revolutionary in how efficiently they retrieve information semantically. That is why many people feel empowered when they first get their hands on tools like Claude Code or Codex. Using them, it\u2019s now trivial to create a simple web page or a small app.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">However, the reason why this happens is likely more related to information retrieval than to genuine, innovative (out-of-distribution) &#8220;thinking&#8221;, despite what the news headlines suggest. In other words, whenever Claude, Codex or Antigravity prototypes a nice, working website, it&#8217;s highly likely that the code it produced, or most of it, already existed in a similar form in its training set.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">That becomes obvious in cases like the supposedly innovative kernel exploit that Mythos uncovered, which <a href=\"https:\/\/www.linkedin.com\/posts\/resilientcyber_mythos-magic-or-training-data-influence-activity-7458560679915159552-tP86\/\">turned out to be an exact copy of a Kerberos CVE written in 2007<\/a>. In other words, the fact that something appears on the 15th page of Google Search, which makes it practically undiscoverable for the average researcher, doesn\u2019t mean that it\u2019s not useful for an LLM in generating a &#8220;novel&#8221; solution. 
<a href=\"https:\/\/dash.harvard.edu\/server\/api\/core\/bitstreams\/2775c669-4a55-4774-818a-27740374cf95\/content\">Rediscovery due to inaccessible old sources is a well-studied phenomenon in science<\/a> and LLMs can actually help reduce that.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Don\u2019t give your credit card to something that computes abs(0)<\/h3>\n\n\n\n<p class=\"has-medium-font-size\">First of all, providing unsupervised access to LLM-based (agentic) systems can be very dangerous. It\u2019s not the wisest thing to grant full access to your laptop or credit card to something that needs to double-check the absolute value of 0 or the max digit of 1. The dangers of using LLMs in critical applications can be seen in various articles that demonstrate what can go wrong. Guardrails must be used to defend against the possibility of <a href=\"https:\/\/lyrie.ai\/research\/research\/2026-04-28-claude-code-terraform-destroy-autonomous-breach\">losing 2.5 years\u2019 worth of customer data<\/a> or <a href=\"https:\/\/www.theguardian.com\/technology\/2026\/apr\/29\/claude-ai-deletes-firm-database\">permanently deleting your production database<\/a>.<\/p>\n\n\n\n<p class=\"has-medium-font-size\">Secondly, our results reinforce the point that in many cases there\u2019s no need for an &#8220;agentic&#8221; solution. In our example, creating a Python function that parses the rules and executes them took just a few minutes. The function has an average runtime of 100ms, whereas the average LLM solution took anywhere from 25s to more than a minute. More importantly, the traditional system had a 100% success rate versus the 62.7% average of the &#8220;agentic&#8221; mode. 
As for the cost, the average session used about 30k tokens, which at $5\/million tokens comes to about 15 cents per query, making the &#8220;agentic&#8221; route both far more expensive and more prone to errors.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>We gave five frontier LLMs a simple but novel multi-step task: follow a chain of randomly generated logical rules, each applying a basic arithmetic operation to a number and routing to the next rule, for a given number of iterations. We tested it with and without agentic tools. Tools helped on average (62.7% vs 52.9%), but DeepSeek-v4-pro hit 86.7% without them. The agentic logs were full of nonsensical tool calls &#8211; like computing the absolute value of 0 &#8211; while a plain Python function solved the same task in about 100 milliseconds with 100% accuracy. LLMs remain powerful retrieval engines, but on genuinely novel multi-step reasoning tasks they are still far from dependable.<\/p>\n","protected":false},"author":15,"featured_media":1564,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"tags":[],"ppma_author":[{"id":15,"display_name":"Nikos Gkikizas","first_name":"Nikos","last_name":"Gkikizas","nickname":"nikos.gkikizas","user_nicename":"nikos-gkikizas","user_email":"nikos.gkikizas@satalia.com","biographical_info":"Nikos is a Staff Data Scientist at Satalia. He started off as a mining engineer and later specialized in explosives and blasting engineering. Through signal processing he was introduced to data science, with which he gradually fell in love. 
Over the last decade he's been working in Data Science, having developed various simulation, predictive and optimization solutions.","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/signal-2026-04-01-111218.jpeg","job_title":"Staff Data Scientist","is_lead":false,"display_as_researcher":true,"order_priority":null}],"class_list":["post-1558","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry"],"acf":{"related_pods":[1362],"featured":false},"authors":[{"term_id":31,"user_id":15,"is_guest":0,"slug":"nikos-gkikizas","display_name":"Nikos Gkikizas","avatar_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/04\/signal-2026-04-01-111218.jpeg","0":null,"1":"","2":"","3":"","4":"","5":"","6":"","7":"","8":"","biographical_info":"Nikos is a Staff Data Scientist at Satalia. He started off as a mining engineer and later specialized in explosives and blasting engineering. Through signal processing he was introduced to data science, with which he gradually fell in love. 
Over the last decade he's been working in Data Science, having developed various simulation, predictive and optimization solutions."}],"featured_image_url":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/stupid-bot2.png","featured_image_sizes":{"thumbnail":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/stupid-bot2-150x150.png","medium":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/stupid-bot2-300x300.png","large":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/stupid-bot2.png","full":"https:\/\/cms.research.wpp.com\/wp-content\/uploads\/2026\/05\/stupid-bot2.png"},"_links":{"self":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/1558","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/users\/15"}],"replies":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=1558"}],"version-history":[{"count":5,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/1558\/revisions"}],"predecessor-version":[{"id":1565,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/posts\/1558\/revisions\/1565"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=\/wp\/v2\/media\/1564"}],"wp:attachment":[{"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=1558"}],"wp:term":[{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=1558"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/cms.research.wpp.com\/index.php?rest_route=%2Fwp%2Fv2%2Fppma_author&post=1558"}],"curies":[{"
name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}