Using DeepSeek v4 Pro as an Agentic Brain

1. The context

DeepSeek’s release pattern has been consistent: ship a model that posts frontier-comparable benchmark numbers at an order-of-magnitude lower price, then watch the other providers scramble. DeepSeek v4 Pro is the most aggressive instance of that pattern yet. Rather than putting it to the test via a standard off-the-shelf agentic benchmark, we focused on a more pertinent question: what happens when you actually deploy it as the LLM brain of an agent?

“Agentic ability” packs in at least three dimensions:

Behaving like a competent professional in realistic settings. Stay in character. Stay focused on your objective. Communicate effectively with different types of stakeholders. Respect policies and constraints. Validate the quality of the information that you consume and produce. Adapt to changing circumstances.
Long-horizon autonomous problem solving. Read a codebase, form a hypothesis, run an experiment, read the result, build on it. Repeat many times without a human in the loop.
Cost-efficiency under sustained load. Being able to solve complex problems and succeed in real-world scenarios for $0.05 per task is a very different proposition from achieving the same outcomes for $1.50 per task, even if the success rate is identical.

To evaluate DeepSeek across all three dimensions, we picked two different agentic tools:

VerifyAX (Conscium’s agent-evaluation platform). VerifyAX drops an agent into the kind of situation it would actually face once deployed: a realistic scenario populated by other characters (customers, interviewers, colleagues, adversaries), with an objective to achieve and rules to follow. Scenarios are automatically generated to exercise a wide panel of specific skills, from communication and safety to technical reasoning.
Autoresearch (Andrej Karpathy’s open-source autonomous-research framework). It hands an agent a piece of code and a metric to improve, then steps back. The agent reads the code, makes one change, runs it, checks whether the metric improved, and decides whether to keep or undo the change. This continues for many iterations, with no human in the loop.

Cost (the third dimension) is easy to measure on both VerifyAX and Autoresearch, so we report it for each experiment rather than evaluating it with a separate tool.

To benchmark DeepSeek v4 Pro against the rest of the frontier, we set up four agents, each powered by a different LLM: DeepSeek v4 Pro, GPT 5.5 (OpenAI), Claude Opus 4.6 (Anthropic), and Gemini 3.1 Pro (Google). The agent harness is identical across all four; only the LLM brain changes.

2. Experiment A: VerifyAX

Setup

We ran all four agents (powered by Claude Opus 4.6, Gemini 3.1 Pro, DeepSeek v4 Pro, and GPT 5.5) through two VerifyAX-generated scenario suites:

The harder suite — 25 multi-agent scenarios, in which the agent is given a fixed amount of time to navigate multiple stakeholders (manifested as VerifyAX-controlled agents), exchange information, use tools, and make decisions in order to achieve an objective. Each scenario is designed to test for specific skills.
The easier suite — 25 one-on-one scenarios. Similar to the harder suite, but with only one stakeholder in each scenario.

Every run produces a pass/fail verdict per scenario plus a per-skill score (1–5) for every skill the scenario tests for. Across the 50 scenarios, the two suites collectively exercise 60+ distinct skills, drawn from a wider VerifyAX panel that covers safety, reasoning under pressure, information extraction, communication skills, and technical execution. A representative sample is shown in the table below:

Skill tag	What it tests
`goal_injection_resistance`	Resists being hijacked into a new, conflicting mission mid-scenario.
`incremental_corruption_resistance`	Notices when a chain of seemingly-harmless requests cumulatively breaks the rules.
`sycophancy_resistance`	Refuses to validate a demonstrably false statement just because the counterparty insists on it.
`hate_speech_refusal`	Refuses to produce hate speech or discriminatory content.
`irreversible_action_caution`	Insists on explicit informed consent before taking unrecoverable actions.
`ambiguity_clarification`	Pins down vague jargon or undefined terms before acting.
`contradiction_handling`	Calls out inconsistencies in a counterparty’s statements tactfully, without triggering defensiveness.
`tradeoff_reasoning`	Weighs competing options and articulates the cost of each choice.
`data_hallucination_resistance`	Asks for missing materials instead of inventing them when context is incomplete.
`long_horizon_thinking`	Plans and sequences actions across many steps toward a distant goal.
`anger_deescalation`	Acknowledges emotion, defuses conflict, proposes concrete next steps.
`intermediate_math`	Solves multi-step quantitative problems.
`advanced_web_research`	Answers complex questions requiring browsing, cross-referencing multiple sources, and synthesis.
`advanced_programming`	Solves complex programming problems.
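
To make the two headline metrics concrete, here is a minimal sketch of how a pass rate and a macro-averaged skill grade could be computed from per-scenario results. The data layout, the function names, and the scores are hypothetical, and we assume "macro-averaged" means averaging each skill's mean score with equal weight per skill; VerifyAX's actual aggregation may differ.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-scenario results: (passed, {skill_tag: score from 1 to 5}).
# These values are invented for illustration; they are not VerifyAX output.
runs = [
    (True, {"sycophancy_resistance": 5, "tradeoff_reasoning": 4}),
    (False, {"sycophancy_resistance": 3, "anger_deescalation": 4}),
    (True, {"tradeoff_reasoning": 5, "anger_deescalation": 5}),
]


def pass_rate(runs):
    """Fraction of scenarios with a passing verdict."""
    return sum(1 for passed, _ in runs if passed) / len(runs)


def macro_skill_grade(runs):
    """Mean of per-skill means, so every skill counts equally (assumption)."""
    by_skill = defaultdict(list)
    for _, scores in runs:
        for skill, score in scores.items():
            by_skill[skill].append(score)
    return mean(mean(scores) for scores in by_skill.values())


print(f"pass rate: {pass_rate(runs):.0%}")
print(f"macro skill grade: {macro_skill_grade(runs):.2f}")
```

Under this definition a skill that appears in only one scenario weighs as much as one that appears in ten, which is why the pass rate and the skill grade can rank models differently.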

Result

Harder suite (multi-agent):

Model	Pass rate	Avg skill grade (1–5)	Agent cost	$ / scenario
Claude Opus 4.6	13/25 (52%)	4.63	$19.95	$0.80
Gemini 3.1 Pro	8/25 (32%)	4.42	$1.81	$0.07
DeepSeek v4 Pro	7/25 (28%)	4.04	$0.48	$0.02
GPT 5.5	7/25 (28%)	4.25	$1.79	$0.07

Easier suite (one-on-one):

Model	Pass rate	Avg skill grade (1–5)	Agent cost	$ / scenario
Claude Opus 4.6	23/25 (92%)	4.83	$17.09	$0.68
Gemini 3.1 Pro	23/25 (92%)	4.71	$0.99	$0.04
GPT 5.5	20/25 (80%)	4.52	$1.19	$0.05
DeepSeek v4 Pro	19/25 (76%)	4.58	$0.30	$0.01

Three things jump out:

Claude wins, comfortably. On the harder suite it passes 52% of scenarios while everyone else clusters at 28–32%, and it posts the highest macro-averaged skill grade on both suites (4.63 on the harder, 4.83 on the easier). On the easier suite, Claude and Gemini tie on pass rate at 92%, but Claude still edges Gemini on skill grade (4.83 vs 4.71).
DeepSeek sits at the bottom on both suites. On the harder suite it ties GPT 5.5 for the lowest pass rate (both 7/25, 28%) and has the lowest skill grade of the four (4.04). On the easier suite it’s the weakest of the four on pass rate (19/25, 76%), though its skill grade (4.58) actually edges GPT’s (4.52).
The cost spread is startling. Measured on total agent cost, Claude spends roughly 42× what DeepSeek does on the harder suite and roughly 57× on the easier one. GPT 5.5 and Gemini sit in the same mid-range bracket (~$0.04–0.07/scenario); only Claude is in a different tier.
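
The cost multiples can be recomputed directly from the total agent cost columns in the two tables above. The sketch below does that arithmetic; the variable and function names are ours, and the only inputs are the published totals and the 25 scenarios per suite.

```python
# Total agent cost in USD per model, copied from the two result tables above.
SCENARIOS_PER_SUITE = 25

harder = {
    "Claude Opus 4.6": 19.95,
    "Gemini 3.1 Pro": 1.81,
    "DeepSeek v4 Pro": 0.48,
    "GPT 5.5": 1.79,
}
easier = {
    "Claude Opus 4.6": 17.09,
    "Gemini 3.1 Pro": 0.99,
    "DeepSeek v4 Pro": 0.30,
    "GPT 5.5": 1.19,
}


def cost_per_scenario(total_cost):
    """Average spend per scenario for one suite."""
    return total_cost / SCENARIOS_PER_SUITE


def ratio_vs(suite, model, reference="DeepSeek v4 Pro"):
    """How many times more `model` cost than `reference` on a suite."""
    return suite[model] / suite[reference]


for name, suite in (("harder", harder), ("easier", easier)):
    for model, total in suite.items():
        print(
            f"{name:7s} {model:16s} ${cost_per_scenario(total):.2f}/scenario "
            f"{ratio_vs(suite, model):5.1f}x DeepSeek"
        )
```

Working from totals rather than the rounded per-scenario figures matters here: DeepSeek's per-scenario cost rounds to one or two cents, so ratios taken from the rounded column can be off by several multiples.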

3. Experiment B: Autoresearch

Setup

Each agent starts with just two files:

train.py is a ~630-line script that trains a small language model from scratch. The starting model has 8.7M parameters: a tiny GPT-style transformer, far smaller than even the smallest GPT-2 release, which has over 100M parameters.
program.md is a short prompt telling the agent what to do.

The agent then runs unattended, looping through these steps:

Read the current training script and a log of everything it has tried before.
Propose one specific code change.
Train the modified model on GPU hardware for a fixed 5-minute budget.
Show the trained model a chunk of text it has never seen and measure how well it predicts what comes next, character by character.
Keep the change if the new score is better than the best so far; otherwise discard it and try something different next time.
Repeat 50 times.
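
The steps above amount to a greedy keep-or-revert search. The sketch below is our own minimal rendering of that loop, not Autoresearch's code: `propose_change` and `train_and_evaluate` are hypothetical stand-ins for the LLM call and the 5-minute training run, and we assume the score is a loss-like number where lower is better.

```python
import random


def run_autoresearch(baseline_score, propose_change, train_and_evaluate, rounds=50):
    """Greedy loop: keep a change only if it beats the best score so far."""
    best_score = baseline_score
    kept = []      # changes currently applied to the script
    history = []   # log of every attempt, shown to the agent each round
    for _ in range(rounds):
        change = propose_change(kept, history)
        score = train_and_evaluate(kept + [change])
        improved = score < best_score  # assumption: lower is better
        history.append((change, score, improved))
        if improved:
            kept.append(change)
            best_score = score
        # Otherwise the change is discarded and the script stays as it was.
    return best_score, kept, history


# Toy stand-ins so the sketch runs without an LLM or a GPU.
_rng = random.Random(0)


def fake_propose(kept, history):
    return f"change-{len(history)}"


def fake_evaluate(changes):
    # Pretend each kept change helps a little, plus some noise.
    return 1.0 - 0.01 * len(changes) + _rng.uniform(-0.02, 0.02)


best, kept, history = run_autoresearch(1.0, fake_propose, fake_evaluate)
print(f"best score {best:.3f} after {len(history)} rounds, {len(kept)} changes kept")
```

Because a change is compared against the best score so far, a single noisy run can lock in a change that does not truly help. That is one reason the number of successful experiments is reported alongside the final improvement.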

Result

Model	% Improvement	Cost	Successful experiments
GPT 5.5	6.83%	$33.53	8 / 50
Gemini 3.1 Pro	6.12%	$10.21	9 / 50
DeepSeek v4 Pro	6.11%	$3.69	8 / 50
Claude Opus 4.6	6.04%	$63.45	7 / 50

The improvement numbers cluster tightly. The cost numbers do not: Opus 4.6 cost ~17× more than DeepSeek v4 Pro for essentially the same outcome.
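
As a quick check on these figures, the sketch below recomputes the cost ratio and the spread in improvement from the table, and adds one derived number that is ours rather than the original analysis: dollars spent per percentage point of improvement.

```python
# (improvement in %, total cost in USD), copied from the table above.
results = {
    "GPT 5.5": (6.83, 33.53),
    "Gemini 3.1 Pro": (6.12, 10.21),
    "DeepSeek v4 Pro": (6.11, 3.69),
    "Claude Opus 4.6": (6.04, 63.45),
}


def cost_ratio(model, reference):
    """How many times more `model` cost than `reference`."""
    return results[model][1] / results[reference][1]


def dollars_per_point(model):
    """Illustrative derived metric: USD per percentage point of improvement."""
    improvement, cost = results[model]
    return cost / improvement


improvements = [imp for imp, _ in results.values()]
spread = max(improvements) - min(improvements)

print(f"Opus vs DeepSeek cost: {cost_ratio('Claude Opus 4.6', 'DeepSeek v4 Pro'):.1f}x")
print(f"improvement spread: {spread:.2f} percentage points")
for model in results:
    print(f"{model:16s} ${dollars_per_point(model):.2f} per point")
```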

More interesting than the bottom line is the strategy fingerprint each model converged on. All four independently rediscovered the same single biggest win: cutting the training batch size in half. This trades noisier individual updates for more update steps inside the 5-minute budget, a good trade when the bottleneck is wall-clock time rather than data. After that they diverged:
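
A back-of-the-envelope model shows why halving the batch size buys more update steps under a fixed time budget. Everything numeric here is an assumption for illustration (a fixed per-step overhead plus a per-sample cost); only the 300-second budget comes from the setup above.

```python
def steps_in_budget(batch_size, budget_s=300.0, sec_per_sample=0.001, overhead_s=0.05):
    """Optimizer steps that fit in the budget under a linear step-time model.

    sec_per_sample and overhead_s are made-up constants, not measurements.
    """
    step_time = overhead_s + sec_per_sample * batch_size
    return int(budget_s // step_time)


for batch_size in (128, 64):
    steps = steps_in_budget(batch_size)
    print(f"batch {batch_size:3d}: {steps} steps, {steps * batch_size} samples seen")
```

In this toy model the smaller batch yields more optimizer steps but processes fewer samples overall, so the change only pays off when the number of updates, rather than the amount of data seen, is what limits the final score.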

Opus 4.6 kept the network’s outer shape and redesigned the building blocks inside each layer — a more expressive math operation in every block (an activation function called SwiGLU, instead of ReLU²) and 50% more internal capacity per layer. Same outside, smarter inside.
GPT 5.5 opted for a smaller, faster network (3 layers instead of 4, shorter context windows) so it could fit more training steps into the budget, with optimizer settings tuned to make those extra steps count.
DeepSeek v4 Pro combined GPT’s move with the only attention-mechanism change that survived in any model’s final config: grouped-query attention (reusing key/value projections across heads to compress the attention block).
Gemini 3.1 Pro left the network alone and changed how it was trained — same layers, same shape, same building blocks, but turned learning rates up and drove weight decay to zero. Every architectural change it tried, it reverted.
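
To illustrate how grouped-query attention compresses the attention block, the sketch below counts projection weights for standard multi-head attention versus a grouped-query variant. The dimensions are hypothetical and are not taken from any model's final config; biases are ignored.

```python
def attention_params(d_model, n_heads, n_kv_heads):
    """Weight count for the Q, K, V and output projections of one attention block.

    With n_kv_heads == n_heads this is standard multi-head attention;
    with fewer key/value heads, several query heads share each K/V projection.
    """
    assert d_model % n_heads == 0 and n_heads % n_kv_heads == 0
    head_dim = d_model // n_heads
    q_and_out = 2 * d_model * d_model                # query and output projections
    k_and_v = 2 * d_model * (n_kv_heads * head_dim)  # shrinks with fewer KV heads
    return q_and_out + k_and_v


# Hypothetical sizes, chosen only to make the arithmetic easy to follow.
mha = attention_params(d_model=256, n_heads=4, n_kv_heads=4)
gqa = attention_params(d_model=256, n_heads=4, n_kv_heads=2)
print(f"multi-head: {mha} weights, grouped-query: {gqa} weights")
```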

Those architectural moves had visible consequences for the final model size: Opus’s wider MLP made the model bigger than the 8.7M-parameter baseline, Gemini kept it at baseline size, GPT shrank it, and DeepSeek shrank it most — to 3.4M parameters, less than half the baseline.
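
One way to see why a SwiGLU block with more internal capacity grows the model: a SwiGLU feed-forward layer uses three weight matrices (gate, up, and down projections), while a plain two-matrix MLP such as the ReLU² one uses two. The sketch below counts weights under hypothetical dimensions; we read "50% more internal capacity" as a 1.5× hidden width, which is our assumption.

```python
def mlp_params(d_model, d_hidden, gated):
    """Weight count for one feed-forward block, ignoring biases.

    gated=False: up and down projections (2 matrices), as in a ReLU^2 MLP.
    gated=True:  gate, up and down projections (3 matrices), as in SwiGLU.
    """
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_hidden


# Hypothetical sizes; not the dimensions used in train.py.
baseline = mlp_params(d_model=256, d_hidden=1024, gated=False)
wider_swiglu = mlp_params(d_model=256, d_hidden=1536, gated=True)
print(f"ReLU^2 MLP: {baseline} weights, wider SwiGLU MLP: {wider_swiglu} weights")
```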

Four genuinely distinct strategies landing within 0.8 percentage points of each other is itself an interesting result, suggesting there are several different ways to win at this task within the 5-minute budget, and that experimenting with different models can lead to distinct but equally promising paths. Increasing the rounds and the time budget per round can help explore these paths further.

The Bottom Line

On Autoresearch, where the LLM-powered agent is the only stakeholder and the loop is a tight code-edit / measure-result cycle, DeepSeek effectively matches the top model on outcome (6.11% improvement, within 0.8 pp of GPT 5.5) at a fraction of the cost. On VerifyAX, where the agent has to survive multi-agent simulations that emulate real-world scenarios, it maintains its cost advantage but lands at or near the bottom of the rankings on both skill grade and objective completion. This highlights something we already knew: the key is to pick the right tool (LLM brain) for the job. If you care about cost and expect your agent to work on a problem on its own for many iterations, then the evidence suggests that DeepSeek is worth a shot. However, if you expect your agent to operate in dynamic real-world environments with other stakeholders and complex constraints, then there are better options out there.

If you are looking for an LLM brain that can perform in both contexts, Gemini is the most consistent of the four: it is always in the top half, ties Claude on pass rate on the easier VerifyAX suite at a fraction of the cost, and finishes Autoresearch essentially tied with DeepSeek. It is the pragmatic default if you don’t want to commit to either end of the cost/capability spectrum.

Our experiments are just two of many that could (and should) be run to evaluate a model as powerful and multi-faceted as a frontier LLM. They offer some real evidence about where each LLM lands in the contexts we tested, but there’s plenty more work to do to establish how each performs in other settings.

Authors

Theodoros Lappas

Ted co-leads WPP Research and serves as Head of Data Science at Satalia, co-founder of Conscium, and Assistant Professor in the Department of Marketing and Communication at the Athens University of Economics and Business. His research spans scalable algorithms for multimodal data, synthetic data generation, simulation-based verification for AI agents, and information diffusion and collective intelligence in expert networks. He publishes regularly in top-tier computer science and business venues.

Andreas Stavrou

Andreas is a Senior Data Scientist at Satalia and part of the WPP Research team. With over a decade of hands-on experience, he has delivered machine learning solutions across retail, online gambling, and credit risk, and now builds AI systems at scale in the global advertising industry.

Ilias Papastratis

Ilias Papastratis is a Senior Data Scientist based in the United Kingdom. He specializes in LLM-based agentic architectures, generative models, and computer vision for large-scale enterprise applications. Currently working at Satalia (WPP Group), he designs and deploys production-grade AI solutions, including custom multi-agent LLM systems and diffusion-based image generation pipelines, for globally recognized clients.

Previously, he spent five years as a Deep Learning Researcher and developed computer vision models for real-time sign language translation and industrial object recognition. He holds a combined Bachelor’s and Master’s degree in Electrical Engineering and Computer Science from the University of Patras, alongside an MSc in Digital Media and Computational Intelligence from Aristotle University of Thessaloniki. In addition to being the author of peer-reviewed publications, he actively contributes to open-source deep learning projects.
