Benchmark Reveals AI Search Agents Guess Before They Search

TL;DR

Live Test: LiveBrowseComp found sub-2 percent closed-book accuracy on fresher benchmark questions.
Search Pattern: The paper found many queries came from model-led hunches instead of retrieved evidence.
Score Shift: Reported live-browsing results dropped MiniMax sharply and moved DeepSeek ahead of GLM-5.1.
Benchmark Design: The test uses 335 human-authored questions tied to facts from the prior 90 days.
Buyer Risk: Enterprise teams may need live-benchmark checks before trusting browsing scores in procurement.

LiveBrowseComp found sub-2 percent closed-book accuracy when it tested AI agents on current facts instead of familiar answer patterns. The result matters because browsing leaderboards still shape buying decisions for teams that need agents to find and use fresh information.

LiveBrowseComp is a dynamic deep-search benchmark introduced in May 2026 to evaluate Large Language Model (LLM) search agents beyond their pre-trained, memory-backed knowledge. It addresses “Intrinsic Knowledge Dependence” (IKD), the tendency for AI agents to use web searches merely to verify what they already know rather than executing genuine, evidence-driven discovery.

LiveBrowseComp uses 335 human-authored questions tied to facts from the prior 90 days and filtered to avoid globally salient events.
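The 90-day freshness rule can be pictured as a simple date filter. The sketch below is only an illustration, not the authors' pipeline: the `Question` class, its `fact_date` field, and the helper names are hypothetical, and the paper's separate filter for globally salient events is not modeled here.

```python
from dataclasses import dataclass
from datetime import date, timedelta

FRESHNESS_WINDOW_DAYS = 90  # window size reported in the article


@dataclass
class Question:
    text: str
    fact_date: date  # hypothetical field: when the underlying fact became public


def is_fresh(question, as_of, window_days=FRESHNESS_WINDOW_DAYS):
    """Return True if the fact is dated within the window ending at as_of."""
    age = as_of - question.fact_date
    # Reject facts dated after as_of as well as facts older than the window.
    return timedelta(0) <= age <= timedelta(days=window_days)


def select_fresh(questions, as_of):
    """Keep only the questions whose facts fall inside the freshness window."""
    return [q for q in questions if is_fresh(q, as_of)]
```

A question pool would need to be re-filtered on each evaluation date, since a fact that is fresh today ages out of the window later.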

Compared with their scores on the static benchmark, search-augmented systems posted a 25 to 40 point drop. BrowseComp-style tables are often treated as a proxy for agentic research ability, but a live benchmark asks whether a system can locate newer evidence, keep it in view, and let it change the answer.

The authors summed up the problem in one short line: “The loop is model-led, not evidence-led.” Many agents, they argue, begin with a likely answer and use search to check it, so the web verifies a hunch instead of supplying the response.

Why LiveBrowseComp Reordered the Field

BrowseComp already serves as a common yardstick for agentic research ability. LiveBrowseComp changes the task from broad recall to current evidence work, which lets the benchmark separate memorized knowledge from retrieval that stays grounded in a fresh source.

In the paper’s tests, agents answered up to 44.5 percent of BrowseComp questions without tools. Strong no-tools performance can look impressive on its own, yet it suggests that much of a high benchmark score may come from memory rather than from live browsing.

Diagnostic runs found more than half of search queries came from internally generated hypotheses instead of leads gathered from retrieved pages. In those runs, the model’s own hunch could steer the search path before outside material had much chance to redirect the answer.
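One way to picture that diagnostic is as a share computed over labeled queries. The sketch below assumes each search query in a run has already been tagged by origin. The labels `"hypothesis"` and `"retrieved"` and the function name are hypothetical, and the article does not describe how the paper assigns such labels.

```python
from collections import Counter


def model_led_share(query_origins):
    """Fraction of queries that came from the model's own hypotheses.

    query_origins: list of labels, one per search query. The labels are
    hypothetical: "hypothesis" for a model-generated guess, "retrieved"
    for a lead taken from a page the agent actually fetched.
    """
    if not query_origins:
        raise ValueError("need at least one labeled query")
    counts = Counter(query_origins)
    return counts["hypothesis"] / len(query_origins)


# Example run with made-up labels: 3 of 5 queries are model-led.
run = ["hypothesis", "retrieved", "hypothesis", "hypothesis", "retrieved"]
print(f"model-led share: {model_led_share(run):.0%}")  # model-led share: 60%
```

A share above 0.5 on such a run would correspond to the "more than half" pattern the diagnostic runs describe.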

Removing answer-supporting evidence from the index pushed evaluated agents below their closed-book baselines. For one MiniMax example linked to the earlier BrowseComp benchmark race, the reported score dropped from 44.5 percent to 8.0 percent.
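For a sense of scale, the reported MiniMax figures can be turned into an absolute and a relative decline. The snippet below is plain arithmetic on the two numbers quoted above, and the helper name `score_drop` is just an illustration.

```python
def score_drop(before, after):
    """Return the drop in percentage points and as a share of the starting score."""
    points = before - after
    relative = points / before
    return points, relative


points, relative = score_drop(44.5, 8.0)
print(f"{points:.1f} points, {relative:.1%} relative decline")
# 36.5 points, 82.0% relative decline
```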

Across the four-model set, the actual evidence use rate stayed between 24.7 percent and 32.2 percent. For teams evaluating research assistants or support agents, that is the practical warning: a respectable browsing score can still hide a workflow in which retrieved material plays a limited role in the final answer.

LiveBrowseComp also reshuffled familiar standings. DeepSeek v3.2 moved to the top of the LiveBrowseComp ranking, while GLM-5.1 slipped from its earlier BrowseComp standing to the middle of the pack.

In the same comparison, Kimi K2.6 reached 62 percent on the BrowseComp-ZH closed-book variant. Static scores and live scores measure different things, because one can reward answer recognition while the other tests whether a system can keep a current source in play through the final response.

Earlier Benchmark Prestige Meets New Evidence

Several of the same model families had already built benchmark prestige before LiveBrowseComp appeared. In October 2025, MiniMax’s earlier model positioning helped establish that reputation, and in February 2026 MiniMax priced M2.5 at 1 USD for an hour of running at 100 tokens per second.

GLM-5.1’s earlier BrowseComp standing likewise framed it as a leading open-weight option before LiveBrowseComp changed the order. Historical contrast explains why the reshuffle matters: the new paper asked whether that prestige held up once the questions depended on information new enough to weaken memorized answers.

What This Changes for AI Search Claims

Procurement teams feel the difference between static and live scores directly, because buyers choosing a research assistant, support bot, or browsing agent are not just comparing raw scores. They are deciding whether a system can find a fresh source, keep that source in view, and rely on it when the prompt moves beyond material the model may already know.

The benchmark uses 335 human-authored questions tied to the prior 90 days, and the authors argue dynamic, time-sensitive checks should become standard if browsing scores are supposed to reflect evidence-led retrieval rather than answer verification.
