Benchmark Reveals AI Search Agents Guess Before They Search


TL;DR

  • Live Test: LiveBrowseComp found sub-2 percent closed-book accuracy on fresher benchmark questions.
  • Search Pattern: The paper found many queries came from model-led hunches instead of retrieved evidence.
  • Score Shift: Reported live-browsing results dropped MiniMax sharply and moved DeepSeek ahead of GLM-5.1.
  • Benchmark Design: The test uses 335 human-authored questions tied to facts from the prior 90 days.
  • Buyer Risk: Enterprise teams may need live-benchmark checks before trusting browsing scores in procurement.

LiveBrowseComp found closed-book accuracy below 2 percent when it tested AI agents on current facts instead of familiar answer patterns. Browsing leaderboards still shape buying decisions for teams that need agents to find and use fresh information.

LiveBrowseComp is a dynamic deep-search benchmark introduced in May 2026 to evaluate Large Language Model (LLM) search agents beyond their pre-trained, memory-backed knowledge. It addresses “Intrinsic Knowledge Dependence” (IKD), the tendency for AI agents to use web searches merely to verify what they already know rather than executing genuine, evidence-driven discovery.

LiveBrowseComp uses 335 human-authored questions tied to facts from the prior 90 days and filtered to avoid globally salient events.
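To make the freshness rule concrete, here is a minimal Python sketch of how a 90-day eligibility filter could work. It is an illustration under stated assumptions, not the benchmark's actual pipeline: the `Question` fields, the `globally_salient` flag, and both function names are hypothetical, and only the 90-day window comes from the benchmark description.

```python
from dataclasses import dataclass
from datetime import date, timedelta

FRESHNESS_WINDOW_DAYS = 90  # window stated in the benchmark description


@dataclass(frozen=True)
class Question:
    text: str
    fact_date: date         # hypothetical field: date the underlying fact emerged
    globally_salient: bool  # hypothetical flag, e.g. set by a human reviewer


def is_eligible(q: Question, as_of: date,
                window_days: int = FRESHNESS_WINDOW_DAYS) -> bool:
    """Keep questions whose fact falls inside the window and is not globally salient."""
    age = as_of - q.fact_date
    in_window = timedelta(0) <= age <= timedelta(days=window_days)
    return in_window and not q.globally_salient


def select_live_questions(questions: list[Question], as_of: date) -> list[Question]:
    """Return only the questions that pass the freshness and salience checks."""
    return [q for q in questions if is_eligible(q, as_of)]
```

Re-running a filter like this with a later `as_of` date would push older questions out of the pool, which is one way a benchmark could stay ahead of model training cutoffs.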

Compared with the static benchmark, search-augmented systems scored 25 to 40 points lower on the live one. BrowseComp-style tables are often treated as a proxy for agentic research ability, but a live benchmark asks whether a system can locate newer evidence, keep it in view, and let it change the answer.

The authors summed up the problem in one short line: “The loop is model-led, not evidence-led.” Many agents, they argue, begin with a likely answer and use search to check it, so the web verifies a hunch instead of supplying the response.
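One rough way to picture the difference between a model-led and an evidence-led query is to check whether a search query introduces terms that appear neither in the question nor in anything retrieved so far. The Python sketch below does only that. It is a hypothetical heuristic for illustration, not the authors' measurement method, and the function names and trace format are assumptions.

```python
import re


def content_terms(text: str) -> set[str]:
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def is_model_led(query: str, question: str, evidence_so_far: list[str]) -> bool:
    """Flag a query that uses terms found in neither the question nor prior evidence."""
    known = content_terms(question)
    for snippet in evidence_so_far:
        known |= content_terms(snippet)
    return bool(content_terms(query) - known)


def model_led_share(question: str, steps: list[tuple[str, list[str]]]) -> float:
    """steps is an ordered list of (query, retrieved_snippets) pairs (assumed format)."""
    evidence: list[str] = []
    flagged = 0
    for query, snippets in steps:
        if is_model_led(query, question, evidence):
            flagged += 1
        evidence.extend(snippets)  # evidence only counts for later queries
    return flagged / len(steps) if steps else 0.0
```

Exact token matching is a blunt instrument: a harmless rewording such as "acquisition" for "acquired" would also be flagged, so a real analysis would need stemming or entity matching at minimum.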