New DeepSWE Benchmark Puts GPT-5.5 Ahead of Claude Opus 4.7


TL;DR

  • New Benchmark: Datacurve has launched DeepSWE, and its first ranking puts GPT-5.5 ahead of Claude Opus 4.7.
  • Verifier Debate: Datacurve says verifier design can distort leaderboard gaps by passing weak answers or rejecting valid ones.
  • Buyer Stakes: Teams comparing coding models may need to judge benchmark design and repository scope, not just first-place claims.

Datacurve has launched DeepSWE, a coding benchmark that reshuffles a closely watched leaderboard and reopens the argument over how top AI coding systems should be measured. Its debut results show a wider spread among top models, with GPT-5.5 opening a larger gap over Claude Opus 4.7 than older benchmark tables had shown. For buyers and model teams, the immediate question is whether that wider spread reflects stronger testing, different grading, or both.

DeepSWE’s first published table puts GPT-5.5 first at 70% ± 4%. That ranking cuts against the current standings on SWE-Bench Pro, a benchmark that had helped define recent bragging rights in AI coding. Teams comparing coding assistants now have to ask not only who tops one ranking, but what kind of work the ranking is rewarding.

Serena Ge, co-founder and CEO of Datacurve, framed the project as an attempt to bring benchmark testing closer to everyday development conditions.

“The launch of DeepSWE is to restore the real scenarios of developers’ work and uncover the areas where top models truly differ.”

Serena Ge, co-founder and CEO of Datacurve

How DeepSWE Measures Coding Models

DeepSWE spans 113 tasks across 91 active open-source repositories in TypeScript, Go, Python, JavaScript, and Rust. Datacurve presents the benchmark as contamination-free and long-horizon, meaning it is designed to avoid leaked fixes and to test broader repository work rather than a single isolated patch. In practice, that pushes the evaluation toward longer task chains where a model has to stay coherent across more of a codebase.
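
For a sense of how much a 113-task benchmark can resolve, the sketch below computes the standard error of a pass rate under a simple binomial model. It assumes each task is an independent pass/fail trial, which is an illustrative assumption; Datacurve's own method for deriving its "± 4%" figure is not described here, and the function name is hypothetical.

```python
import math

def binomial_standard_error(pass_rate: float, n_tasks: int) -> float:
    """Standard error of a pass rate, assuming independent pass/fail tasks."""
    return math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)

# Figures reported in the article.
N_TASKS = 113
REPORTED_RATE = 0.70

# Assumption: a plain binomial model, not necessarily Datacurve's method.
se = binomial_standard_error(REPORTED_RATE, N_TASKS)
print(f"standard error at {REPORTED_RATE:.0%} over {N_TASKS} tasks: {se:.3f}")
```

Under that assumption the standard error comes out a little above four percentage points, the same order of magnitude as the published margin. Gaps between models that are smaller than this are hard to distinguish from noise on a task set of this size.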

Datacurve said its audit found 8.5% false positives and 24.0% false negatives on SWE-Bench Pro. A verifier that passes weak answers or rejects valid ones can compress a leaderboard, shrinking the measured gaps between models and masking real differences in capability.
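
The compression effect can be illustrated with a short calculation. The sketch below assumes the 8.5% figure is the share of wrong solutions a verifier accepts and the 24.0% figure is the share of correct solutions it rejects, applied uniformly to every model. The article does not spell out those definitions, and the two "true" solve rates are hypothetical values chosen only for illustration.

```python
def observed_pass_rate(true_rate: float, fp_rate: float, fn_rate: float) -> float:
    """Pass rate a noisy verifier would report for a given true solve rate."""
    accepted_correct = true_rate * (1.0 - fn_rate)   # correct answers that pass
    accepted_wrong = (1.0 - true_rate) * fp_rate     # wrong answers that slip through
    return accepted_correct + accepted_wrong

# Audit figures quoted in the article (interpretation is an assumption).
FP_RATE = 0.085
FN_RATE = 0.240

# Hypothetical true solve rates for two models, not measured values.
model_a_true = 0.70
model_b_true = 0.60

a_observed = observed_pass_rate(model_a_true, FP_RATE, FN_RATE)
b_observed = observed_pass_rate(model_b_true, FP_RATE, FN_RATE)

true_gap = model_a_true - model_b_true
observed_gap = a_observed - b_observed

print(f"true gap:     {true_gap:.4f}")
print(f"observed gap: {observed_gap:.4f}")
print(f"gap retained: {observed_gap / true_gap:.1%}")
```

With these inputs, the gap between the two models shrinks by a factor of 1 − 0.085 − 0.240 = 0.675, so a ten-point difference in true capability would show up as under seven points on the leaderboard. Real verifier errors are unlikely to be uniform across models or tasks, so this is a simplified picture of the mechanism, not an estimate of any actual leaderboard.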