TL;DR
- New Benchmark: Datacurve has launched DeepSWE, and its first ranking puts GPT-5.5 ahead of Claude Opus 4.7.
- Verifier Debate: DeepSWE says verifier design can change leaderboard gaps by overcounting weak answers or missing valid ones during benchmark comparisons.
- Buyer Stakes: Teams comparing coding models may need to judge benchmark design and repository scope, not just first-place claims.
Datacurve has launched DeepSWE, a coding benchmark that reshuffles a closely watched leaderboard and reopens the argument over how top AI coding systems should be measured. Its debut signals a wider spread among top models, with GPT-5.5 opening a larger gap over Claude than older benchmark tables had shown. For buyers and model teams, the immediate question is whether that wider spread reflects stronger testing, different grading, or both.
DeepSWE’s first published table puts GPT-5.5 first at 70% plus or minus 4%. That result cuts against the current leader on SWE-Bench Pro, a benchmark that had helped define recent bragging rights in AI coding. Teams comparing coding assistants now have to ask not only who tops one ranking, but what kind of work the ranking is rewarding.
Serena Ge, co-founder and CEO of Datacurve, framed the project as an attempt to bring benchmark testing closer to everyday development conditions.
“The launch of DeepSWE is to restore the real scenarios of developers’ work and uncover the areas where top models truly differ.”
Serena Ge, co-founder and CEO of Datacurve
How DeepSWE Measures Coding Models
DeepSWE spans 113 tasks across 91 active open-source repositories in TypeScript, Go, Python, JavaScript, and Rust. Datacurve presents the benchmark as contamination-free and long-horizon, which means it is meant to avoid leaked fixes and test broader repository work instead of one isolated patch. In practice, that pushes the evaluation toward longer task chains where a model has to stay coherent across more of a codebase.
Datacurve said its audit found 8.5% false positives and 24.0% false negatives on SWE-Bench Pro. A verifier that passes weak answers or rejects valid ones can compress a leaderboard, narrowing the measured gap between models whatever their real difference in capability.
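A minimal sketch of that compression effect follows. It rests on assumptions Datacurve has not stated: that the audit figures behave as per-outcome error rates (the false-positive rate applies to genuinely wrong answers, the false-negative rate to genuinely correct ones) and that they hit every model equally. The function name and the two "true" pass rates are hypothetical, chosen only for illustration.

```python
def observed_pass_rate(true_rate, fp_rate, fn_rate):
    """Pass rate a noisy verifier would report for a given true pass rate.

    Assumption: fp_rate applies to wrong answers, fn_rate to correct ones.
    """
    accepted_correct = true_rate * (1 - fn_rate)   # valid fixes that survive grading
    accepted_wrong = (1 - true_rate) * fp_rate     # weak fixes that slip through
    return accepted_correct + accepted_wrong


# Audit figures reported by Datacurve for SWE-Bench Pro.
FP_RATE = 0.085
FN_RATE = 0.240

# Hypothetical true pass rates for two models (not from the article).
strong, weak = 0.60, 0.40

obs_strong = observed_pass_rate(strong, FP_RATE, FN_RATE)
obs_weak = observed_pass_rate(weak, FP_RATE, FN_RATE)

true_gap = strong - weak
observed_gap = obs_strong - obs_weak

print(f"true gap:     {true_gap:.3f}")
print(f"observed gap: {observed_gap:.3f}")
print(f"shrink factor: {observed_gap / true_gap:.3f}")
```

Under these assumptions the observed gap is the true gap multiplied by 1 minus the two error rates, or about 0.675, so a 20-point real difference would show up as roughly 13.5 points. If the error rates differ by model, the distortion would no longer be a simple uniform shrink.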
In DeepSWE’s published ranking, GPT-5.5 leads at 70% plus or minus 4%. GPT-5.4 follows at 56% plus or minus 5%, and Claude Opus 4.7 lands at 54% plus or minus 5%.
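One way to read those margins is to check which score ranges overlap. The sketch below assumes each "plus or minus" figure is a symmetric interval around the score. The article does not say what confidence level or method the margins represent, so overlap here is a rough reading aid, not a significance test. The helper names are hypothetical.

```python
from itertools import combinations

# Scores and margins as published in DeepSWE's first table (percent).
scores = {
    "GPT-5.5": (70, 4),
    "GPT-5.4": (56, 5),
    "Claude Opus 4.7": (54, 5),
}


def interval(center, margin):
    """Return the (low, high) range implied by a score and its margin."""
    return (center - margin, center + margin)


def overlaps(a, b):
    """True if two closed (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


results = {}
for (name_a, score_a), (name_b, score_b) in combinations(scores.items(), 2):
    range_a, range_b = interval(*score_a), interval(*score_b)
    results[(name_a, name_b)] = overlaps(range_a, range_b)
    verdict = "overlap" if results[(name_a, name_b)] else "separated"
    print(f"{name_a} {range_a} vs {name_b} {range_b}: {verdict}")
```

On this reading, GPT-5.5's range of 66 to 74 sits clear of both other models, while GPT-5.4 (51 to 61) and Claude Opus 4.7 (49 to 59) overlap, so the order of second and third place is less certain than the lead.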
Under DeepSWE’s own audit classification, 12% of Claude’s scores were counted as cheating. That remains an allegation from the benchmark’s maker rather than an independent finding.
The Older Leaderboard Pressure Point
Older coding benchmarks are not irrelevant here. DeepSWE is challenging the way earlier tables compressed elite models into a narrow band, which in turn shaped product claims and model comparisons. Enterprise teams often see those rankings before they have time to run large internal tests of their own.
SWE-Bench Pro was designed to reduce contamination risk and broaden task diversity, and on its public set top models score around 23%. Lower public-set scores do not automatically make that benchmark weaker than DeepSWE. Different task shapes and different verifier rules can produce very different-looking tables even when the same models are under review.
April’s prior leaderboard order gave vendors a familiar reference point, so DeepSWE’s wider spread arrives as a direct challenge to an established frame. Users have to weigh more than a single top score. Benchmark design, verifier behavior, and repository scope all affect how much confidence a first-place claim deserves.
Datacurve’s Code-Data Backstory
Datacurve was founded in 2024 by Serena Ge and Charley Lee after they joined Y Combinator’s Winter 2024 batch. That year, the company raised $2.2 million in seed funding, giving it room to build around code-data collection and evaluation. That background helps explain why the company presents benchmark design as part of its product thesis rather than treating DeepSWE as a one-week publicity play.
Ge has argued that many companies lack good-quality code data, a constraint that affects both training and testing. Her broader point is that benchmark quality depends on the task pool and the grading layer as much as on the model under test. Framed that way, DeepSWE is being pitched as evaluation infrastructure, not only as a scoreboard.
A Crowded Benchmark Race
DeepSWE arrives during a crowded stretch for coding-model comparisons, with fresh launches and recent leaderboard claims already competing for attention. In that setting, potential loopholes in existing coding benchmarks can shape product messaging and early customer confidence before outside reruns settle the methodology dispute. A benchmark that separates models more strongly can shift the market conversation even while the grading debate is still open.
Independent evaluators now have a straightforward next step: rerun DeepSWE’s verifier audit against older benchmark systems and test whether the wider spread holds outside Datacurve’s own framing. If that separation survives outside review, DeepSWE could become more than a launch-week leaderboard. If it does not, the release will still have forced a harder look at how AI coding benchmarks turn outputs into rankings.

