Datasets
| Dataset | Tasks |
|---|---|
harbor run -d scale-ai/swe-bench-pro

SWE-Bench Pro is a benchmark designed to provide a rigorous and realistic evaluation of AI agents for software engineering. It was developed to address several limitations in existing benchmarks by tackling four key challenges.
SWE-Bench Pro addresses these gaps by sourcing tasks from diverse and complex codebases, including consumer applications, B2B services, and developer tools. To reduce contamination risk, the public and held-out OSS subsets use strong copyleft licenses (e.g., GPL). The private subset consists of proprietary codebases from startup partners.
The benchmark is significantly more challenging than its predecessors; top models score around 23% on the SWE-Bench Pro public set, compared to 70%+ on SWE-Bench Verified. This provides a more accurate measure of an agent’s true problem-solving capabilities in environments that mirror professional software development.
Read the paper here: https://scale.com/research/swe_bench_pro
Check out the GitHub and view trajectories here: https://docent.transluce.org/dashboard/032fb63d-4992-4bfc-911d-3b7dafcb931f