swe-bench/swe-bench-verified
`harbor run -d swe-bench/swe-bench-verified`

SWE-bench Verified
SWE-bench Verified is a software-engineering dataset for evaluating whether AI agents can resolve real GitHub issues. Each task gives an agent an issue report and a repository checkout at the relevant base commit; the agent edits the code, and the solution is graded by running issue-specific tests.
SWE-bench contains 2,294 tasks from real GitHub issues and pull requests across 12 popular Python repositories. SWE-bench Verified is a 500-task human-validated subset, created with OpenAI, that checks for clear issue statements, correct tests, and solvable tasks.
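As a concrete sketch of the grading criterion (not Harbor's actual harness), a task counts as resolved only if every previously failing test now passes and no previously passing test regresses. The `FAIL_TO_PASS` / `PASS_TO_PASS` field names follow the public SWE-bench dataset schema; the task record below is hypothetical:

```python
# Illustrative only: SWE-bench-style resolution check.
# A patch resolves a task iff all FAIL_TO_PASS tests now pass
# and all PASS_TO_PASS tests still pass.

def is_resolved(task: dict, test_results: dict) -> bool:
    """test_results maps a test id to True (passed) or False (failed)."""
    fixed = all(test_results.get(t, False) for t in task["FAIL_TO_PASS"])
    no_regression = all(test_results.get(t, False) for t in task["PASS_TO_PASS"])
    return fixed and no_regression

# Hypothetical task record following the dataset's field names.
task = {
    "instance_id": "example__repo-123",
    "FAIL_TO_PASS": ["test_issue_fixed"],
    "PASS_TO_PASS": ["test_existing_behavior"],
}

print(is_resolved(task, {"test_issue_fixed": True, "test_existing_behavior": True}))
print(is_resolved(task, {"test_issue_fixed": False, "test_existing_behavior": True}))
```

Note that a patch which fixes the issue but breaks an existing test is still graded as unresolved.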
Parity
Harbor parity records are in adapters/swebench/parity_experiment.json. Historical validation compares the original SWE-bench Verified harness to the Terminal-Bench (TB) adapter, and then the legacy TB adapter to the Harbor adapter.
| Comparison | Agent | Model | Tasks | Reference | Harbor/TB |
|---|---|---|---|---|---|
| Original vs. TB | codex@0.2.0 | o4-mini | 500 | 53.11 | 53.11 |
| Original vs. TB | terminus-v1.5-xml | claude-opus-4-20250514 | 500 | 66.4 | 66.4 |
| Original vs. TB | openhands@c677f728 | claude-4-sonnet | 500 | 66.8 | 67.0 |
| TB vs. Harbor | terminus-2@0bad1ff4 | Claude-Sonnet-4-5-20250929 | 500 | 70.0 | 68.6 |
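The per-comparison gaps in the table can be made explicit with a small script over the rows (row labels abbreviated here):

```python
# Parity rows from the table above: (comparison, reference score, adapter score).
rows = [
    ("Original vs. TB, codex@0.2.0 + o4-mini",          53.11, 53.11),
    ("Original vs. TB, terminus-v1.5-xml + opus-4",     66.4,  66.4),
    ("Original vs. TB, openhands@c677f728 + sonnet-4",  66.8,  67.0),
    ("TB vs. Harbor, terminus-2@0bad1ff4 + sonnet-4.5", 70.0,  68.6),
]

# Signed delta in percentage points, adapter minus reference.
deltas = [round(new - ref, 2) for _, ref, new in rows]
for (name, _, _), d in zip(rows, deltas):
    print(f"{name}: {d:+.2f} pts")
```

Every comparison lands within 1.4 percentage points of its reference run.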
The adapter also includes a full-run comparison against a Docent SWE-bench Verified leaderboard run for mini-swe-agent==2.1.0 with openai/gpt-5-mini: official 0.563 +/- 0.000 vs. Harbor 0.545 +/- 0.007 over 499 comparable tasks and 3 Harbor runs. The excluded task was scikit-learn__scikit-learn-14710, a recurring Daytona outlier. Uploaded record: https://huggingface.co/datasets/harborframework/parity-experiments/discussions/149.
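For illustration, a +/- figure like the one above can be reproduced from per-run resolved rates. The three run values below are hypothetical (the record reports only the aggregate), and treating the uncertainty as the sample standard deviation across runs is an assumption:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical per-run resolved rates for three benchmark runs;
# not the actual Harbor run values.
runs = [0.545, 0.552, 0.538]

m = mean(runs)        # aggregate resolved rate
sd = stdev(runs)      # sample standard deviation across runs (assumed convention)
print(f"{m:.3f} +/- {sd:.3f}")
```

With a single run (as in the official leaderboard entry) there is no spread to report, hence the 0.563 +/- 0.000.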
Citation
@inproceedings{
jimenez2024swebench,
title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=VTF8yNQM66}
}