swe-bench/swe-bench-verified

harbor run -d swe-bench/swe-bench-verified

SWE-bench Verified

SWE-bench Verified is a software-engineering dataset for evaluating whether AI agents can resolve real GitHub issues. Each task gives an agent an issue report and a repository checkout at the relevant base commit; the agent edits the code, and the solution is graded by running issue-specific tests.
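The grading step can be sketched roughly as follows. This is a minimal illustration, not the Harbor or SWE-bench harness code. It assumes the convention from the SWE-bench dataset in which each task lists `FAIL_TO_PASS` tests (expected to go from failing to passing) and `PASS_TO_PASS` tests (expected to keep passing). The function name, the status strings, and the test names are hypothetical.

```python
from typing import Dict, List


def is_resolved(
    test_status: Dict[str, str],
    fail_to_pass: List[str],
    pass_to_pass: List[str],
) -> bool:
    """Return True only if every required test passed after the agent's edit.

    test_status maps a test name to a status string; "PASSED" is an assumed
    label, and a test missing from the map counts as not passed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_status.get(name) == "PASSED" for name in required)


# Hypothetical test names, for illustration only.
status = {
    "tests/test_bug.py::test_issue_fixed": "PASSED",
    "tests/test_core.py::test_existing_behavior": "PASSED",
}
resolved = is_resolved(
    status,
    fail_to_pass=["tests/test_bug.py::test_issue_fixed"],
    pass_to_pass=["tests/test_core.py::test_existing_behavior"],
)
print(resolved)
```

Treating a missing test as a failure is a deliberate choice in this sketch: a patch that breaks test collection should not count as a resolved task.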

SWE-bench contains 2,294 tasks drawn from real GitHub issues and pull requests across 12 popular Python repositories. SWE-bench Verified is a 500-task subset, created with OpenAI, in which human annotators confirmed that each task has a clear issue statement, correct tests, and a feasible solution.

Parity

Harbor parity records are stored in adapters/swebench/parity_experiment.json. Historical validation was done in two steps: first the original SWE-bench Verified was compared to the Terminal-Bench (TB) adapter, then the legacy Terminal-Bench adapter was compared to the Harbor adapter.

| Comparison | Agent | Model | Tasks | Reference | Harbor/TB |
|---|---|---|---|---|---|
| Original vs. TB | codex@0.2.0 | o4-mini | 500 | 53.11 | 53.11 |
| Original vs. TB | terminus-v1.5-xml | claude-opus-4-20250514 | 500 | 66.4 | 66.4 |
| Original vs. TB | openhands@c677f728 | claude-4-sonnet | 500 | 66.8 | 67.0 |
| TB vs. Harbor | terminus-2@0bad1ff4 | Claude-Sonnet-4-5-20250929 | 500 | 70.0 | 68.6 |
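To read the parity rows at a glance, it can help to compute the per-row difference between the two score columns. The sketch below copies the four rows from the table above. It assumes the scores are resolved rates in percent, and the `parity_deltas` helper is hypothetical rather than part of the adapter.

```python
from typing import List, Tuple

# (comparison, agent, reference score, Harbor/TB score), copied from the table.
ROWS: List[Tuple[str, str, float, float]] = [
    ("Original vs. TB", "codex@0.2.0", 53.11, 53.11),
    ("Original vs. TB", "terminus-v1.5-xml", 66.4, 66.4),
    ("Original vs. TB", "openhands@c677f728", 66.8, 67.0),
    ("TB vs. Harbor", "terminus-2@0bad1ff4", 70.0, 68.6),
]


def parity_deltas(rows: List[Tuple[str, str, float, float]]) -> List[Tuple[str, float]]:
    """Return (agent, candidate minus reference) per row, rounded to 2 decimals."""
    return [
        (agent, round(candidate - reference, 2))
        for _, agent, reference, candidate in rows
    ]


deltas = parity_deltas(ROWS)
largest_gap = max(abs(delta) for _, delta in deltas)

for agent, delta in deltas:
    print(f"{agent}: {delta:+.2f}")
print(f"largest absolute gap: {largest_gap:.2f}")
```

Under the percent assumption, the largest gap (1.4 points on 500 tasks) corresponds to 7 tasks.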

The adapter also includes a full-run comparison against a Docent SWE-bench Verified leaderboard run for mini-swe-agent==2.1.0 with openai/gpt-5-mini: official 0.563 +/- 0.000 vs. Harbor 0.545 +/- 0.007 over 499 comparable tasks and 3 Harbor runs. The excluded task was scikit-learn__scikit-learn-14710, a recurring Daytona outlier. Uploaded record: https://huggingface.co/datasets/harborframework/parity-experiments/discussions/149.
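As a rough aid for interpreting these numbers, the gap between the two reported scores can be converted into an approximate task count. The sketch uses only the figures quoted above; the `score_gap` helper is hypothetical, and the text does not state whether the +/- values are standard deviations or standard errors, so the code does not interpret them.

```python
from typing import Tuple


def score_gap(reference: float, candidate: float, n_tasks: int) -> Tuple[float, float]:
    """Return the score gap as a fraction and as an approximate number of tasks."""
    gap = reference - candidate
    return gap, gap * n_tasks


# Figures quoted in the paragraph above.
gap, gap_in_tasks = score_gap(reference=0.563, candidate=0.545, n_tasks=499)

print(f"gap: {gap:.3f}")
print(f"approximate tasks: {gap_in_tasks:.1f}")
```

With these inputs the gap is 0.018, or roughly 9 of the 499 comparable tasks.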

Citation

@inproceedings{
    jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}