SWE-bench Verified

SWE-bench Verified is a software-engineering dataset for evaluating whether AI agents can resolve real GitHub issues. Each task gives an agent an issue report and a repository checkout at the relevant base commit; the agent edits the code, and the solution is graded by running issue-specific tests.
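The grading rule described above can be sketched as follows. This is a minimal illustration, assuming the upstream SWE-bench convention of two per-task test lists (FAIL_TO_PASS for tests the fix must make pass, PASS_TO_PASS for tests that must keep passing). The `Task` class, the status strings, and the example task are hypothetical and are not taken from the Harbor adapter.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical container for the grading-relevant fields of one task."""
    instance_id: str
    fail_to_pass: list[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: list[str]  # tests that pass before the fix and must still pass


def is_resolved(task: Task, test_status: dict[str, str]) -> bool:
    """Return True only if every required test is reported as PASSED.

    A test missing from test_status counts as not passed.
    """
    required = task.fail_to_pass + task.pass_to_pass
    return all(test_status.get(name) == "PASSED" for name in required)


# Illustrative task with made-up identifiers.
task = Task(
    instance_id="example__repo-1",
    fail_to_pass=["test_fix"],
    pass_to_pass=["test_existing"],
)

print(is_resolved(task, {"test_fix": "PASSED", "test_existing": "PASSED"}))
print(is_resolved(task, {"test_fix": "PASSED", "test_existing": "FAILED"}))
```

Treating a missing test as a failure is a deliberately strict choice in this sketch: a patch that breaks test collection should not count as a resolution.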

SWE-bench contains 2,294 tasks drawn from real GitHub issues and pull requests across 12 popular Python repositories. SWE-bench Verified is a 500-task subset, created with OpenAI, in which human annotators confirmed that each task has a clear issue statement, correct tests, and a feasible solution.

Parity

Harbor parity records are stored in adapters/swebench/parity_experiment.json. Parity was validated historically in two steps: first the original SWE-bench Verified harness was compared to the Terminal-Bench (TB) adapter, then the legacy Terminal-Bench adapter was compared to the Harbor adapter.

Comparison	Agent	Model	Tasks	Reference	Harbor/TB
Original vs. TB	`codex@0.2.0`	`o4-mini`	500	53.11	53.11
Original vs. TB	`terminus-v1.5-xml`	`claude-opus-4-20250514`	500	66.4	66.4
Original vs. TB	`openhands@c677f728`	`claude-4-sonnet`	500	66.8	67.0
TB vs. Harbor	`terminus-2@0bad1ff4`	`Claude-Sonnet-4-5-20250929`	500	70.0	68.6
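As a reading aid, the gap between the two score columns can be computed directly from the rows above. The snippet below is a small sketch that copies the table values by hand; the `parity_gaps` helper is hypothetical and does not read parity_experiment.json, whose schema is not described here.

```python
# (comparison, agent, reference score, Harbor/TB score), copied from the table above.
ROWS = [
    ("Original vs. TB", "codex@0.2.0", 53.11, 53.11),
    ("Original vs. TB", "terminus-v1.5-xml", 66.4, 66.4),
    ("Original vs. TB", "openhands@c677f728", 66.8, 67.0),
    ("TB vs. Harbor", "terminus-2@0bad1ff4", 70.0, 68.6),
]


def parity_gaps(rows):
    """Return (agent, adapter score minus reference score), rounded to 2 decimals."""
    return [(agent, round(adapter - reference, 2)) for _, agent, reference, adapter in rows]


for agent, gap in parity_gaps(ROWS):
    print(f"{agent}: {gap:+.2f}")
```

A positive gap means the adapter scored higher than the reference; a negative gap means it scored lower.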

The adapter also includes a full-run comparison against a Docent SWE-bench Verified leaderboard run for mini-swe-agent==2.1.0 with openai/gpt-5-mini. The official run scored 0.563 +/- 0.000 and Harbor scored 0.545 +/- 0.007, measured over 499 comparable tasks and 3 Harbor runs. The one excluded task was scikit-learn__scikit-learn-14710, a recurring Daytona outlier. Uploaded record: https://huggingface.co/datasets/harborframework/parity-experiments/discussions/149.
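The aggregation behind a figure such as "0.545 +/- 0.007 over 3 runs" can be sketched as below. This is an assumption-laden illustration: the text does not say whether the +/- value is a standard deviation or a standard error, so the sketch uses the sample standard deviation. The per-task outcomes and task names are made up for the example, and only the final two scores are taken from the paragraph above.

```python
from statistics import mean, stdev


def resolved_rate(outcomes: dict[str, bool], excluded: set[str]) -> float:
    """Fraction of resolved tasks in one run, ignoring excluded task IDs."""
    kept = [ok for task_id, ok in outcomes.items() if task_id not in excluded]
    return sum(kept) / len(kept)


def summarize(runs: list[dict[str, bool]], excluded: set[str]) -> tuple[float, float]:
    """Mean and sample standard deviation of the per-run resolved rates."""
    rates = [resolved_rate(run, excluded) for run in runs]
    return mean(rates), stdev(rates)


# Hypothetical outcomes for three runs over four made-up tasks.
RUNS = [
    {"task-a": True, "task-b": False, "task-c": True, "task-outlier": False},
    {"task-a": True, "task-b": True, "task-c": True, "task-outlier": True},
    {"task-a": False, "task-b": False, "task-c": True, "task-outlier": False},
]
EXCLUDED = {"task-outlier"}  # stands in for an excluded outlier task

avg, spread = summarize(RUNS, EXCLUDED)
print(f"{avg:.3f} +/- {spread:.3f}")

# Gap between the two reported scores from the text above.
REPORTED_GAP = round(0.563 - 0.545, 3)
print(REPORTED_GAP)
```

Excluding a task from every run before averaging keeps the denominator identical across the compared systems, which is what makes the two scores comparable.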

Citation

@inproceedings{
    jimenez2024swebench,
    title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
    author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=VTF8yNQM66}
}
