Datasets
DevOps-Gym benchmark adapted to Harbor format - 729 tasks across 5 categories: Build, Monitoring, Issue Resolving, Test Generation, and End-to-End
rebench-v2-test
Legacy-Bench public sample tasks for evaluating AI coding agents on legacy software engineering.
Benchmark suite for evaluating AI agents on RuneScape gameplay tasks
SWE-Atlas - Codebase QnA is a benchmark of deep codebase comprehension and QnA problems for coding agents. Check out https://github.com/scaleapi/SWE-Atlas/ for instructions on running it.
WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch (101 test tasks)
SWE-Atlas - Test Writing is a benchmark of comprehensive test-writing problems for coding agents. Check out https://github.com/scaleapi/SWE-Atlas/ for instructions on running it.
SWE-bench Pro with anti-exploitation measures (git history isolation + GitHub network blocking; a sketch of both measures follows this list). 731 tasks, Python/JS/TS/Go.
LiteCoder Terminal RL environments
o11y-bench: an open agentic observability benchmark. Measures how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.
A small set of real-life programming tasks: bug fixes, merge-conflict resolution, dependency cleanup, API migration, and performance regressions across compact Python repositories.
SlopCodeBench multi-checkpoint coding benchmark tasks converted for Harbor.
Frontier-CS competitive programming benchmark: 172 open-ended algorithmic problems with partial scoring via go-judge (see the scoring sketch after this list).
tau3-bench is a simulation framework for evaluating customer service agents across multiple domains. It includes 375 tasks spanning four domains: airline, retail, telecom, and banking knowledge.
RExBench - 2 tasks (cogs, othello) evaluating AI agents' ability to extend existing AI research by implementing experiment extensions. Tasks require an A100 40GB GPU and may take multiple hours (cogs: ~2.5h oracle; othello: ~45min oracle). Parity (openhands@1.4.0 / anthropic/claude-sonnet-4-5, 3 runs): cogs 100% Harbor vs 66.67% original; othello 100% Harbor vs 100% original. Adapter by Nicholas Edwards (nicholas.edwards@univie.ac.at), one of the original RExBench authors. Acknowledgements: we thank 2077 AI for generously funding the API credits used for the parity experiments.
The Amazing Agent Race (AAR): 1400 multi-step scavenger-hunt puzzles for evaluating LLM agents on tool use, web navigation, and arithmetic reasoning. Includes linear (800) and DAG (600) variants across 4 difficulty levels (a dependency-structure sketch follows this list).
RefAV benchmark
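For the SWE-bench Pro entry, the two anti-exploitation measures are easy to picture. Below is a minimal Python sketch, assuming a Unix-like task container; the function names and host list are illustrative, not the benchmark's actual code:

```python
import os
import shutil
import subprocess

def isolate_git_history(repo_dir: str, base_commit: str) -> None:
    # Re-create the repository at the task's base commit with no prior
    # history, so an agent cannot find the fix commit via `git log`.
    subprocess.run(["git", "checkout", base_commit], cwd=repo_dir, check=True)
    shutil.rmtree(os.path.join(repo_dir, ".git"))
    subprocess.run(["git", "init"], cwd=repo_dir, check=True)
    subprocess.run(["git", "add", "-A"], cwd=repo_dir, check=True)
    subprocess.run(["git", "-c", "user.email=task@local", "-c", "user.name=task",
                    "commit", "-m", "task snapshot"], cwd=repo_dir, check=True)

def block_github(hosts_file: str = "/etc/hosts") -> None:
    # Null-route GitHub hosts inside the container so the agent cannot
    # fetch the upstream repository (and its merged fix) over the network.
    with open(hosts_file, "a") as f:
        for host in ("github.com", "api.github.com", "raw.githubusercontent.com"):
            f.write(f"0.0.0.0 {host}\n")
```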
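For Frontier-CS, "partial scoring" means a submission earns credit in proportion to how much of a problem's test suite it passes, with go-judge as the sandboxed execution backend. The following is a hedged sketch of one plausible aggregation; the weighted-test-group format is an assumption, not the benchmark's actual scorer:

```python
def partial_score(passed: dict[str, bool], weights: dict[str, float]) -> float:
    """Fraction of total weight earned by passing test groups.

    `passed` maps test-group name -> whether the submission passed it;
    `weights` maps test-group name -> its point weight. Both are
    illustrative assumptions about the scoring format.
    """
    total = sum(weights.values())
    earned = sum(w for group, w in weights.items() if passed.get(group, False))
    return earned / total if total else 0.0

# Example: passing 2 of 3 groups worth 20/30/50 points yields 0.5.
print(partial_score({"small": True, "medium": True, "large": False},
                    {"small": 20, "medium": 30, "large": 50}))
```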
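For The Amazing Agent Race, the linear/DAG split describes the dependency structure between puzzle steps: a simple chain in the linear variant, multiple prerequisites per step in the DAG variant. An illustrative sketch using Python's standard graphlib (the clue names are invented, not AAR's task format):

```python
from graphlib import TopologicalSorter

# Linear variant: each clue depends only on the previous one.
linear = {"clue2": {"clue1"}, "clue3": {"clue2"}}

# DAG variant: a clue may require several earlier clues at once.
dag = {"clue3": {"clue1", "clue2"}, "clue4": {"clue3"}}

print(list(TopologicalSorter(linear).static_order()))  # ['clue1', 'clue2', 'clue3']
print(list(TopologicalSorter(dag).static_order()))     # one valid solve order
```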