Datasets

MichaelY310/devopsgym
benchmark · devops · latest · monitoring · system · rev. 1

DevOps-Gym benchmark adapted to Harbor format - 729 tasks across 5 categories: Build, Monitoring, Issue Resolving, Test Generation, and End-to-End

728 tasks
pgcodellm/rebench-v2-test
latest · rev. 5

rebench-v2-test

20 tasks · Arthur Leung
factory-ai/legacy-bench
latest · v0.1 · rev. 1

Legacy-Bench public sample tasks for evaluating AI coding agents on legacy software engineering tasks.

10 tasks · Factory AI
maxbittker/runebench
latest · v1.0 · rev. 2

Benchmark suite for evaluating AI agents on RuneScape gameplay tasks

32 tasks
scale-ai/swe-atlas-qna
latest · rev. 1

SWE-Atlas - Codebase QnA is a benchmark of deep codebase comprehension and QnA problems for coding agents. Check out https://github.com/scaleapi/SWE-Atlas/ for instructions on running it.

124 tasks · Mohit Raghavendra
webgen-bench/webgen-bench
latest · rev. 1

WebGen-Bench: Evaluating LLMs on Generating Interactive and Functional Websites from Scratch (101 test tasks)

101 tasks · Chengrui Ma
cookbook/test
latest · rev. 1
1 task
scale-ai/swe-atlas-tw
latest · rev. 2

SWE-Atlas - Test Writing is a benchmark of comprehensive test-writing problems for coding agents. Check out https://github.com/scaleapi/SWE-Atlas/ for instructions on running it.

90 tasks · Mohit Raghavendra
cais/swebenchpro
latest · rev. 2

SWE-bench Pro with anti-exploitation (git history isolation + GitHub network blocking). 731 tasks, Python/JS/TS/Go.

731 tasks
futurehouse/bixbench
latest · rev. 1
205 tasks
quesma/compilebench
latest · rev. 1
15 tasks
bigcode/bigcodebench-hard-complete
latest · rev. 1
145 tasks
LiteCoder/LiteCoder-rl
latest · rev. 1

LiteCoder Terminal RL environments

602 tasks · LiteCoder
cognition/cognition-evals-test
latest · rev. 1
10 tasks
grafana/o11y-bench
latest · rev. 1

o11y-bench: an open agentic observability benchmark. Measures how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.

63 tasks · Grafana
nvats/codeskills-bench
latest · v0.1 · rev. 1

A small set of real-life programming tasks: bug fixes, merge-conflict resolution, dependency cleanup, API migration, and performance regressions across compact Python repositories.

23 tasks · Naman Vats
gabeorlanski/slopcodebench
latest · rev. 2

SlopCodeBench multi-checkpoint coding benchmark tasks converted for Harbor.

36 tasks · Gabriel Orlanski
yanagiorigami/frontier-cs
latest · rev. 1

Frontier-CS competitive programming benchmark: 172 open-ended algorithmic problems with partial scoring via go-judge.

172 tasks · YanagiOrigami
sierra-research/tau3-bench
latest · rev. 1

tau3-bench is a simulation framework for evaluating customer service agents across multiple domains. It includes 375 tasks spanning four domains: airline, retail, telecom, and banking knowledge.

375 tasks · Victor Barres, Honghua Dong, Soham Ray, Xujie Si, Karthik Narasimhan
shogunpurple/9CBCC967-BDAF-43D0-9485-52EBBE4A5094
latest · rev. 1
157 tasks
rexbench/rexbench
latest · rev. 1

RExBench - 2 tasks (cogs, othello) evaluating AI agents' ability to extend existing AI research through research experiment implementation. Tasks require an A100 40GB GPU and may take multiple hours (cogs: ~2.5h oracle; othello: ~45min oracle). Parity: openhands@1.4.0 / anthropic/claude-sonnet-4-5, 3 runs — cogs: 100% Harbor vs 66.67% original; othello: 100% Harbor vs 100% original. Adapter by Nicholas Edwards (nicholas.edwards@univie.ac.at), one of the original RExBench authors. Acknowledgements: We thank 2077 AI for generously funding API credits to support running parity experiments.

2 tasks · Nicholas Edwards
minnesotanlp/aar
1.0 · latest · rev. 1

The Amazing Agent Race (AAR): 1400 multi-step scavenger-hunt puzzles for evaluating LLM agents on tool use, web navigation, and arithmetic reasoning. Includes linear (800) and DAG (600) variants across 4 difficulty levels.

1,400 tasks · Zae Myung Kim
cmu/refav
latest · rev. 2

RefAV benchmark

1,500 tasks · Cainan Davidson, Deva Ramanan, Neehar Peri
adyen/dabstep
latest · rev. 1
450 tasks
xlang/ds-1000
latest · rev. 1
1,000 tasks
apple/mmau
latest · rev. 1
1,000 tasks
openthoughts/openthoughts-tblite
latest · rev. 1
100 tasks
openai/mmmlu
latest · rev. 1
150 tasks
meta/mlgym-bench
latest · rev. 1
12 tasks
featurebench/featurebench-lite
latest · rev. 1
30 tasks

91 datasets