Datasets

Dataset                               Tasks
dbt-labs/ade-bench                    48
featurebench/featurebench-lite-modal  30
vals/financeagent                     50
kumo/kumo-1                           5,300
crustbench/crustbench                 100
gaia/gaia                             165
evoeval/evoeval                       100
ineqmath/ineqmath                     100
kumo/kumo-parity                      212
binary-audit/binary-audit             46
aider/aider-polyglot                  225
arcprize/arc-agi-2                    167
tencent/autocodebench                 200
futurehouse/labbench                  181
scienceagentbench/scienceagentbench   102
stanford/medagentbench                300
terminal-bench/terminal-bench-2       89
gorilla/bfcl                          3,641
aime/aime                             60
replicationbench/replicationbench     90
gorilla/bfcl_parity                   123
futurehouse/bixbench-cli              205
theagentcompany/theagentcompany       174
quesma/otel-bench                     26
reasoning-gym/reasoning-gym-hard      288
lawbench/lawbench                     1,000
camel-ai/seta-env                     1,376
strongreject/strongreject             150
kumo/kumo-easy                        5,050
reasoning-gym/reasoning-gym-easy      288