Datasets
| Dataset | Tasks |
|---|---|
| dbt-labs/ade-bench | 48 |
| featurebench/featurebench-lite-modal | 30 |
| vals/financeagent | 50 |
| kumo/kumo-1 | 5,300 |
| crustbench/crustbench | 100 |
| gaia/gaia | 165 |
| evoeval/evoeval | 100 |
| ineqmath/ineqmath | 100 |
| kumo/kumo-parity | 212 |
| binary-audit/binary-audit | 46 |
| aider/aider-polyglot | 225 |
| arcprize/arc-agi-2 | 167 |
| tencent/autocodebench | 200 |
| futurehouse/labbench | 181 |
| scienceagentbench/scienceagentbench | 102 |
| stanford/medagentbench | 300 |
| terminal-bench/terminal-bench-2 | 89 |
| gorilla/bfcl | 3,641 |
| aime/aime | 60 |
| replicationbench/replicationbench | 90 |
| gorilla/bfcl_parity | 123 |
| futurehouse/bixbench-cli | 205 |
| theagentcompany/theagentcompany | 174 |
| quesma/otel-bench | 26 |
| reasoning-gym/reasoning-gym-hard | 288 |
| lawbench/lawbench | 1,000 |
| camel-ai/seta-env | 1,376 |
| strongreject/strongreject | 150 |
| kumo/kumo-easy | 5,050 |
| reasoning-gym/reasoning-gym-easy | 288 |

193 datasets