Datasets
featurebench/featurebench-modal
latest · rev. 1
200 tasks
bigcode/humanevalfix
latest · rev. 1
164 tasks
dbt-labs/ade-bench
latest · rev. 1
48 tasks
featurebench/featurebench-lite-modal
latest · rev. 1
30 tasks
vals/financeagent
latest · rev. 1
50 tasks
kumo/kumo-1
latest · rev. 1
5,300 tasks
crustbench/crustbench
latest · rev. 1
100 tasks
gaia/gaia
latest · rev. 1
165 tasks
evoeval/evoeval
latest · rev. 1
100 tasks
ineqmath/ineqmath
latest · rev. 1
100 tasks
kumo/kumo-parity
latest · rev. 1
212 tasks
binary-audit/binary-audit
latest · rev. 1
46 tasks
aider/aider-polyglot
latest · rev. 1
225 tasks
arcprize/arc-agi-2
latest · rev. 1
167 tasks
tencent/autocodebench
latest · rev. 1
200 tasks
futurehouse/labbench
latest · rev. 1
181 tasks
scienceagentbench/scienceagentbench
latest · rev. 1
ScienceAgentBench full benchmark adapted to Harbor with Modal GPU support for GPU-backed tasks.
102 tasks
Ziru Chen, Shijie Chen, Huan Sun, Allen Hart
stanford/medagentbench
latest · rev. 1
300 tasks
terminal-bench/terminal-bench-2
latest · rev. 1
89 tasks
gorilla/bfcl
latest · rev. 1
3,641 tasks
aime/aime
latest · rev. 1
60 tasks
replicationbench/replicationbench
latest · rev. 1
90 tasks
gorilla/bfcl_parity
latest · rev. 1
123 tasks
futurehouse/bixbench-cli
latest · rev. 1
205 tasks
theagentcompany/theagentcompany
latest · v1.0 · rev. 3
TheAgentCompany: 174 professional tasks across GitLab, Plane, OwnCloud, and RocketChat services, evaluating LLM agents on real-world professional work
174 tasks
Hanwen Xing
quesma/otel-bench
latest · rev. 1
26 tasks
reasoning-gym/reasoning-gym-hard
latest · rev. 1
288 tasks
lawbench/lawbench
latest · rev. 1
1,000 tasks
camel-ai/seta-env
latest · rev. 1
1,376 tasks
strongreject/strongreject
latest · rev. 1
150 tasks
91 datasets