Datasets
| Dataset | Tasks |
|---|---|
| terminal-bench/terminal-bench-2 | 89 |
| gabeorlanski/slopcodebench | 36 |
| kgmon/deepsearchqa | 900 |
| actava-ai/chi-bench | 78 |
| datacurve/deep-swe | 113 |
| factory-ai/legacy-bench | 10 |
| nvats/codeskills-bench | 23 |
| aarr/aarri-bench | 82 |
| abundant/swe-gen-cpp | 999 |
| abundant/swe-gen-go | 1,000 |
| abundant/swe-gen-java | 1,000 |
| abundant/swe-gen-js | 1,000 |
| abundant/swe-gen-rust | 1,000 |
| adyen/dabstep | 450 |
| agentic-labs/erp-bench | 300 |
| agentscope-ai/pawbench | 150 |
| ai-forever/harness-bench-fast | 231 |
| aider/aider-polyglot | 225 |
| aime/aime | 60 |
| algotune/algotune | 154 |
| apple/mmau | 1,000 |
| arcprize/arc-agi-2 | 167 |
| bauerjustin/tb3-preview-v2 | 111 |
| bauerjustin/terminal-bench-3-test | 64 |
| bigcode/bigcodebench-hard-complete | 145 |
| bigcode/humanevalfix | 164 |
| binary-audit/binary-audit | 46 |
| cais/swebenchpro | 731 |
| camel-ai/seta-env | 1,376 |
| cmu/refav | 1,500 |
204 datasets