Terminal-Bench

@terminal-bench

Tasks

terminal-bench/adaptive-rejection-sampler

Evaluates the ability to implement an adaptive rejection sampler in R with proper statistical algorithms, modular design, input validation, log-concavity checking, and formal testing.

applied-statistics, adaptive-rejection-sampling, Bayesian-inference, simulation, scientific-computing

jvpoulos

terminal-bench/bn-fit-modify

latest, rev. 1

Evaluates the ability to recover a Bayesian Network DAG structure from data, perform causal interventions, and sample from the modified network.

bayesian-network, stats, scientific-computing

Gabriel Dreiman

terminal-bench/break-filter-js-from-html

latest, rev. 1

Evaluates the agent's ability to bypass an HTML sanitization filter by crafting malicious HTML that triggers JavaScript execution after filtering.

security

Nicholas Carlini

terminal-bench/build-cython-ext

latest, rev. 1

Evaluates the ability to compile and install a Python package with Cython extensions from source while fixing NumPy 2.x compatibility issues.

coding, dependency, compilation, debugging

Zizhao Chen

terminal-bench/build-pmars

latest, rev. 1

Evaluates the ability to build pMARS from Debian source packages with X11 support disabled, requiring Makefile modification and headless compilation.

build-tools, compilation, debian, gaming, pmars, corewars, software-engineering

Jeong Shin

terminal-bench/build-pov-ray

latest, rev. 1

Evaluates the ability to locate, download, patch, and compile the legacy POV-Ray 2.2 ray tracer from 1990s source archives on a modern system.

build-tools, compilation, graphics, ray-tracing, legacy-software, research, software-engineering

Jeong Shin
