Terminal-Bench
@terminal-bench
Tasks
Evaluates the ability to implement an adaptive rejection sampler in R with proper statistical algorithms, modular design, input validation, log-concavity checking, and formal testing.
applied-statistics, adaptive-rejection-sampling, Bayesian-inference, simulation, scientific-computing
jvpoulos
Evaluates the ability to recover a Bayesian Network DAG structure from data, perform causal interventions, and sample from the modified network.
bayesian-network, stats, scientific-computing
Gabriel Dreiman
Evaluates the ability to bypass an HTML sanitization filter by crafting malicious HTML that still triggers JavaScript execution after filtering.
security
Nicholas Carlini
Evaluates the ability to compile and install a Python package with Cython extensions from source while fixing NumPy 2.x compatibility issues.
coding, dependency, compilation, debugging
Zizhao Chen
Evaluates the ability to build pMARS from Debian source packages with X11 support disabled, requiring Makefile modification and headless compilation.
build-tools, compilation, debian, gaming, pmars, corewars, software-engineering
Jeong Shin
Evaluates the ability to locate, download, patch, and compile the legacy POV-Ray 2.2 ray tracer from 1990s source archives on a modern system.
build-tools, compilation, graphics, ray-tracing, legacy-software, research, software-engineering
Jeong Shin