gabeorlanski/slopcodebench
SlopCodeBench multi-checkpoint coding benchmark tasks converted for Harbor.
SlopCodeBench Harbor Dataset
gabeorlanski/slopcodebench is the Harbor conversion of the
SlopCodeBench problem set.
SlopCodeBench evaluates coding agents under iterative specification refinement:
an agent starts with checkpoint 1, then repeatedly extends its own solution as
new checkpoint specifications are revealed. Later checkpoints keep earlier
requirements in force, so the benchmark measures both feature implementation and
regression resistance over time.
This dataset contains 36 multi-step Harbor tasks converted from the
SprocketLab/scb-problems
repository.
Links
- Paper: SlopCodeBench: Measuring Code Erosion Under Iterative Specification Refinement
- Website: scbench.ai
- Runner: SprocketLab/slop-code-bench
- Problems: SprocketLab/scb-problems
Running the dataset
uvx harbor run -d "gabeorlanski/slopcodebench@latest" -a "<agent>" -m "<model>"
You can also run an individual task directly, for example:
uvx harbor run -t "gabeorlanski/file_backup@latest" -a "<agent>" -m "<model>"
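To script several runs, the two commands above can be assembled programmatically. The sketch below only builds argument lists from the flags shown in this README (`-d`, `-t`, `-a`, `-m`); the helper name `harbor_command` and the placeholder agent and model values are illustrative assumptions, not part of Harbor.

```python
import shlex
from typing import List, Optional


def harbor_command(agent: str, model: str, *, dataset: Optional[str] = None,
                   task: Optional[str] = None) -> List[str]:
    """Build a `uvx harbor run` argv for either a dataset (-d) or a task (-t).

    Hypothetical helper: it only mirrors the flags shown in this README.
    """
    if (dataset is None) == (task is None):
        raise ValueError("pass exactly one of dataset= or task=")
    target = ["-d", dataset] if dataset is not None else ["-t", task]
    return ["uvx", "harbor", "run", *target, "-a", agent, "-m", model]


# Placeholder agent/model names; substitute real values before running.
cmd = harbor_command("<agent>", "<model>", task="gabeorlanski/file_backup@latest")
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once real agent and model names are filled in.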
Evaluation shape
Each task is a Harbor multi-step task. Every step corresponds to one SlopCodeBench checkpoint and includes:
- a checkpoint-specific instruction/specification;
- the verifier tests for that checkpoint, with prior checkpoint tests included where the source problem requires regression coverage;
- a reference solution payload used for validation/oracle runs.
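The step structure above can be pictured as a small data model. This is a sketch only: the `Step` and `Task` classes, their field names, and the `cumulative` flag are hypothetical and do not reflect Harbor's actual task schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step:
    """One checkpoint (hypothetical field names, not Harbor's schema)."""
    instruction: str            # checkpoint-specific specification
    tests: List[str]            # verifier tests introduced at this checkpoint
    reference_solution: str     # payload used for validation/oracle runs


@dataclass
class Task:
    name: str
    steps: List[Step] = field(default_factory=list)

    def tests_for(self, checkpoint: int, cumulative: bool = True) -> List[str]:
        """Tests to run at a 1-based checkpoint.

        With cumulative=True, earlier checkpoints' tests are included,
        modelling regression coverage.
        """
        if not 1 <= checkpoint <= len(self.steps):
            raise IndexError(f"checkpoint {checkpoint} out of range")
        start = 0 if cumulative else checkpoint - 1
        return [t for step in self.steps[start:checkpoint] for t in step.tests]


task = Task("example", [
    Step("spec 1", ["test_a"], "solution_1"),
    Step("spec 2", ["test_b", "test_c"], "solution_2"),
])
print(task.tests_for(2))                    # ['test_a', 'test_b', 'test_c']
print(task.tests_for(2, cumulative=False))  # ['test_b', 'test_c']
```

Whether a given checkpoint runs the cumulative or the checkpoint-only test set depends on the source problem, which is why the sketch exposes it as a flag.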
Task-level rewards are reduced across steps with Harbor's multi-step mean
strategy. The dataset includes metric.py, which aggregates trial rewards into:
- core_pass_rate_mean
- isolated_pass_rate_mean
- strict_pass_rate_mean
- verbosity_mean
- erosion_mean
- verbosity_increase_rate
- erosion_increase_rate
- trial_count
- missing_trial_count
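As a rough illustration of the two aggregation levels described above, the sketch below averages per-step rewards into a task reward, then averages one metric across trials while counting trials that reported nothing. It is not the dataset's `metric.py`: the function names, the input shapes, and the handling of missing trials are assumptions, and the definitions of the individual metrics (for example, how erosion is measured) are not covered here.

```python
from statistics import fmean
from typing import Dict, List, Optional, Sequence


def task_reward(step_rewards: Sequence[float]) -> float:
    """Mean reduction over per-step (per-checkpoint) rewards."""
    if not step_rewards:
        raise ValueError("a task needs at least one step reward")
    return fmean(step_rewards)


def aggregate(trials: List[Optional[Dict[str, float]]], key: str) -> Dict[str, float]:
    """Average one metric over trials; None marks a trial with no result.

    Assumed input shape: one dict of metric values per trial.
    Assumption: trial_count includes missing trials, and an empty
    set of results yields a mean of 0.0.
    """
    present = [t[key] for t in trials if t is not None and key in t]
    return {
        f"{key}_mean": fmean(present) if present else 0.0,
        "trial_count": len(trials),
        "missing_trial_count": len(trials) - len(present),
    }


print(task_reward([1.0, 0.5, 0.0]))  # 0.5
trials = [{"core_pass_rate": 1.0}, None, {"core_pass_rate": 0.5}]
print(aggregate(trials, "core_pass_rate"))
```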
Tasks
| Task | Checkpoints | Difficulty | Category | Summary |
|---|---|---|---|---|
| gabeorlanski/cfgpipe | 6 | easy | configuration-management | CLI configuration resolver that reads a JSON schema, resolves typed parameters from prioritized sources (default, env, file, primary/secondary stores, args), supports nested groups, watch mode with structured change events, advanced types (duration, pattern, map, list, redacted), and store prefix composition. |
| gabeorlanski/circuit_eval | 8 | medium | simulation | CLI tool for digital circuit evaluation and optimization. Parses scalar and vector circuits in .circ, .json, and .bench formats. Evaluates circuits with 2-valued and 3-valued logic, generates truth tables, checks equivalence between circuits, and optimizes circuits with configurable passes (constant folding, CSE, DCE, etc.). |
| gabeorlanski/code_search | 5 | easy | developer-tools | Multi-language code search tool (inspired by ast-grep) that finds patterns and applies refactorings. Starts with regex search in Python, adds AST-based pattern matching with metavariables, then auto-fix with conflict resolution. Supports Python, JS, C++, Rust, Java, Go, and Haskell. |
| gabeorlanski/dag_execution | 3 | hard | dsl | Workflow orchestration system with a custom DSL for defining DAGs of tasks with dependencies and parameters. Includes a parser, execution engine, and JSONL logging. Adds caching with content-hashing and time-based strategies, then dynamic cache overrides per-task. |
| gabeorlanski/database_migration | 5 | medium | databases | SQLite migration CLI. Starts with basic DDL (create table, add/drop columns), adds data transformations and backfills, then foreign keys/indexes/check constraints with rollback support, and finally dependency management with topological sorting and cycle detection. |
| gabeorlanski/datagate | 7 | easy | web | HTTP service for ingesting tabular files from URLs/uploads into queryable datasets with pagination, sorting, filtering, export, caching, config-based runtime controls, access control, and optional enrichment metadata. |
| gabeorlanski/dynamic_buffer | 4 | hard | data-processing | Code generator that infers data transformations from input/output examples and emits working code in Python, JS, C++, or Rust. Handles filtering, column ops, stateful transforms (prefix sums, sliding windows), and window functions. Generated code streams data with fixed buffers. |
| gabeorlanski/dynamic_config_service_api | 4 | medium | web | REST API for versioned configs with inheritance and deep-merge. Adds JSON Schema validation and multi-format input (JSON/YAML/TOML), then approval workflows with drafts and quorum-based review, and finally OPA/Rego policy enforcement. |
| gabeorlanski/env_manager | 5 | easy | cli-tools | Declarative system provisioning planner CLI that validates module configs and generates deterministic execution plans across macOS and Linux. Includes package/app installs, file actions, preferences, dock configuration, language runtime environments, profile manifests, and standalone build script generation. |
| gabeorlanski/etl_pipeline | 5 | easy | data-processing | CLI that parses and executes ETL pipelines defined in JSON. Supports select, filter, map, rename, and limit operations with a custom expression language. Adds conditional branching, reusable sub-pipelines with parameters, and a library system for modular definitions. |
| gabeorlanski/eve_industry | 6 | hard | data-processing | EVE Online manufacturing planner that parses the SDE to compute recipes, material costs, and build times. Adds invention probability calculations, ME/TE efficiency with waste tracking, full build planning with job scheduling, and recursive build-all with automatic job splitting. |
| gabeorlanski/eve_jump_planner | 3 | medium | algorithms | Jump Freighter route planner for EVE Online. Calculates optimal routes with fuel costs, jump fatigue, and 3D spatial distances. Adds cloak-and-jump mechanics for extended range and handles high-sec destinations by finding nearby low-sec entry points. |
| gabeorlanski/eve_market_tools | 4 | hard | web | REST API for EVE Online market data. Ingests market orders, builds price books, and provides regional stats and hub comparisons. Adds reprocessing yield calculations, minimum-cost ore optimization across hubs, and profit-finding for arbitrage and hauling. |
| gabeorlanski/eve_route_planner | 3 | medium | algorithms | EVE Online route planner with realistic warp physics (acceleration/deceleration, gate locks). Adds cargo hauling with manifests and multi-trip planning, then contract optimization to select the most profitable jobs given time and ISK/jump constraints. |
| gabeorlanski/execution_server | 6 | easy | networking | HTTP server that executes shell commands and returns results. Adds file tracking with globs, multi-format output support, command chains with hooks, caching, persistent environments with concurrency modes, and job scheduling with queues, templates, and dependency graphs. |
| gabeorlanski/file_backup | 4 | easy | file-systems | Backup scheduler CLI that reads YAML configs to run scheduled backup jobs (daily/weekly/once) with glob exclusions. Supports full backups, tar packing, verification mode, and incremental backups using SHA-256 to skip unchanged files. |
| gabeorlanski/file_merger | 4 | medium | data-processing | CLI that merges data files (CSV, TSV, JSONL, Parquet) into sorted, partitioned CSV output. Handles schema alignment, compression, and external sorting. Adds Hive-style partitioning, file sharding, and nested type support. |
| gabeorlanski/file_query_tool | 5 | medium | data-processing | SQL engine for querying data files (CSV, Parquet, TSV, JSON). Supports joins, aggregations, filtering, glob patterns for sharded tables, window functions (ROW_NUMBER, RANK, etc.), CTEs, and subqueries. |
| gabeorlanski/forge | 8 | easy | web | Command-line resource broker inspired by OpenStack Cyborg concepts. Manages blueprints, allocations, units, modules, tags, revision gating, and admin workflows with strict JSON contracts. |
| gabeorlanski/l2m | 5 | easy | developer-tools | Command-line tool that converts LaTeX source files to KaTeX-compatible Markdown. |
| gabeorlanski/layered_config_synthesizer | 4 | medium | configuration-management | CLI that merges layered YAML/JSON configs for ML training with deterministic conflict resolution. Adds fragment expansion, env var interpolation, multi-run manifest processing, and JSON Schema validation. Outputs canonical JSON with sorted keys. |
| gabeorlanski/log_query | 5 | medium | developer-tools | NDJSON query engine with custom SQL-like syntax. Implements filtering, aggregations with GROUP BY, multi-source joins (CONFLATE), schema mapping via GLOSS labels, and subqueries with custom keywords (POCKET, BEHOLDS, etc.). |
| gabeorlanski/meshctl | 8 | hard | cli-tools | CLI tool for managing distributed cache mesh resources through declarative YAML specifications. Validates specs, applies defaults, persists state, and reports structured JSON to stdout. |
| gabeorlanski/metric_transform_lang | 5 | hard | dsl | Interpreter for MTL, a DSL for processing event streams. Handles CSV/TSV/Parquet input, aggregations, window functions (lag/lead), joins with temporal constraints, and resumable execution. Output is deterministic JSON. |
| gabeorlanski/migrate_configs | 5 | easy | cli-tools | Config migration CLI that applies transformation rules to JSON/YAML/TOML/INI files. Supports value replacement, key renaming, pattern matching, array filtering, config inheritance with cycle detection, file relocation, and pre-transformation validation. |
| gabeorlanski/mocked_http | 8 | hard | web | Multi-protocol mock server with YAML-defined behaviors and admin controls. |
| gabeorlanski/mvvault | 6 | medium | cli-tools | CLI tool for creating and maintaining local vaults that archive content metadata from an online media platform. Tracks timestamped field-level history across three catalog schema versions (v1: flat entries with UNIX-epoch keys, v2: categorized with ISO keys, v3: full format with removal tracking and annotations). Supports selective sync with media downloads, format-aware digest reports, a local HTTP viewer with version-adaptive routing and chart timestamp normalization, and annotation-triggered auto-migration. |
| gabeorlanski/pwd_manager | 5 | medium | cli-tools | Interactive CLI password manager with encrypted local vault storage, master key unlock flow, search/add/edit/delete operations, category management, clipboard integration, tab completion, import/export, and vault locking controls. |
| gabeorlanski/recli | 8 | hard | developer-tools | CLI framework with hierarchical command dispatch, argument validation, YAML configuration with inheritance, aliases, output formatting, file caching, SQLite persistence, container orchestration, version upgrade infrastructure, and system requirement checks. |
| gabeorlanski/rejector | 5 | hard | data-processing | A synthetic data generation pipeline that maximizes throughput against a rate-limited LLM API. Supports multiple task types, generation schemes, in-context learning setups, agentic tool-call loops, and multi-provider routing. |
| gabeorlanski/sheeteval | 7 | hard | developer-tools | CLI spreadsheet grader for .xlsx answer keys and student submissions. Supports typed literal checks, tolerance/alternates/penalties, formula grading, dependencies/fatal/concealed controls, minimum thresholds, check-mode scenario verification, and HTML report rendering. |
| gabeorlanski/sith | 6 | hard | developer-tools | Advanced Python code intelligence CLI for static and interpreter-assisted analysis. Supports completion, inference, goto-definition, references, signatures, project search, refactors (rename/inline/extract), syntax diagnostics, environment discovery, scope context reporting, and project-level configuration/settings overrides. |
| gabeorlanski/test_translator | 8 | hard | developer-tools | CLI test-harness translation engine that generates and runs language-specific tester files (Python, JavaScript, TypeScript) from a structured tests.py spec, with line-based test discovery, deep equality checking, JSON result output, and strict generate-before-test enforcement. |
| gabeorlanski/textdrop | 6 | easy | web | Self-hosted text sharing HTTP service with strict boundary validation, markdown + metadata rendering, TOC/preview generation, static docs/assets, lifecycle auth/drain controls, signed per-user cookies, and pluggable local/object storage with startup/runtime failure contracts. |
| gabeorlanski/trajectory_api | 5 | medium | web | REST API for storing ML agent trajectories with token/cost tracking, search, and reports. Adds mutable trajectories with ETag-based concurrency, forking with lineage tracking, EBNF grammar parsing for tool call extraction, and sandboxed Python/Bash execution. |
| gabeorlanski/xjq | 5 | easy | developer-tools | CLI query tool for XML/HTML/JSON with XPath and CSS selectors, text extraction modes, file input precedence, and smart XML/JSON output formatting. |
Source and license
The source problem specifications, tests, and reference solutions are maintained
in SprocketLab/scb-problems.
The benchmark runner is maintained in
SprocketLab/slop-code-bench.
This dataset follows the source repository's license and citation guidance.
Citation
If you use this dataset or benchmark, please cite the SlopCodeBench paper:
@article{Orlanski2025SlopCodeBench,
  author  = {Orlanski, Gabriel and Roy, Devjeet and Yun, Alexander and Shin, Changho and Gu, Alex and Ge, Albert and Adila, Dyah and Albarghouthi, Aws and Sala, Frederic},
  title   = {{SlopCodeBench: Measuring Code Erosion Under Iterative Specification Refinement}},
  journal = {arXiv preprint arXiv:2603.24755},
  year    = {2025},
  url     = {https://arxiv.org/abs/2603.24755}
}