SlopCodeBench Harbor Dataset

gabeorlanski/slopcodebench is the Harbor conversion of the SlopCodeBench problem set. SlopCodeBench evaluates coding agents under iterative specification refinement: an agent starts with checkpoint 1, then repeatedly extends its own solution as new checkpoint specifications are revealed. Later checkpoints keep earlier requirements in force, so the benchmark measures both feature implementation and regression resistance over time.

This dataset contains 36 multi-step Harbor tasks converted from the SprocketLab/scb-problems repository.

Running the dataset

uvx harbor run -d "gabeorlanski/slopcodebench@latest" -a "<agent>" -m "<model>"

You can also run an individual task directly, for example:

uvx harbor run -t "gabeorlanski/file_backup@latest" -a "<agent>" -m "<model>"
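
To queue several individual tasks, the command above can be assembled programmatically. The following is a minimal sketch that only builds the argument lists and prints them; the helper name `harbor_run_command` is hypothetical, and it assumes that other task IDs from the table accept the same `@latest` suffix as `file_backup`.

```python
import shlex


def harbor_run_command(target, agent, model, *, single_task=False):
    """Build the argv for `uvx harbor run` using only the flags shown above.

    -d selects a whole dataset, -t a single task (hypothetical helper).
    """
    flag = "-t" if single_task else "-d"
    return ["uvx", "harbor", "run", flag, target, "-a", agent, "-m", model]


# Assumption: every task ID in the table can be suffixed with "@latest".
tasks = ["gabeorlanski/file_backup@latest", "gabeorlanski/xjq@latest"]
for task in tasks:
    argv = harbor_run_command(task, "<agent>", "<model>", single_task=True)
    # shlex.join quotes the placeholders so the line is safe to paste into a shell.
    print(shlex.join(argv))
```

Each printed line can be pasted into a shell, or the list can be passed to `subprocess.run` once the agent and model placeholders are replaced with real values.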

Evaluation shape

Each task is a Harbor multi-step task. Every step corresponds to one SlopCodeBench checkpoint and includes:

a checkpoint-specific instruction/specification;
the verifier tests for that checkpoint, with prior checkpoint tests included where the source problem requires regression coverage;
a reference solution payload used for validation/oracle runs.
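
The regression-coverage rule in the second item can be pictured as follows. This is an illustrative sketch only: the function name and the boolean `regression` flag are hypothetical and do not reflect how Harbor or the source repository actually store this setting.

```python
def checkpoints_under_test(step: int, regression: bool) -> list[int]:
    """Return which checkpoints' tests run at a given step.

    Hypothetical model: with regression coverage, step k runs the tests
    of checkpoints 1..k; without it, only checkpoint k's own tests run.
    """
    if step < 1:
        raise ValueError("checkpoints are numbered from 1")
    return list(range(1, step + 1)) if regression else [step]


# Step 3 of a task whose source problem requires regression coverage.
print(checkpoints_under_test(3, regression=True))
# Step 3 of a task that only verifies the newest checkpoint.
print(checkpoints_under_test(3, regression=False))
```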

Each task's reward is the mean of its per-step rewards (Harbor's multi-step mean strategy). The dataset includes metric.py, which aggregates trial rewards into:

core_pass_rate_mean
isolated_pass_rate_mean
strict_pass_rate_mean
verbosity_mean
erosion_mean
verbosity_increase_rate
erosion_increase_rate
trial_count
missing_trial_count
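
The two levels of aggregation can be sketched roughly as below. This is not the contents of metric.py: the function names, the reward dictionary shape, the use of `None` for a missing trial, and the choice to count missing trials in `trial_count` are all assumptions made for illustration. The two `*_increase_rate` metrics are left out because their definitions are not given here.

```python
from statistics import mean
from typing import Optional

# Reward keys whose per-trial values are averaged (names taken from the list above).
MEAN_KEYS = ("core_pass_rate", "isolated_pass_rate", "strict_pass_rate",
             "verbosity", "erosion")


def reduce_steps(step_rewards: list[dict[str, float]]) -> dict[str, float]:
    """Mean strategy: average each reward key over the steps that report it."""
    keys = sorted({k for rewards in step_rewards for k in rewards})
    return {k: mean(r[k] for r in step_rewards if k in r) for k in keys}


def aggregate_trials(trials: list[Optional[dict[str, float]]]) -> dict[str, float]:
    """Average task-level rewards across trials; None marks a missing trial."""
    present = [t for t in trials if t is not None]
    summary = {
        # Assumption: trial_count includes trials that produced no reward.
        "trial_count": len(trials),
        "missing_trial_count": len(trials) - len(present),
    }
    for key in MEAN_KEYS:
        values = [t[key] for t in present if key in t]
        if values:  # skip keys that no trial reported
            summary[f"{key}_mean"] = mean(values)
    return summary


# Two steps of one task, then two trials of which one is missing.
task_reward = reduce_steps([
    {"core_pass_rate": 1.0, "erosion": 0.0},
    {"core_pass_rate": 0.5, "erosion": 0.5},
])
print(aggregate_trials([task_reward, None]))
```

Keeping the step-level reduction separate from the trial-level aggregation mirrors the description above, where Harbor reduces steps and metric.py handles trials.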

Tasks

Task	Checkpoints	Difficulty	Category	Summary
`gabeorlanski/cfgpipe`	6	easy	configuration-management	CLI configuration resolver that reads a JSON schema, resolves typed parameters from prioritized sources (default, env, file, primary/secondary stores, args), supports nested groups, watch mode with structured change events, advanced types (duration, pattern, map, list, redacted), and store prefix composition.
`gabeorlanski/circuit_eval`	8	medium	simulation	CLI tool for digital circuit evaluation and optimization. Parses scalar and vector circuits in .circ, .json, and .bench formats. Evaluates circuits with 2-valued and 3-valued logic, generates truth tables, checks equivalence between circuits, and optimizes circuits with configurable passes (constant folding, CSE, DCE, etc.).
`gabeorlanski/code_search`	5	easy	developer-tools	Multi-language code search tool (inspired by ast-grep) that finds patterns and applies refactorings. Starts with regex search in Python, adds AST-based pattern matching with metavariables, then auto-fix with conflict resolution. Supports Python, JS, C++, Rust, Java, Go, and Haskell.
`gabeorlanski/dag_execution`	3	hard	dsl	Workflow orchestration system with a custom DSL for defining DAGs of tasks with dependencies and parameters. Includes a parser, execution engine, and JSONL logging. Adds caching with content-hashing and time-based strategies, then dynamic cache overrides per-task.
`gabeorlanski/database_migration`	5	medium	databases	SQLite migration CLI. Starts with basic DDL (create table, add/drop columns), adds data transformations and backfills, then foreign keys/indexes/check constraints with rollback support, and finally dependency management with topological sorting and cycle detection.
`gabeorlanski/datagate`	7	easy	web	HTTP service for ingesting tabular files from URLs/uploads into queryable datasets with pagination, sorting, filtering, export, caching, config-based runtime controls, access control, and optional enrichment metadata.
`gabeorlanski/dynamic_buffer`	4	hard	data-processing	Code generator that infers data transformations from input/output examples and emits working code in Python, JS, C++, or Rust. Handles filtering, column ops, stateful transforms (prefix sums, sliding windows), and window functions. Generated code streams data with fixed buffers.
`gabeorlanski/dynamic_config_service_api`	4	medium	web	REST API for versioned configs with inheritance and deep-merge. Adds JSON Schema validation and multi-format input (JSON/YAML/TOML), then approval workflows with drafts and quorum-based review, and finally OPA/Rego policy enforcement.
`gabeorlanski/env_manager`	5	easy	cli-tools	Declarative system provisioning planner CLI that validates module configs and generates deterministic execution plans across macOS and Linux. Includes package/app installs, file actions, preferences, dock configuration, language runtime environments, profile manifests, and standalone build script generation.
`gabeorlanski/etl_pipeline`	5	easy	data-processing	CLI that parses and executes ETL pipelines defined in JSON. Supports select, filter, map, rename, and limit operations with a custom expression language. Adds conditional branching, reusable sub-pipelines with parameters, and a library system for modular definitions.
`gabeorlanski/eve_industry`	6	hard	data-processing	EVE Online manufacturing planner that parses the SDE to compute recipes, material costs, and build times. Adds invention probability calculations, ME/TE efficiency with waste tracking, full build planning with job scheduling, and recursive build-all with automatic job splitting.
`gabeorlanski/eve_jump_planner`	3	medium	algorithms	Jump Freighter route planner for EVE Online. Calculates optimal routes with fuel costs, jump fatigue, and 3D spatial distances. Adds cloak-and-jump mechanics for extended range and handles high-sec destinations by finding nearby low-sec entry points.
`gabeorlanski/eve_market_tools`	4	hard	web	REST API for EVE Online market data. Ingests market orders, builds price books, and provides regional stats and hub comparisons. Adds reprocessing yield calculations, minimum-cost ore optimization across hubs, and profit-finding for arbitrage and hauling.
`gabeorlanski/eve_route_planner`	3	medium	algorithms	EVE Online route planner with realistic warp physics (acceleration/deceleration, gate locks). Adds cargo hauling with manifests and multi-trip planning, then contract optimization to select the most profitable jobs given time and ISK/jump constraints.
`gabeorlanski/execution_server`	6	easy	networking	HTTP server that executes shell commands and returns results. Adds file tracking with globs, multi-format output support, command chains with hooks, caching, persistent environments with concurrency modes, and job scheduling with queues, templates, and dependency graphs.
`gabeorlanski/file_backup`	4	easy	file-systems	Backup scheduler CLI that reads YAML configs to run scheduled backup jobs (daily/weekly/once) with glob exclusions. Supports full backups, tar packing, verification mode, and incremental backups using SHA-256 to skip unchanged files.
`gabeorlanski/file_merger`	4	medium	data-processing	CLI that merges data files (CSV, TSV, JSONL, Parquet) into sorted, partitioned CSV output. Handles schema alignment, compression, and external sorting. Adds Hive-style partitioning, file sharding, and nested type support.
`gabeorlanski/file_query_tool`	5	medium	data-processing	SQL engine for querying data files (CSV, Parquet, TSV, JSON). Supports joins, aggregations, filtering, glob patterns for sharded tables, window functions (ROW_NUMBER, RANK, etc.), CTEs, and subqueries.
`gabeorlanski/forge`	8	easy	web	Command-line resource broker inspired by OpenStack Cyborg concepts. Manages blueprints, allocations, units, modules, tags, revision gating, and admin workflows with strict JSON contracts.
`gabeorlanski/l2m`	5	easy	developer-tools	Command-line tool that converts LaTeX source files to KaTeX-compatible Markdown.
`gabeorlanski/layered_config_synthesizer`	4	medium	configuration-management	CLI that merges layered YAML/JSON configs for ML training with deterministic conflict resolution. Adds fragment expansion, env var interpolation, multi-run manifest processing, and JSON Schema validation. Outputs canonical JSON with sorted keys.
`gabeorlanski/log_query`	5	medium	developer-tools	NDJSON query engine with custom SQL-like syntax. Implements filtering, aggregations with GROUP BY, multi-source joins (CONFLATE), schema mapping via GLOSS labels, and subqueries with custom keywords (POCKET, BEHOLDS, etc.).
`gabeorlanski/meshctl`	8	hard	cli-tools	CLI tool for managing distributed cache mesh resources through declarative YAML specifications. Validates specs, applies defaults, persists state, and reports structured JSON to stdout.
`gabeorlanski/metric_transform_lang`	5	hard	dsl	Interpreter for MTL, a DSL for processing event streams. Handles CSV/TSV/Parquet input, aggregations, window functions (lag/lead), joins with temporal constraints, and resumable execution. Output is deterministic JSON.
`gabeorlanski/migrate_configs`	5	easy	cli-tools	Config migration CLI that applies transformation rules to JSON/YAML/TOML/INI files. Supports value replacement, key renaming, pattern matching, array filtering, config inheritance with cycle detection, file relocation, and pre-transformation validation.
`gabeorlanski/mocked_http`	8	hard	web	Multi-protocol mock server with YAML-defined behaviors and admin controls.
`gabeorlanski/mvvault`	6	medium	cli-tools	CLI tool for creating and maintaining local vaults that archive content metadata from an online media platform. Tracks timestamped field-level history across three catalog schema versions (v1: flat entries with UNIX-epoch keys, v2: categorized with ISO keys, v3: full format with removal tracking and annotations). Supports selective sync with media downloads, format-aware digest reports, a local HTTP viewer with version-adaptive routing and chart timestamp normalization, and annotation-triggered auto-migration.
`gabeorlanski/pwd_manager`	5	medium	cli-tools	Interactive CLI password manager with encrypted local vault storage, master key unlock flow, search/add/edit/delete operations, category management, clipboard integration, tab completion, import/export, and vault locking controls.
`gabeorlanski/recli`	8	hard	developer-tools	CLI framework with hierarchical command dispatch, argument validation, YAML configuration with inheritance, aliases, output formatting, file caching, SQLite persistence, container orchestration, version upgrade infrastructure, and system requirement checks.
`gabeorlanski/rejector`	5	hard	data-processing	Synthetic data generation pipeline that maximizes throughput against a rate-limited LLM API. Supports multiple task types, generation schemes, in-context learning setups, agentic tool-call loops, and multi-provider routing.
`gabeorlanski/sheeteval`	7	hard	developer-tools	CLI spreadsheet grader for .xlsx answer keys and student submissions. Supports typed literal checks, tolerance/alternates/penalties, formula grading, dependencies/fatal/concealed controls, minimum thresholds, check-mode scenario verification, and HTML report rendering.
`gabeorlanski/sith`	6	hard	developer-tools	Advanced Python code intelligence CLI for static and interpreter-assisted analysis. Supports completion, inference, goto-definition, references, signatures, project search, refactors (rename/inline/extract), syntax diagnostics, environment discovery, scope context reporting, and project-level configuration/settings overrides.
`gabeorlanski/test_translator`	8	hard	developer-tools	CLI test-harness translation engine that generates and runs language-specific tester files (Python, JavaScript, TypeScript) from a structured tests.py spec, with line-based test discovery, deep equality checking, JSON result output, and strict generate-before-test enforcement.
`gabeorlanski/textdrop`	6	easy	web	Self-hosted text sharing HTTP service with strict boundary validation, markdown + metadata rendering, TOC/preview generation, static docs/assets, lifecycle auth/drain controls, signed per-user cookies, and pluggable local/object storage with startup/runtime failure contracts.
`gabeorlanski/trajectory_api`	5	medium	web	REST API for storing ML agent trajectories with token/cost tracking, search, and reports. Adds mutable trajectories with ETag-based concurrency, forking with lineage tracking, EBNF grammar parsing for tool call extraction, and sandboxed Python/Bash execution.
`gabeorlanski/xjq`	5	easy	developer-tools	CLI query tool for XML/HTML/JSON with XPath and CSS selectors, text extraction modes, file input precedence, and smart XML/JSON output formatting.

Source and license

The source problem specifications, tests, and reference solutions are maintained in SprocketLab/scb-problems. The benchmark runner is maintained in SprocketLab/slop-code-bench.

This dataset follows the source repository's license and citation guidance.

Citation

If you use this dataset or benchmark, please cite the SlopCodeBench paper:

@article{Orlanski2025SlopCodeBench,
  author = {Orlanski, Gabriel and Roy, Devjeet and Yun, Alexander and Shin, Changho and Gu, Alex and Ge, Albert and Adila, Dyah and Albarghouthi, Aws and Sala, Frederic},
  title = {{SlopCodeBench: Measuring Code Erosion Under Iterative Specification Refinement}},
  journal = {arXiv preprint arXiv:2603.24755},
  year = {2025},
  url = {https://arxiv.org/abs/2603.24755}
}
