gabeorlanski/slopcodebench

SlopCodeBench multi-checkpoint coding benchmark tasks converted for Harbor.

harbor run -d gabeorlanski/slopcodebench

SlopCodeBench Harbor Dataset

gabeorlanski/slopcodebench is the Harbor conversion of the SlopCodeBench problem set. SlopCodeBench evaluates coding agents under iterative specification refinement: an agent starts with checkpoint 1, then repeatedly extends its own solution as new checkpoint specifications are revealed. Later checkpoints keep earlier requirements in force, so the benchmark measures both feature implementation and regression resistance over time.

This dataset contains 36 multi-step Harbor tasks converted from the SprocketLab/scb-problems repository.

Running the dataset

uvx harbor run -d "gabeorlanski/slopcodebench@latest" -a "<agent>" -m "<model>"

You can also run an individual task directly, for example:

uvx harbor run -t "gabeorlanski/file_backup@latest" -a "<agent>" -m "<model>"

Evaluation shape

Each task is a Harbor multi-step task. Every step corresponds to one SlopCodeBench checkpoint and includes:

  • a checkpoint-specific instruction/specification;
  • the verifier tests for that checkpoint, with prior checkpoint tests included where the source problem requires regression coverage;
  • a reference solution payload used for validation/oracle runs.

Each task's reward is the mean of its per-step rewards (Harbor's multi-step mean strategy). The dataset also includes metric.py, which aggregates trial rewards into:

  • core_pass_rate_mean
  • isolated_pass_rate_mean
  • strict_pass_rate_mean
  • verbosity_mean
  • erosion_mean
  • verbosity_increase_rate
  • erosion_increase_rate
  • trial_count
  • missing_trial_count
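
The following is a minimal sketch of the two reductions described above, not the contents of metric.py. It assumes that each trial reports per-trial values under keys such as `core_pass_rate` (the `_mean` names with the suffix removed), that a missing trial is represented as `None`, and that `trial_count` includes missing trials. The two `*_increase_rate` metrics are left out because their definitions are not given here.

```python
from statistics import mean
from typing import Optional

# Assumed per-trial keys, derived from the *_mean metric names above.
MEAN_KEYS = (
    "core_pass_rate",
    "isolated_pass_rate",
    "strict_pass_rate",
    "verbosity",
    "erosion",
)


def task_reward(step_rewards: list[float]) -> float:
    """Reduce per-step rewards to one task reward by taking the mean."""
    return mean(step_rewards)


def aggregate(trials: list[Optional[dict[str, float]]]) -> dict[str, float]:
    """Average each known key over the trials that reported it."""
    present = [trial for trial in trials if trial is not None]
    summary: dict[str, float] = {
        "trial_count": len(trials),  # assumption: counts missing trials too
        "missing_trial_count": len(trials) - len(present),
    }
    for key in MEAN_KEYS:
        values = [trial[key] for trial in present if key in trial]
        if values:  # skip keys that no trial reported
            summary[f"{key}_mean"] = mean(values)
    return summary


# Illustrative values only.
trials = [
    {"core_pass_rate": 1.0, "erosion": 0.25},
    None,  # a trial that produced no reward
    {"core_pass_rate": 0.5, "erosion": 0.75},
]
print(task_reward([1.0, 0.5, 0.0]))
print(aggregate(trials))
```

Skipping missing trials when averaging, instead of counting them as zero, is a design choice of this sketch; the actual metric.py may treat them differently.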

Tasks

| Task | Checkpoints | Difficulty | Category | Summary |
| --- | --- | --- | --- | --- |
| gabeorlanski/cfgpipe | 6 | easy | configuration-management | CLI configuration resolver that reads a JSON schema, resolves typed parameters from prioritized sources (default, env, file, primary/secondary stores, args), supports nested groups, watch mode with structured change events, advanced types (duration, pattern, map, list, redacted), and store prefix composition. |
| gabeorlanski/circuit_eval | 8 | medium | simulation | CLI tool for digital circuit evaluation and optimization. Parses scalar and vector circuits in .circ, .json, and .bench formats. Evaluates circuits with 2-valued and 3-valued logic, generates truth tables, checks equivalence between circuits, and optimizes circuits with configurable passes (constant folding, CSE, DCE, etc.). |
| gabeorlanski/code_search | 5 | easy | developer-tools | Multi-language code search tool (inspired by ast-grep) that finds patterns and applies refactorings. Starts with regex search in Python, adds AST-based pattern matching with metavariables, then auto-fix with conflict resolution. Supports Python, JS, C++, Rust, Java, Go, and Haskell. |
| gabeorlanski/dag_execution | 3 | hard | dsl | Workflow orchestration system with a custom DSL for defining DAGs of tasks with dependencies and parameters. Includes a parser, execution engine, and JSONL logging. Adds caching with content-hashing and time-based strategies, then dynamic cache overrides per-task. |
| gabeorlanski/database_migration | 5 | medium | databases | SQLite migration CLI. Starts with basic DDL (create table, add/drop columns), adds data transformations and backfills, then foreign keys/indexes/check constraints with rollback support, and finally dependency management with topological sorting and cycle detection. |
| gabeorlanski/datagate | 7 | easy | web | HTTP service for ingesting tabular files from URLs/uploads into queryable datasets with pagination, sorting, filtering, export, caching, config-based runtime controls, access control, and optional enrichment metadata. |
| gabeorlanski/dynamic_buffer | 4 | hard | data-processing | Code generator that infers data transformations from input/output examples and emits working code in Python, JS, C++, or Rust. Handles filtering, column ops, stateful transforms (prefix sums, sliding windows), and window functions. Generated code streams data with fixed buffers. |
| gabeorlanski/dynamic_config_service_api | 4 | medium | web | REST API for versioned configs with inheritance and deep-merge. Adds JSON Schema validation and multi-format input (JSON/YAML/TOML), then approval workflows with drafts and quorum-based review, and finally OPA/Rego policy enforcement. |
| gabeorlanski/env_manager | 5 | easy | cli-tools | Declarative system provisioning planner CLI that validates module configs and generates deterministic execution plans across macOS and Linux. Includes package/app installs, file actions, preferences, dock configuration, language runtime environments, profile manifests, and standalone build script generation. |
| gabeorlanski/etl_pipeline | 5 | easy | data-processing | CLI that parses and executes ETL pipelines defined in JSON. Supports select, filter, map, rename, and limit operations with a custom expression language. Adds conditional branching, reusable sub-pipelines with parameters, and a library system for modular definitions. |
| gabeorlanski/eve_industry | 6 | hard | data-processing | EVE Online manufacturing planner that parses the SDE to compute recipes, material costs, and build times. Adds invention probability calculations, ME/TE efficiency with waste tracking, full build planning with job scheduling, and recursive build-all with automatic job splitting. |
| gabeorlanski/eve_jump_planner | 3 | medium | algorithms | Jump Freighter route planner for EVE Online. Calculates optimal routes with fuel costs, jump fatigue, and 3D spatial distances. Adds cloak-and-jump mechanics for extended range and handles high-sec destinations by finding nearby low-sec entry points. |
| gabeorlanski/eve_market_tools | 4 | hard | web | REST API for EVE Online market data. Ingests market orders, builds price books, and provides regional stats and hub comparisons. Adds reprocessing yield calculations, minimum-cost ore optimization across hubs, and profit-finding for arbitrage and hauling. |
| gabeorlanski/eve_route_planner | 3 | medium | algorithms | EVE Online route planner with realistic warp physics (acceleration/deceleration, gate locks). Adds cargo hauling with manifests and multi-trip planning, then contract optimization to select the most profitable jobs given time and ISK/jump constraints. |
| gabeorlanski/execution_server | 6 | easy | networking | HTTP server that executes shell commands and returns results. Adds file tracking with globs, multi-format output support, command chains with hooks, caching, persistent environments with concurrency modes, and job scheduling with queues, templates, and dependency graphs. |
| gabeorlanski/file_backup | 4 | easy | file-systems | Backup scheduler CLI that reads YAML configs to run scheduled backup jobs (daily/weekly/once) with glob exclusions. Supports full backups, tar packing, verification mode, and incremental backups using SHA-256 to skip unchanged files. |
| gabeorlanski/file_merger | 4 | medium | data-processing | CLI that merges data files (CSV, TSV, JSONL, Parquet) into sorted, partitioned CSV output. Handles schema alignment, compression, and external sorting. Adds Hive-style partitioning, file sharding, and nested type support. |
| gabeorlanski/file_query_tool | 5 | medium | data-processing | SQL engine for querying data files (CSV, Parquet, TSV, JSON). Supports joins, aggregations, filtering, glob patterns for sharded tables, window functions (ROW_NUMBER, RANK, etc.), CTEs, and subqueries. |
| gabeorlanski/forge | 8 | easy | web | Command-line resource broker inspired by OpenStack Cyborg concepts. Manages blueprints, allocations, units, modules, tags, revision gating, and admin workflows with strict JSON contracts. |
| gabeorlanski/l2m | 5 | easy | developer-tools | Command-line tool that converts LaTeX source files to KaTeX-compatible Markdown. |
| gabeorlanski/layered_config_synthesizer | 4 | medium | configuration-management | CLI that merges layered YAML/JSON configs for ML training with deterministic conflict resolution. Adds fragment expansion, env var interpolation, multi-run manifest processing, and JSON Schema validation. Outputs canonical JSON with sorted keys. |
| gabeorlanski/log_query | 5 | medium | developer-tools | NDJSON query engine with custom SQL-like syntax. Implements filtering, aggregations with GROUP BY, multi-source joins (CONFLATE), schema mapping via GLOSS labels, and subqueries with custom keywords (POCKET, BEHOLDS, etc.). |
| gabeorlanski/meshctl | 8 | hard | cli-tools | CLI tool for managing distributed cache mesh resources through declarative YAML specifications. Validates specs, applies defaults, persists state, and reports structured JSON to stdout. |
| gabeorlanski/metric_transform_lang | 5 | hard | dsl | Interpreter for MTL, a DSL for processing event streams. Handles CSV/TSV/Parquet input, aggregations, window functions (lag/lead), joins with temporal constraints, and resumable execution. Output is deterministic JSON. |
| gabeorlanski/migrate_configs | 5 | easy | cli-tools | Config migration CLI that applies transformation rules to JSON/YAML/TOML/INI files. Supports value replacement, key renaming, pattern matching, array filtering, config inheritance with cycle detection, file relocation, and pre-transformation validation. |
| gabeorlanski/mocked_http | 8 | hard | web | Multi-protocol mock server with YAML-defined behaviors and admin controls. |
| gabeorlanski/mvvault | 6 | medium | cli-tools | CLI tool for creating and maintaining local vaults that archive content metadata from an online media platform. Tracks timestamped field-level history across three catalog schema versions (v1: flat entries with UNIX-epoch keys, v2: categorized with ISO keys, v3: full format with removal tracking and annotations). Supports selective sync with media downloads, format-aware digest reports, a local HTTP viewer with version-adaptive routing and chart timestamp normalization, and annotation-triggered auto-migration. |
| gabeorlanski/pwd_manager | 5 | medium | cli-tools | Interactive CLI password manager with encrypted local vault storage, master key unlock flow, search/add/edit/delete operations, category management, clipboard integration, tab completion, import/export, and vault locking controls. |
| gabeorlanski/recli | 8 | hard | developer-tools | CLI framework with hierarchical command dispatch, argument validation, YAML configuration with inheritance, aliases, output formatting, file caching, SQLite persistence, container orchestration, version upgrade infrastructure, and system requirement checks. |
| gabeorlanski/rejector | 5 | hard | data-processing | A synthetic data generation pipeline that maximizes throughput against a rate-limited LLM API. Supports multiple task types, generation schemes, in-context learning setups, agentic tool-call loops, and multi-provider routing. |
| gabeorlanski/sheeteval | 7 | hard | developer-tools | CLI spreadsheet grader for .xlsx answer keys and student submissions. Supports typed literal checks, tolerance/alternates/penalties, formula grading, dependencies/fatal/concealed controls, minimum thresholds, check-mode scenario verification, and HTML report rendering. |
| gabeorlanski/sith | 6 | hard | developer-tools | Advanced Python code intelligence CLI for static and interpreter-assisted analysis. Supports completion, inference, goto-definition, references, signatures, project search, refactors (rename/inline/extract), syntax diagnostics, environment discovery, scope context reporting, and project-level configuration/settings overrides. |
| gabeorlanski/test_translator | 8 | hard | developer-tools | CLI test-harness translation engine that generates and runs language-specific tester files (Python, JavaScript, TypeScript) from a structured tests.py spec, with line-based test discovery, deep equality checking, JSON result output, and strict generate-before-test enforcement. |
| gabeorlanski/textdrop | 6 | easy | web | Self-hosted text sharing HTTP service with strict boundary validation, markdown + metadata rendering, TOC/preview generation, static docs/assets, lifecycle auth/drain controls, signed per-user cookies, and pluggable local/object storage with startup/runtime failure contracts. |
| gabeorlanski/trajectory_api | 5 | medium | web | REST API for storing ML agent trajectories with token/cost tracking, search, and reports. Adds mutable trajectories with ETag-based concurrency, forking with lineage tracking, EBNF grammar parsing for tool call extraction, and sandboxed Python/Bash execution. |
| gabeorlanski/xjq | 5 | easy | developer-tools | CLI query tool for XML/HTML/JSON with XPath and CSS selectors, text extraction modes, file input precedence, and smart XML/JSON output formatting. |

Source and license

The source problem specifications, tests, and reference solutions are maintained in SprocketLab/scb-problems. The benchmark runner is maintained in SprocketLab/slop-code-bench.

This dataset follows the source repository's license and citation guidance.

Citation

If you use this dataset or benchmark, please cite the SlopCodeBench paper:

@article{Orlanski2025SlopCodeBench,
  author = {Orlanski, Gabriel and Roy, Devjeet and Yun, Alexander and Shin, Changho and Gu, Alex and Ge, Albert and Adila, Dyah and Albarghouthi, Aws and Sala, Frederic},
  title = {{SlopCodeBench: Measuring Code Erosion Under Iterative Specification Refinement}},
  journal = {arXiv preprint arXiv:2603.24755},
  year = {2025},
  url = {https://arxiv.org/abs/2603.24755}
}