actava-ai/chi-bench

χ-Bench: long-horizon, policy-rich U.S. healthcare workflow agent benchmark across provider prior-auth, payer UM, and care management. 78 single-agent tasks on the hub (75 single-domain + 3 marathon); the 23 provider-payer E2E arena tasks need the two-agent dual-pa-e2e harness and run via the source repo, not the hub. Each task builds a self-contained image on demand (no hosted image/registry): Harbor clones the source repo and downloads the public fixtures dataset at build time.

PREREQUISITES: Docker + Harbor CLI, and an APPROVED Hugging Face token for the gated Managed-Care Operations Handbook (every task needs it; fetched at container start). Request access: https://huggingface.co/datasets/actava/managed-care-operations-handbook

RUN (HF_TOKEN from shell/--env-file; -y confirms, -i picks one task):
HF_TOKEN=<approved-token> harbor run -d actava-ai/chi-bench@v1.0.1 -i actava-ai/pa_t016_t016_o001_p01_p2p_payer -a claude-code -m anthropic/claude-opus-4-7 -y

LINKS
Code/runner/CLI: https://github.com/actava-ai/chi-bench
Fixtures dataset: https://huggingface.co/datasets/actava/chi-bench
Handbook (gated): https://huggingface.co/datasets/actava/managed-care-operations-handbook
Leaderboard: https://actava.ai/benchmarks/leaderboards
Paper: https://arxiv.org/abs/2605.16679
Harbor-hub guide: https://github.com/actava-ai/chi-bench/blob/main/docs/harbor-hub.md

harbor run -d actava-ai/chi-bench

χ-Bench is a benchmark for agents on long-horizon, policy-rich U.S. healthcare workflows in three domains: provider prior authorization, payer utilization management, and care management.

This Harbor hub listing carries 78 single-agent tasks (75 single-domain + 3 marathon long-horizon sessions). Each task ships a self-contained Dockerfile that Harbor builds on demand — cloning the source repo and downloading the public fixtures dataset at build time — so you can run a trial without cloning anything. No hosted image or registry is involved.

Prerequisites

  • Docker + the Harbor CLI.
  • An approved Hugging Face token for the gated Managed-Care Operations Handbook; every task needs it. The fixtures dataset is public, but the handbook is downloaded at container start using your token. Request access at https://huggingface.co/datasets/actava/managed-care-operations-handbook.
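
The prerequisites above can be checked before a run with a small preflight script. This is a minimal sketch, not part of χ-Bench: the function names missing_tools and token_is_set are hypothetical, and the script only confirms that the tools are on PATH and that HF_TOKEN is non-empty. It cannot tell whether the token has actually been approved for the gated handbook.

```shell
# Preflight sketch (hypothetical helper names; not part of the chi-bench repo).

# Print each named command that is not found on PATH, one per line.
missing_tools() {
    for _tool in "$@"; do
        command -v "$_tool" >/dev/null 2>&1 || printf '%s\n' "$_tool"
    done
    return 0
}

# Succeed only if HF_TOKEN is set and non-empty.
token_is_set() {
    [ -n "${HF_TOKEN:-}" ]
}

# Example use:
#   missing_tools docker harbor   # prints nothing when both are installed
#   token_is_set || echo "HF_TOKEN is empty; export an approved token first" >&2
```

Keeping the two checks as separate functions lets a wrapper script report every missing prerequisite at once instead of stopping at the first one.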

Run a task

HF_TOKEN is read from your shell (or --env-file); -y auto-confirms forwarding it into the container, and -i selects one task (omit to run all 78):

HF_TOKEN=<your-approved-hf-token> harbor run \
    -d actava-ai/chi-bench@v1.0.1 \
    -i actava-ai/pa_t016_t016_o001_p01_p2p_payer \
    -a claude-code -m anthropic/claude-opus-4-7 -y

Do not pass the token with -e (that flag selects the environment type); --ae/--ek do not reach the entrypoint either. Without an approved token the container exits early with code 78. Standard tasks are scored by task_runtime verify; marathon_* tasks run a long-horizon session scored by session_verifier. The reward is written to /logs/verifier/reward.json.
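
The exit code and reward path mentioned above could be handled in a wrapper along these lines. This is a sketch under several assumptions: the helper names explain_exit and show_reward are hypothetical; it assumes the exit status you inspect is the container's (this listing does not say whether harbor run propagates it); and it assumes reward.json is readable at the path you pass in. The reward.json schema is not described here, so the file is printed as-is rather than parsed.

```shell
# Map an exit status to a human-readable hint.
# 78 is the early-exit code this listing gives for a missing or unapproved token.
explain_exit() {
    case "$1" in
        0)  echo "ok" ;;
        78) echo "token missing or not approved for the gated handbook" ;;
        *)  echo "failed with exit code $1" ;;
    esac
}

# Print the reward file if it exists; otherwise report where we looked.
# The default is the in-container path named in this listing.
show_reward() {
    reward_file="${1:-/logs/verifier/reward.json}"
    if [ -f "$reward_file" ]; then
        cat "$reward_file"
    else
        echo "no reward file at $reward_file" >&2
        return 1
    fi
}

# Example use, with a status captured from an earlier command:
#   explain_exit "$status"
#   show_reward /path/to/reward.json
```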

E2E is not on the hub

The 23 provider↔payer E2E arena tasks are intentionally excluded. The arena needs the two-agent dual-pa-e2e harness (provider phase → relay → payer phase), which a stock single-agent harbor run cannot drive. Run E2E from the source repo / HF dataset:

cb experiment run --dataset data/prior_auth_e2e/tasks/<id> \
    --agent dual-pa-e2e --provider-model <m> --payer-model <m>
# full arena: ./scripts/run_table.sh table2   (configs/experiments/table2_e2e_arena.yaml)
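
To run several E2E tasks in sequence, the cb command above can be wrapped in a loop. The sketch below only prints the commands (a dry run) so they can be reviewed before anything is executed. It assumes each task is a directory under data/prior_auth_e2e/tasks in a checkout of the source repo; the function name print_e2e_cmds is hypothetical, and the model identifiers are placeholders you supply.

```shell
# Print one `cb experiment run` command per task directory (dry run only).
# Usage: print_e2e_cmds <tasks-dir> <provider-model> <payer-model>
print_e2e_cmds() {
    tasks_dir="$1"; provider_model="$2"; payer_model="$3"
    for task in "$tasks_dir"/*/; do
        [ -d "$task" ] || continue      # skip when the glob matches nothing
        printf 'cb experiment run --dataset %s --agent dual-pa-e2e --provider-model %s --payer-model %s\n' \
            "${task%/}" "$provider_model" "$payer_model"
    done
}

# Example (run from the source repo root, substituting real model names):
#   print_e2e_cmds data/prior_auth_e2e/tasks PROVIDER_MODEL PAYER_MODEL
```

Printing rather than executing keeps the sketch safe to try; once the output looks right, it can be piped to sh or the printf replaced with the command itself.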

Where χ-Bench lives

Source code, runner, judge, CLI: github.com/actava-ai/chi-bench
Fixtures + shared worlds (dataset): huggingface.co/datasets/actava/chi-bench
Managed-Care Handbook (gated): huggingface.co/datasets/actava/managed-care-operations-handbook
Leaderboard: actava.ai/benchmarks/leaderboards
Paper: arXiv:2605.16679
Harbor-hub guide: docs/harbor-hub.md

For paper reproduction, the full experiment matrix, and leaderboard submission, use the cb CLI from the source repo. This hub listing is for ad-hoc runs and discovery.