actava-ai/chi-bench
χ-Bench: a long-horizon, policy-rich U.S. healthcare workflow agent benchmark across provider prior-auth, payer UM, and care management. The hub carries 78 single-agent tasks (75 single-domain + 3 marathon); the 23 provider-payer E2E arena tasks need the two-agent dual-pa-e2e harness and run via the source repo, not the hub. Each task builds a self-contained image on demand (no hosted image or registry): Harbor clones the source repo and downloads the public fixtures dataset at build time.
Prerequisites: Docker + Harbor CLI, and an approved Hugging Face token for the gated Managed-Care Operations Handbook (every task needs it; fetched at container start). Request access: https://huggingface.co/datasets/actava/managed-care-operations-handbook
Run (HF_TOKEN from shell or --env-file; -y confirms, -i picks one task):
HF_TOKEN=<approved-token> harbor run -d actava-ai/chi-bench@v1.0.1 -i actava-ai/pa_t016_t016_o001_p01_p2p_payer -a claude-code -m anthropic/claude-opus-4-7 -y
Links:
- Code/runner/CLI: https://github.com/actava-ai/chi-bench
- Fixtures dataset: https://huggingface.co/datasets/actava/chi-bench
- Handbook (gated): https://huggingface.co/datasets/actava/managed-care-operations-handbook
- Leaderboard: https://actava.ai/benchmarks/leaderboards
- Paper: https://arxiv.org/abs/2605.16679
- Harbor-hub guide: https://github.com/actava-ai/chi-bench/blob/main/docs/harbor-hub.md
χ-Bench — a benchmark of long-horizon, policy-rich U.S. healthcare workflow agents across three domains: provider prior authorization, payer utilization management, and care management.
This Harbor hub listing carries 78 single-agent tasks (75 single-domain + 3 marathon long-horizon sessions). Each task ships a self-contained Dockerfile that Harbor builds on demand — cloning the source repo and downloading the public fixtures dataset at build time — so you can run a trial without cloning anything. No hosted image or registry is involved.
Prerequisites
- Docker + the Harbor CLI.
- An approved Hugging Face token for the gated Managed-Care Operations Handbook — every task needs it. The fixtures dataset is public; the handbook is downloaded at container start using your token. Request access at actava/managed-care-operations-handbook.
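A quick preflight check can catch a missing prerequisite before Harbor starts a build. The sketch below is illustrative only: it assumes the CLIs are installed under the binary names docker and harbor, and missing_prerequisites is a hypothetical helper, not part of Harbor or χ-Bench. It can confirm only that HF_TOKEN is set, not that the token has been approved for the gated handbook.

```python
import os
import shutil


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means the basics are in place."""
    env = os.environ if env is None else env
    problems = []
    # Assumption: both CLIs are installed under these binary names.
    for binary in ("docker", "harbor"):
        if which(binary) is None:
            problems.append(f"{binary} not found on PATH")
    # Only presence is checked; approval for the gated handbook cannot be verified locally.
    if not env.get("HF_TOKEN", "").strip():
        problems.append("HF_TOKEN is not set")
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("missing:", problem)
```

The env and which parameters exist only so the check can be exercised without touching the real environment.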
Run a task
HF_TOKEN is read from your shell (or --env-file); -y auto-confirms forwarding it into the container, and -i selects one task (omit to run all 78):
HF_TOKEN=<your-approved-hf-token> harbor run \
  -d actava-ai/chi-bench@v1.0.1 \
  -i actava-ai/pa_t016_t016_o001_p01_p2p_payer \
  -a claude-code -m anthropic/claude-opus-4-7 -y
Pass the token only through HF_TOKEN as shown: -e is not the flag for this (it selects the environment type), and --ae/--ek do not reach the container entrypoint. Without an approved token the container exits early (code 78).
Standard tasks are scored by task_runtime verify; marathon_* tasks run a long-horizon session scored by session_verifier. The reward is written to /logs/verifier/reward.json.
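For scripted runs, the same invocation can be assembled in Python. This is a minimal sketch under stated assumptions: harbor_argv, expected_verifier, and load_reward are hypothetical helper names; only the flags shown above are used; the marathon_ prefix test assumes task names follow the marathon_* pattern after the actava-ai/ prefix. The schema of reward.json and the host-side location of /logs/verifier/reward.json are not described in this listing, so load_reward takes a path and returns whatever the file contains.

```python
import json
import os
import subprocess
from pathlib import Path

DATASET = "actava-ai/chi-bench@v1.0.1"  # dataset reference from the command above


def harbor_argv(task, agent="claude-code", model="anthropic/claude-opus-4-7"):
    """Build the argv for a single-task run, using only the flags shown above."""
    return ["harbor", "run", "-d", DATASET, "-i", task, "-a", agent, "-m", model, "-y"]


def expected_verifier(task):
    """Name the scorer this listing describes for a given task."""
    short_name = task.rsplit("/", 1)[-1]  # drop the "actava-ai/" prefix if present
    if short_name.startswith("marathon_"):
        return "session_verifier"
    return "task_runtime verify"


def load_reward(path):
    """Parse a reward.json file; its schema is not documented here, so return it unchanged."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    task = "actava-ai/pa_t016_t016_o001_p01_p2p_payer"
    if not os.environ.get("HF_TOKEN"):
        raise SystemExit("HF_TOKEN is not set; the container would exit early (code 78)")
    # HF_TOKEN is inherited from this process's environment, matching the shell form.
    result = subprocess.run(harbor_argv(task))
    print("scored by:", expected_verifier(task), "| harbor exit status:", result.returncode)
```

Passing the command as a list avoids shell quoting issues, and the token never appears on the command line.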
E2E is not on the hub
The 23 provider↔payer E2E arena tasks are intentionally excluded. The arena needs the two-agent dual-pa-e2e harness (provider phase → relay → payer phase), which a stock single-agent harbor run cannot drive. Run E2E from the source repo / HF dataset:
cb experiment run --dataset data/prior_auth_e2e/tasks/<id> \
  --agent dual-pa-e2e --provider-model <m> --payer-model <m>
# full arena: ./scripts/run_table.sh table2 (configs/experiments/table2_e2e_arena.yaml)
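When looping over several E2E tasks from a script, the cb command above can be built programmatically. The sketch below assumes it is run from a checkout of the source repo with the cb CLI installed; e2e_argv is a hypothetical helper, and the task id and model names in the usage lines are placeholders, not real identifiers.

```python
import shlex


def e2e_argv(task_id, provider_model, payer_model):
    """Build the cb invocation for one E2E arena task, using the flags shown above."""
    return [
        "cb", "experiment", "run",
        "--dataset", f"data/prior_auth_e2e/tasks/{task_id}",
        "--agent", "dual-pa-e2e",
        "--provider-model", provider_model,
        "--payer-model", payer_model,
    ]


if __name__ == "__main__":
    # Placeholder values: substitute a real task id and model names before running.
    argv = e2e_argv("example-task", "provider-model-name", "payer-model-name")
    print(shlex.join(argv))
```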
Where χ-Bench lives
| Resource | Location |
| --- | --- |
| Source code, runner, judge, CLI | github.com/actava-ai/chi-bench |
| Fixtures + shared worlds (dataset) | huggingface.co/datasets/actava/chi-bench |
| Managed-Care Handbook (gated) | huggingface.co/datasets/actava/managed-care-operations-handbook |
| Leaderboard | actava.ai/benchmarks/leaderboards |
| Paper | arXiv:2605.16679 |
| Harbor-hub guide | docs/harbor-hub.md |
For paper reproduction, the full experiment matrix, and leaderboard submission, use the cb CLI from the source repo. This hub listing is for ad-hoc runs and discovery.