RExBench: 2 tasks (cogs, othello) that evaluate AI agents' ability to extend existing AI research
by implementing research experiments. Each task requires an A100 40GB GPU and may take multiple
hours (oracle runtime: cogs ~2.5h; othello ~45min).
Parity (openhands@1.4.0 with anthropic/claude-sonnet-4-5, 3 runs): cogs 100% on Harbor vs 66.67%
original; othello 100% on Harbor vs 100% original.
Adapter by Nicholas Edwards (nicholas.edwards@univie.ac.at), one of the original RExBench authors.
Acknowledgements: We thank 2077 AI for generously funding API credits to support running parity experiments.