NovitaAI/tb21-systems-security

Terminal-Bench 2.1 subset for systems, operations, security, cryptanalysis, and assistant-style operational tasks, curated for the Novita Sandbox Hackathon.

Published 6/4/2026 by Alex

harbor run -d NovitaAI/tb21-systems-security

TB2.1 Systems & Security Track

Systems administration, web/server operations, security, cryptanalysis, vulnerability repair, and operational assistant tasks from Terminal-Bench 2.1.

This public Harbor dataset is curated by NovitaAI for a Novita Sandbox hackathon track. It is a category-based subset of Terminal-Bench 2.1. Agents and models are not fixed; the only required runtime environment for the hackathon is Novita Sandbox.

Dataset

Harbor dataset: NovitaAI/tb21-systems-security
Track size: 22 tasks
Source benchmark: Terminal-Bench 2.1
Included source categories: system-administration, security, mathematics, personal-assistant
Required hackathon sandbox: -e novita

Quick Start

Run the full track once:

harbor run \
  -d NovitaAI/tb21-systems-security \
  -a <agent> \
  -m <model> \
  -e novita \
  -k 1 \
  -n 1 \
  -y

Run a small smoke test from the track:

harbor run \
  -d NovitaAI/tb21-systems-security \
  -a <agent> \
  -m <model> \
  -e novita \
  -l 1 \
  -k 1 \
  -n 1 \
  -y

Upload a public result for the hackathon leaderboard:

harbor upload jobs/<job_name> --public

Submit the resulting Harbor Hub job link to the hackathon leaderboard form.

Valid Submission Rules

A valid track submission should satisfy all of the following:

The Harbor job uses this dataset: NovitaAI/tb21-systems-security.
The job config has environment.type = "novita".
The job does not add extra hints or task-specific instructions.
The submitted Harbor Hub job is public.
The agent and model are your choice unless a specific event round says otherwise.
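
The machine-checkable rules above could be screened before submitting with something like the following Python sketch. Only the dataset name and the `environment.type = "novita"` setting come from this page; the shape of the `job` dictionary (the `dataset`, `public`, and `config` keys) is a hypothetical stand-in, not Harbor's actual job schema. The rule about extra hints cannot be verified this way and still needs a manual review.

```python
# Values taken from the submission rules on this page.
REQUIRED_DATASET = "NovitaAI/tb21-systems-security"
REQUIRED_ENVIRONMENT = "novita"


def check_submission(job: dict) -> list:
    """Return a list of rule violations; an empty list means none were found.

    The key names "dataset", "public", and "config" are hypothetical and
    only illustrate the checks; adapt them to the real job metadata.
    """
    problems = []

    if job.get("dataset") != REQUIRED_DATASET:
        problems.append("job must use dataset " + REQUIRED_DATASET)

    # Mirrors the rule: the job config has environment.type = "novita".
    env_type = job.get("config", {}).get("environment", {}).get("type")
    if env_type != REQUIRED_ENVIRONMENT:
        problems.append('environment.type must be "novita"')

    if not job.get("public", False):
        problems.append("submitted job must be public")

    return problems


example_job = {
    "dataset": "NovitaAI/tb21-systems-security",
    "public": True,
    "config": {"environment": {"type": "novita"}},
}
print(check_submission(example_job))  # prints []
```

Returning a list of problems instead of a single boolean lets a participant see every failed rule in one pass.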

Suggested ranking fields:

Primary: mean reward
Tie-breaker 1: fewer exceptions/errors
Tie-breaker 2: lower average duration
Tie-breaker 3: lower output tokens or total tokens, if the event awards an efficiency prize
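
One way to apply the suggested ranking fields is a single sort with a composite key, sketched below in Python. The `Submission` class, its field names, and the sample values are all invented for illustration; they are not real leaderboard data or a Harbor export format. The sketch uses total tokens for the third tie-breaker, which is one of the two options listed.

```python
from dataclasses import dataclass


@dataclass
class Submission:
    # Hypothetical record; field names are assumptions for this sketch.
    name: str
    mean_reward: float
    errors: int
    avg_duration_s: float
    total_tokens: int


def rank(submissions):
    """Order submissions best-first using the suggested fields.

    Mean reward is negated so that higher is better; every tie-breaker
    is "lower is better", so it is used as-is.
    """
    return sorted(
        submissions,
        key=lambda s: (-s.mean_reward, s.errors, s.avg_duration_s, s.total_tokens),
    )


# Made-up entries purely to exercise the tie-breakers.
entries = [
    Submission("team-a", 0.50, 2, 900.0, 1_200_000),
    Submission("team-b", 0.50, 1, 1200.0, 1_500_000),
    Submission("team-c", 0.64, 3, 1500.0, 2_000_000),
]

for place, s in enumerate(rank(entries), start=1):
    print(place, s.name)
# prints:
# 1 team-c
# 2 team-b
# 3 team-a
```

Here team-c wins on mean reward alone, while team-b edges out team-a on the first tie-breaker (fewer errors) despite a longer average duration.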

Tasks

terminal-bench/break-filter-js-from-html
terminal-bench/compile-compcert
terminal-bench/configure-git-webserver
terminal-bench/constraints-scheduling
terminal-bench/crack-7z-hash
terminal-bench/feal-differential-cryptanalysis
terminal-bench/feal-linear-cryptanalysis
terminal-bench/filter-js-from-html
terminal-bench/fix-code-vulnerability
terminal-bench/git-multibranch
terminal-bench/install-windows-3.11
terminal-bench/largest-eigenval
terminal-bench/mailman
terminal-bench/model-extraction-relu-logits
terminal-bench/nginx-request-logging
terminal-bench/openssl-selfsigned-cert
terminal-bench/password-recovery
terminal-bench/qemu-alpine-ssh
terminal-bench/qemu-startup
terminal-bench/sanitize-git-repo
terminal-bench/sqlite-with-gcov
terminal-bench/vulnerable-secret
