Datasets
DeepSearchQA is a 900-prompt factuality benchmark from Google DeepMind for evaluating deep research agents on difficult multi-step information-seeking tasks across 17 fields.
harbor run -d kgmon/deepsearchqa
Each Harbor task from the kgmon/deepsearchqa dataset uses a fixed instruction prompt, instruction.md, that asks the agent to answer one DeepSearchQA problem and write the final response to /workspace/answer.txt.
The task prompt template is:
Answer the following question. You may use the web or any available tools needed to answer the question.
Write your final answer to `/workspace/answer.txt`.
## Question
{problem}
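As an illustration of how the template above could be filled in for a single task, here is a minimal Python sketch. The names `PROMPT_TEMPLATE` and `render_instruction` are hypothetical and not taken from the adapter; the only detail drawn from the source is the template text and its `{problem}` placeholder.

```python
# Hypothetical sketch: fill the fixed template with one DeepSearchQA question.
PROMPT_TEMPLATE = (
    "Answer the following question. You may use the web or any available "
    "tools needed to answer the question.\n"
    "\n"
    "Write your final answer to `/workspace/answer.txt`.\n"
    "\n"
    "## Question\n"
    "\n"
    "{problem}\n"
)


def render_instruction(problem: str) -> str:
    """Return the instruction.md text for a single question."""
    # str.replace is used instead of str.format so that any literal braces
    # inside the question text are left untouched.
    return PROMPT_TEMPLATE.replace("{problem}", problem.strip())


if __name__ == "__main__":
    print(render_instruction("Which of these cities hosted the event twice?"))
```

Using plain substitution keeps the rest of the prompt identical across tasks, which matches the "fixed instruction prompt" described above.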
The verifier is adapted from the Kaggle Evaluation Starter Code.
The verifier uses hidden metadata derived from DSQA-full.csv, including the
gold answer and answer type. These fields are not exposed in instruction.md.
It uses the official DeepSearchQA grading prompt and calls gemini-2.5-flash by default. Set one of these environment variables before running:
export GEMINI_API_KEY="..."
# or
export GOOGLE_API_KEY="..."
Optional override:
export DEEPSEARCHQA_GRADER_MODEL="gemini-2.5-flash"
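The environment variables above could be resolved along the following lines. This is a sketch under assumptions: the function name `resolve_grader_config` is hypothetical, and the precedence of GEMINI_API_KEY over GOOGLE_API_KEY when both are set is a guess, not something the source states.

```python
import os

DEFAULT_GRADER_MODEL = "gemini-2.5-flash"  # default named in the text above


def resolve_grader_config(env=None):
    """Return (api_key, model) for the grader from environment variables."""
    env = os.environ if env is None else env
    # Assumption: GEMINI_API_KEY is checked first, then GOOGLE_API_KEY.
    api_key = env.get("GEMINI_API_KEY") or env.get("GOOGLE_API_KEY")
    if not api_key:
        raise RuntimeError(
            "Set GEMINI_API_KEY or GOOGLE_API_KEY before running the verifier."
        )
    # An unset or empty override falls back to the default model.
    model = env.get("DEEPSEARCHQA_GRADER_MODEL") or DEFAULT_GRADER_MODEL
    return api_key, model
```

Passing a plain dictionary as `env` makes the lookup easy to exercise without touching the real process environment.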
The verifier writes reward.json with reward, fully_correct, fully_incorrect,
correct_with_excessive_answers, precision, recall, f1_score, and
grader_valid.
- fully_correct: all expected answers were found with no excessive answers.
- fully_incorrect: none of the expected answers were found.
- correct_with_excessive_answers: all expected answers were found, but extra answers were also supplied.

The primary reward is per-task F1.
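To make the relationship between these fields concrete, here is a minimal sketch of how they might be derived once the grader has judged a response. It assumes the grader output reduces to three counts (expected answers, expected answers found, and extra answers supplied) and uses the standard precision, recall, and F1 definitions; the function `score_answer` and its parameters are hypothetical, and `grader_valid` is not modelled.

```python
def score_answer(num_expected: int, num_matched: int, num_excessive: int) -> dict:
    """Derive reward.json-style fields from grader counts (illustrative only)."""
    if num_expected < 1 or not 0 <= num_matched <= num_expected or num_excessive < 0:
        raise ValueError("counts are inconsistent")

    supplied = num_matched + num_excessive
    precision = num_matched / supplied if supplied else 0.0
    recall = num_matched / num_expected
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0

    all_found = num_matched == num_expected
    return {
        "reward": f1,  # the primary reward is per-task F1
        "fully_correct": all_found and num_excessive == 0,
        "fully_incorrect": num_matched == 0,
        "correct_with_excessive_answers": all_found and num_excessive > 0,
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }
```

Under these assumptions, a response that finds both of two expected answers but adds two extras has precision 0.5, recall 1.0, and an F1 of about 0.667, and is flagged as correct_with_excessive_answers rather than fully_correct.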
harbor run -p datasets/deepsearchqa -a <agent> -m "<model>"
DeepSearchQA tasks intentionally omit solution/solve.sh; the gold answer is
hidden in verifier metadata and is not part of the agent runtime.
DSQA-full.csv