aarri-bench (act as a real research intern): a benchmark for evaluating LLM agents on academic research tasks
harbor run -d aarr/aarri-bench