grafana/o11y-bench
o11y-bench: an open agentic observability benchmark. Measures how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.
harbor run -d grafana/o11y-bench
o11y-bench is an open agentic observability benchmark from Grafana.
It measures how well AI agents perform 63 real-world observability tasks across logs, metrics, traces, dashboards, and incident workflows.
Source, documentation, and usage instructions are available in the public GitHub repository: