|
We’ve open-sourced a benchmark for LLM-driven web agent setups. It evaluates agents on real-world tasks, such as logging in, scraping dashboards, and submitting forms, and scores them on structured criteria: success rate, latency, and task reliability. Everything is fully reproducible, with all outputs, logs, and evaluation data available: https://github.com/nottelabs/open-operator-evals Feedback, critiques, and contributions are welcome :) |
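
For anyone wondering how the three criteria could be aggregated from raw run logs, here is a minimal sketch. It is not the code from the repository: the `Run` record, the `summarize` helper, and the task names are hypothetical. The definition of reliability used here (the share of tasks that succeed on every repeated attempt) is also an assumption, and the benchmark may define it differently.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Run:
    """One attempt at one task (hypothetical record shape, not the repo's schema)."""
    task: str       # task identifier, e.g. "login"
    success: bool   # whether the agent completed the task
    seconds: float  # wall-clock time for the attempt


def summarize(runs):
    """Aggregate per-attempt records into the three headline metrics."""
    # Group attempts by task so reliability can be judged per task.
    by_task = defaultdict(list)
    for run in runs:
        by_task[run.task].append(run)

    return {
        # Fraction of all attempts that succeeded.
        "success_rate": mean(1.0 if r.success else 0.0 for r in runs),
        # Mean wall-clock time per attempt, in seconds.
        "mean_latency_s": mean(r.seconds for r in runs),
        # ASSUMPTION: a task counts as reliable only if every attempt succeeded.
        "reliability": mean(
            1.0 if all(r.success for r in attempts) else 0.0
            for attempts in by_task.values()
        ),
    }


example_runs = [
    Run("login", True, 10.0),
    Run("login", True, 14.0),
    Run("submit_form", True, 20.0),
    Run("submit_form", False, 40.0),
]
summary = summarize(example_runs)
```

Keeping success rate and reliability separate is the point of the sketch: an agent that passes a task three times out of four looks good on success rate but would still be flagged as unreliable on that task.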