|
Hey HN! We built Agent Runner, a model-agnostic, open-source agent harness that executes the same prompt against two anonymized coding agents in parallel sandboxes. Each agent can make tool calls, edit multiple files, and self-correct through iterative reasoning. You pick the better result - this becomes the ground truth for the leaderboard. Why we built it
Traditional benchmarks often fall short for modern agentic systems: they rely on static tasks and only measure final outputs. But real coding agents modify multiple files across a repo, answer to user re-prompts, use tool calls, and recover from partial failures What Agent Runner does
You ask it to build anything
Agent Runner kicks off two generations from different sandboxed LLM providers (OpenAI, Anthropic, Google, xAI, Mistral, Kimi, and more)
Anonymized models make tool calls, multi-file edits, and cater to reprompts
You pick your favorite - this preference powers the benchmark Because different providers handle tool calls, prompts, and execution semantics differently, we worked with each provider to ensure configurations reflect intended behavior. These provider-specific setups remain private, but Agent Runner itself is open-source. How to try it
Kick off Agent Runner at https://www.designarena.ai/agentarena
Repo at https://github.com/Design-Arena/agent-runner
Use it as a CLI tool: https://pypi.org/project/agent-runner/
pip install agent-runner
agentrunner run “create a nextjs replica of Discord” We hope this provides a provider-agnostic, framework-agnostic, realistic benchmark for state-of-the-art coding agents. Video demo: https://youtu.be/rdtiuCHatjs |