Hi HN! Adina here from superglue. Today I’d like to share a new benchmark we’ve just open sourced: an Agent-API Benchmark, in which we test how well LLMs handle APIs.

tl;dr: LLMs suck at writing code to use APIs. We ran 630 integration tests across 21 common APIs (Stripe, Slack, GitHub, etc.) using 6 different LLMs. Here are our key findings:
- Best general LLM: 68% success rate. That's 1 in 3 API calls failing. Would you ship that?
- Our integration layer scored a 91% success rate, showing us that just throwing bigger/better LLMs at the problem won't solve it.
- Only 6 out of 21 APIs worked 100% of the time; every other API had failures.
- Anthropic’s models are significantly better at building API integrations than other providers' models.

What makes LLMs fail hard:
- Lack of context: LLMs are just not great at understanding what API endpoints exist and what they do, even if you give them documentation (which we did).
- Multi-step workflows (chaining API calls)
- Complex API design: APIs like Square, PostHog, and Asana (forcing project selection, among other things, trips LLMs up)

We've open-sourced the benchmark so you can test any API and see where it ranks: https://github.com/superglue-ai/superglue/tree/main/packages...

Check out the repo, consider giving it a star, or see the full ranking at https://superglue.ai/api-ranking/

If you're building agents that need reliable API access, we'd love to hear your approach - or you can try our integration layer at superglue.ai.

Next up: benchmarking MCP.