We created API-Bench to test how well LLMs execute against APIs

How well can agents work with APIs they’ve never seen before? We tested 41 APIs across 8 different LLMs to find out.

API execution is great for benchmarking because it tests core qualities and limitations of LLMs: the depth of the data they were trained on, their stateless architecture, context dependency, and reasoning.

Today we're releasing v2 of API-Bench: a benchmark that tests how well LLMs can execute against APIs. Here are the results: https://superglue.ai/benchmark_v2

TL;DR: LLMs fail at integrations because they lack ground truth, state, debugging ability, and access to real system context, which are all things API integrations fundamentally require.

Here’s what we found:

1. LLMs are only as good as the data they’re trained on: when docs change, APIs evolve, or systems are niche/long-tail, they use outdated patterns, guess missing pieces, and hallucinate endpoints and parameters.

2. LLMs are stateless, but integrations are stateful: auth handshakes, pagination, retries, and multi-step flows all need memory, but LLMs can’t persist intermediate values or reason across steps.

3. LLMs produce code that “looks right” but fails at runtime: they can’t isolate the failing step or interpret real error messages, so they can’t fix what’s broken or retry with new hypotheses.

4. LLMs can’t reliably interpret imperfect API design: where a human can infer the intended function, an LLM will hallucinate something that looks reasonable.
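
To make point 2 concrete, here is a minimal sketch of the kind of state a paginated call needs. Everything in it is hypothetical and for illustration only: `fake_list_endpoint`, `fetch_all`, and the `next_cursor` field are made-up names, not part of API-Bench or any real API. The point is just that each request depends on a value returned by the previous one, so something has to carry it forward.

```python
from typing import Callable, Optional

# Hypothetical in-memory stand-in for a paginated endpoint (not a real API).
RECORDS = [{"id": i} for i in range(1, 8)]


def fake_list_endpoint(cursor: Optional[int] = None, limit: int = 3) -> dict:
    """Return one page of records plus the cursor for the next page, if any."""
    start = cursor or 0
    page = RECORDS[start:start + limit]
    has_more = start + limit < len(RECORDS)
    return {"items": page, "next_cursor": start + limit if has_more else None}


def fetch_all(list_page: Callable[..., dict], limit: int = 3) -> list:
    """Walk every page, carrying the cursor from one response to the next."""
    items = []
    cursor = None  # state that must survive between requests
    while True:
        resp = list_page(cursor=cursor, limit=limit)
        items.extend(resp["items"])
        cursor = resp["next_cursor"]
        if cursor is None:  # the server signals there are no more pages
            break
    return items


all_items = fetch_all(fake_list_endpoint)
```

Drop the `cursor` variable and the loop either refetches the first page forever or stops after one page. That is the same failure you get when nothing persists intermediate values between steps.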

We open sourced the benchmark so you can test your own APIs or contribute new ones: https://github.com/superglue-ai/superglue/tree/main/eval/llm...

Curious to hear your experience, and of course always happy to share more learnings.