
It's an experimental benchmark; I couldn't find an off-the-shelf benchmark to use with this. There's Spider 2.0, but it's for text-to-SQL. I'm planning to run this [1] next, but it's quite expensive.

There are 75 questions, divided into 5 use-case groups: revenue ops, e-commerce, knowledge bases, devops, support.

I then generated a synthetic dataset with data mimicking APIs ranging from Stripe to Hubspot to Shopify to Zendesk, etc.

I expose all the data through Dinobase and compare that against having one MCP per source, e.g. one MCP for Stripe data, one MCP for Hubspot data, etc.

I tested this with 11 models, ranging from Kimi 2.5 to Claude Opus 4.6.

Finally, there's an LLM-as-a-judge that decides whether each answer is correct, and I log latency and tokens.
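
A minimal sketch of what an evaluation loop of this shape could look like. This is not the actual harness: `RunRecord`, `evaluate`, `accuracy_by_group`, and the question dict fields are all hypothetical names, and the `agent` and `judge` callables are placeholders for a real model call and a real LLM judge.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class RunRecord:
    """One logged result per question (hypothetical schema)."""
    question_id: str
    group: str
    correct: bool
    latency_s: float
    tokens: int


def evaluate(
    questions: List[dict],
    agent: Callable[[str], Tuple[str, int]],
    judge: Callable[[str, str, str], bool],
) -> List[RunRecord]:
    """Run each question through the agent, time it, and ask the judge for a verdict."""
    records = []
    for q in questions:
        start = time.perf_counter()
        answer, tokens = agent(q["prompt"])  # assumed to return (answer, token count)
        latency = time.perf_counter() - start
        correct = judge(q["prompt"], q["expected"], answer)
        records.append(RunRecord(q["id"], q["group"], correct, latency, tokens))
    return records


def accuracy_by_group(records: List[RunRecord]) -> Dict[str, float]:
    """Fraction of answers judged correct within each use-case group."""
    totals: Dict[str, Tuple[int, int]] = {}
    for r in records:
        hits, n = totals.get(r.group, (0, 0))
        totals[r.group] = (hits + int(r.correct), n + 1)
    return {group: hits / n for group, (hits, n) in totals.items()}


# Stand-ins so the sketch runs offline; a real run would call a model and an LLM judge.
def stub_agent(prompt: str) -> Tuple[str, int]:
    return ("42", 10)


def stub_judge(prompt: str, expected: str, answer: str) -> bool:
    return expected == answer
```

Running the same `evaluate` call once per model and once per setup (single data layer vs. one MCP per source) would give comparable accuracy, latency, and token numbers for each combination.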

[1] https://arxiv.org/abs/2510.02938