It's an experimental benchmark; I couldn't find any off-the-shelf benchmark to use with this. There's Spider 2.0, but it's for text-to-SQL. I'm planning to run this [1] next, but it's quite expensive.
There are 75 questions, divided into 5 use-case groups: revenue ops, e-commerce, knowledge bases, devops, support.
I then generated a synthetic dataset with data mimicking APIs such as Stripe, Hubspot, Shopify, Zendesk, etc.
I expose all the data through Dinobase and compare that against having one MCP per source, e.g. one MCP for Stripe data, one MCP for Hubspot data, etc.
I tested this with 11 models, ranging from Kimi 2.5 to Claude Opus 4.6.
Finally, there's an LLM-as-a-judge that decides whether each answer is correct, and I log latency and token usage.
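For illustration, a loop like the one described could be sketched roughly as below. This is not the actual harness: `run_benchmark`, `summarize`, the question fields (`group`, `question`, `expected`), and the `agent` / `judge` callables are all hypothetical names, and the judge is assumed to return a plain boolean.

```python
import time
from dataclasses import dataclass
from statistics import mean


@dataclass
class Result:
    group: str        # use-case group, e.g. "support"
    correct: bool     # verdict returned by the judge
    latency_s: float  # wall-clock time of the agent call
    tokens: int       # token count reported by the agent


def run_benchmark(questions, agent, judge):
    """Run every question through `agent`, then grade it with `judge`.

    Assumed (hypothetical) interfaces:
      agent(question) -> (answer, tokens_used)
      judge(question, expected, answer) -> bool
    """
    results = []
    for q in questions:
        start = time.perf_counter()
        answer, tokens = agent(q["question"])
        latency = time.perf_counter() - start
        correct = judge(q["question"], q["expected"], answer)
        results.append(Result(q["group"], correct, latency, tokens))
    return results


def summarize(results):
    """Aggregate accuracy, mean latency, and mean tokens per group."""
    summary = {}
    for group in sorted({r.group for r in results}):
        rows = [r for r in results if r.group == group]
        summary[group] = {
            "accuracy": mean(1.0 if r.correct else 0.0 for r in rows),
            "mean_latency_s": mean(r.latency_s for r in rows),
            "mean_tokens": mean(r.tokens for r in rows),
        }
    return summary
```

Running the same question set once per model and once per setup (single Dinobase endpoint vs. one MCP per source) would then give comparable per-group numbers.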
[1] https://arxiv.org/abs/2510.02938