by Kappa90
70 days ago
It's an experimental benchmark; I couldn't find any off-the-shelf benchmark to use with this. There's Spider 2.0, but it's for text-to-SQL. I'm planning to run this [1] next, but it's quite expensive. There are 75 questions, divided into 5 use-case groups: revenue ops, e-commerce, knowledge bases, devops, support. I then generated a synthetic dataset mimicking data from APIs such as Stripe, Hubspot, Shopify, and Zendesk. I expose all the data through Dinobase and compare that against having one MCP per source (e.g. one MCP for Stripe data, one MCP for Hubspot data, etc.). I tested this with 11 models, ranging from Kimi 2.5 to Claude Opus 4.6. Finally, an LLM-as-a-judge decides whether each answer is correct, and I log latency and tokens.

[1] https://arxiv.org/abs/2510.02938
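For anyone curious what a harness like this might look like, here is a minimal Python sketch of the loop described above: run each question through an agent, have a judge grade the answer, and record latency and tokens per question. It is not the actual benchmark code. `Question`, `Result`, `run_benchmark`, `accuracy_by_group`, and the `agent`/`judge` callables are all hypothetical names; in a real run the agent would wrap a model plus its tools (Dinobase or per-source MCPs) and the judge would wrap an LLM call.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Question:
    group: str      # use-case group, e.g. "revenue ops" or "support"
    prompt: str     # the question given to the agent
    reference: str  # expected answer handed to the judge


@dataclass
class Result:
    group: str
    correct: bool
    latency_s: float
    tokens: int


# Assumed signatures (not from the original benchmark):
# an agent takes a prompt and returns (answer, tokens_used);
# a judge takes (prompt, reference, answer) and returns True if correct.
Agent = Callable[[str], Tuple[str, int]]
Judge = Callable[[str, str, str], bool]


def run_benchmark(questions: List[Question], agent: Agent, judge: Judge) -> List[Result]:
    """Run every question once, timing the agent call and grading the answer."""
    results = []
    for q in questions:
        start = time.perf_counter()
        answer, tokens = agent(q.prompt)
        latency = time.perf_counter() - start
        correct = judge(q.prompt, q.reference, answer)
        results.append(Result(q.group, correct, latency, tokens))
    return results


def accuracy_by_group(results: List[Result]) -> Dict[str, float]:
    """Fraction of correct answers per use-case group."""
    totals: Dict[str, Tuple[int, int]] = {}
    for r in results:
        hits, n = totals.get(r.group, (0, 0))
        totals[r.group] = (hits + int(r.correct), n + 1)
    return {group: hits / n for group, (hits, n) in totals.items()}
```

Running the same question list once per setup (one agent wired to a single data layer, one wired to per-source MCPs) and once per model would give comparable accuracy, latency, and token numbers.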