by Kappa90
69 days ago
Good catch, it's not explicitly stated in the benchmarks README. 80% of the benchmark questions are aggregations, 16% are multi-hop, and 4% are lookups/subqueries. Multi-hop is where LLMs struggle the most (hallucinations, partial answers), and aggregations are where you get the most token efficiency, since you skip the pagination you'd otherwise need with APIs/MCPs that don't provide filters.
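
To make the pagination point concrete, here is a rough sketch. Everything in it is hypothetical: the `ORDERS` data, the page size, and both functions are made up for illustration and are not taken from the benchmark. It only shows why an unfiltered, paginated endpoint pushes every row through the client (and so through the model's context), while an aggregate returns a single row.

```python
# Hypothetical in-memory data; nothing here comes from the benchmark itself.
ORDERS = [{"id": i, "amount": 10 + (i % 5)} for i in range(1000)]
PAGE_SIZE = 100  # assumed page size of the unfiltered list endpoint


def fetch_page(page: int) -> list[dict]:
    """Stand-in for a list endpoint without filters: returns one page of raw rows."""
    start = page * PAGE_SIZE
    return ORDERS[start:start + PAGE_SIZE]


def total_via_pagination() -> tuple[int, int]:
    """Walk every page; return (total amount, rows that passed through the client)."""
    total, rows_seen, page = 0, 0, 0
    while True:
        rows = fetch_page(page)
        if not rows:  # an empty page means we ran past the last one
            break
        total += sum(r["amount"] for r in rows)
        rows_seen += len(rows)
        page += 1
    return total, rows_seen


def total_via_aggregation() -> tuple[int, int]:
    """Stand-in for a server-side aggregate: only one result row comes back."""
    return sum(r["amount"] for r in ORDERS), 1
```

Both paths give the same total, but the paginated one handles 1000 rows across 10 pages and the aggregate handles one. Row count is only a proxy for tokens here; the real ratio depends on how each row is serialized.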