by Kappa90 69 days ago
It's not explicitly stated in the benchmarks README, good catch.

80% of the benchmark questions are aggregations, 16% are multi-hop, 4% are lookups/subqueries.

Multi-hop is where LLMs struggle the most (hallucinations, partial answers), and aggregations are where you get the most token efficiency, since you skip the pagination you would otherwise need with APIs/MCPs that don't provide filters.
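
To make the pagination point concrete, here's a rough sketch. Everything in it is made up for illustration (the `ORDERS` data, `fetch_page`, the page size), and it isn't how the benchmark is implemented. It uses JSON character count as a crude stand-in for tokens.

```python
import json

# Hypothetical dataset: 1,000 orders, each with a region and an amount.
ORDERS = [
    {"id": i, "region": "EU" if i % 2 == 0 else "US", "amount": 10 + i % 7}
    for i in range(1000)
]

PAGE_SIZE = 100  # assumed page size of an API that offers no filters


def fetch_page(page: int) -> list[dict]:
    """Stand-in for a paginated API/MCP endpoint without filter support."""
    start = page * PAGE_SIZE
    return ORDERS[start:start + PAGE_SIZE]


def total_via_pagination(region: str) -> tuple[int, int]:
    """Sum amounts for a region by reading every page.

    Returns (total, characters of JSON the model would have to read).
    """
    total, chars_read, page = 0, 0, 0
    while True:
        rows = fetch_page(page)
        if not rows:  # an empty page means we've walked past the last row
            break
        chars_read += len(json.dumps(rows))
        total += sum(r["amount"] for r in rows if r["region"] == region)
        page += 1
    return total, chars_read


def total_via_aggregation(region: str) -> tuple[int, int]:
    """Same answer when the backend aggregates and returns a single row."""
    total = sum(r["amount"] for r in ORDERS if r["region"] == region)
    result = {"region": region, "total_amount": total}
    return total, len(json.dumps(result))


if __name__ == "__main__":
    print("paginated: ", total_via_pagination("EU"))
    print("aggregated:", total_via_aggregation("EU"))
```

Both paths return the same total, but the paginated one has to push every row through the context window to get there, while the aggregated one only returns the final number.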