by Kappa90 116 days ago

It's not explicitly stated in the benchmarks README, good catch.

80% of the benchmark questions are aggregations, 16% are multi-hop, 4% are lookups/subqueries.

Multi-hop is where LLMs struggle the most (hallucinations, partial answers), and aggregations are where you get the most token efficiency, since you skip the pagination you'd otherwise need with APIs/MCPs that don't provide filters.
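To make the pagination point concrete, here's a rough sketch, not the benchmark's actual code. Everything in it is made up for illustration: the `ROWS` data, the `fetch_page` stand-in for an unfiltered list endpoint, and the `orders` table. It compares how much text a model would have to read to total one region's amounts by paging through everything versus receiving a single aggregate result, using response size in characters as a crude proxy for tokens.

```python
import json
import sqlite3

# Hypothetical data: 1,000 orders, invented for illustration only.
ROWS = [
    {"id": i, "region": "EU" if i % 4 == 0 else "US", "amount": i % 50}
    for i in range(1000)
]
PAGE_SIZE = 100


def fetch_page(page):
    """Stand-in for an unfiltered list endpoint: returns one page as JSON text."""
    start = page * PAGE_SIZE
    return json.dumps(ROWS[start:start + PAGE_SIZE])


def total_via_pagination(region):
    """Read every page and filter client-side, tracking characters read."""
    total, chars_read, page = 0, 0, 0
    while True:
        payload = fetch_page(page)
        items = json.loads(payload)
        if not items:
            break
        chars_read += len(payload)
        total += sum(r["amount"] for r in items if r["region"] == region)
        page += 1
    return total, chars_read


def total_via_aggregation(region):
    """Push the filter and the sum into one query; only the result comes back."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (:id, :region, :amount)", ROWS)
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM orders WHERE region = ?",
        (region,),
    ).fetchone()
    conn.close()
    payload = json.dumps({"total": total})
    return total, len(payload)


if __name__ == "__main__":
    paged_total, paged_chars = total_via_pagination("EU")
    agg_total, agg_chars = total_via_aggregation("EU")
    print(f"pagination:  total={paged_total}, chars read={paged_chars}")
    print(f"aggregation: total={agg_total}, chars read={agg_chars}")
```

Both paths return the same total, but the paginated one has to pass every row through the model's context to get there. Character counts are only a loose stand-in for tokens, so treat the gap as directional rather than exact.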