| HN Mirror



	by peterbuch 78 days ago
	Nice work. One thing I'd love to see in the benchmark: a breakdown by question type (aggregations vs. multi-hop joins vs. lookups). My guess is the SQL approach pulls ahead hardest on the join-heavy ones, and showing that explicitly would make the "too good to be true" results feel more grounded. Either way, the token efficiency numbers sound intriguing.

1 comment

Kappa90 78 days ago

It's not explicitly stated in the benchmarks README, good catch.

80% of the benchmark questions are aggregations, 16% are multi-hop, 4% are lookups/subqueries.

Multi-hop is where LLMs struggle the most (hallucinations, partial answers), and aggregations are where you get the most token efficiency, since you skip the pagination you'd otherwise need with APIs/MCPs that don't provide filters.
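
To make the pagination point concrete, here is a rough sketch, not the benchmark's actual code. It assumes a hypothetical `orders` table and a made-up page size of 100, uses an in-memory SQLite database to stand in for both sides, and treats "rows returned to the model" as a crude proxy for tokens.

```python
import sqlite3

PAGE_SIZE = 100  # assumption: page size of a hypothetical API with no filters


def make_db(n_rows=250):
    """Build an in-memory table of toy orders (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany(
        "INSERT INTO orders (id, amount) VALUES (?, ?)",
        [(i, i % 10) for i in range(1, n_rows + 1)],
    )
    return conn


def total_via_pagination(conn):
    """Simulate an API without filters: fetch every page, sum client-side."""
    total, rows_seen, offset = 0, 0, 0
    while True:
        page = conn.execute(
            "SELECT id, amount FROM orders ORDER BY id LIMIT ? OFFSET ?",
            (PAGE_SIZE, offset),
        ).fetchall()
        if not page:
            break
        rows_seen += len(page)  # every one of these rows lands in the context
        total += sum(amount for _, amount in page)
        offset += PAGE_SIZE
    return total, rows_seen


def total_via_sql(conn):
    """Push the aggregation into the query: a single row comes back."""
    (total,) = conn.execute("SELECT SUM(amount) FROM orders").fetchone()
    return total, 1


conn = make_db()
paged_total, paged_rows = total_via_pagination(conn)
sql_total, sql_rows = total_via_sql(conn)
print(f"paginated: total={paged_total}, rows returned={paged_rows}")
print(f"sql:       total={sql_total}, rows returned={sql_rows}")
```

Both paths give the same total, but the paginated one has to pull every row through the context first, and that gap grows with table size.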
