by peterbuch
78 days ago
Nice work. One thing I'd love to see in the benchmark: a breakdown by question type (aggregations vs. multi-hop joins vs. lookups). My guess is the SQL approach pulls ahead hardest on the join-heavy ones, and showing that explicitly would make the "too good to be true" results feel more grounded. Either way, the token efficiency numbers sound intriguing.
|
80% of the benchmark questions are aggregations, 16% are multi-hop, 4% are lookups/subqueries.
Multi-hop is where LLMs struggle the most (hallucinations, partial answers), and aggregations are where you get the most token efficiency, since you skip the pagination you would otherwise need with APIs/MCPs that don't provide filters.
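
To make the pagination point concrete, here is a rough sketch, not the benchmark code. The `orders` table, the `fetch_page` function, and the 1,000 synthetic rows are all made up for illustration; `fetch_page` only stands in for an unfiltered, paginated API/MCP endpoint. It compares how many rows have to pass through the caller (and so through the model's context) to answer one aggregation question.

```python
import sqlite3

# Hypothetical data: 1,000 synthetic orders (id, customer_id, amount).
# None of this comes from the benchmark itself.
ROWS = [(i, i % 5, 10.0 + (i % 7)) for i in range(1000)]


def fetch_page(offset, limit=100):
    """Stand-in for an API/MCP endpoint that paginates but cannot filter."""
    return ROWS[offset:offset + limit]


def total_via_pagination():
    """Walk every page and sum client-side; every row reaches the caller."""
    total, offset, rows_seen = 0.0, 0, 0
    while True:
        page = fetch_page(offset)
        if not page:
            break
        rows_seen += len(page)
        total += sum(amount for _, _, amount in page)
        offset += len(page)
    return total, rows_seen


def total_via_sql():
    """Push the aggregation into the query; only the result row comes back."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", ROWS)
    rows = conn.execute("SELECT SUM(amount) FROM orders").fetchall()
    conn.close()
    return rows[0][0], len(rows)


if __name__ == "__main__":
    paged_total, paged_rows = total_via_pagination()
    sql_total, sql_rows = total_via_sql()
    print(f"pagination: total={paged_total}, rows returned={paged_rows}")
    print(f"sql:        total={sql_total}, rows returned={sql_rows}")
```

Both paths give the same total, but the paginated one hands back every row while the SQL one hands back a single result row. Row count is only a proxy for tokens here; the actual savings depend on how rows are serialized.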