| I'm seeing this too. I have a SQL agent, and my tests with 3.5 are hitting query budget limits that have never been hit before. On average, to answer the same question, 3.5 spends 10x more on SQL queries than gemini-3-flash-preview. The query patterns can be extremely degenerate too. E.g. the agent will hit the semantic layer tool to pull the schema, then run `SELECT * FROM table LIMIT 1`, which hits the query budget limit and fails. I only started looking this morning, so I still need to do a full eval, but the initial results match what your benchmark shows. --- Side note: your benchmark has an issue. On Q1 medium the model returned a gross margin of 0.127 instead of 12.7 (%), and the benchmark failed it. The failures on Q9 and Q21 are the same (I didn't check other questions). Nowhere in the prompt did you specify that you wanted the values converted to percentage points and rounded. If you asked me to write that SQL with that prompt, I would format it the same way gemini-flash did, unless you were throwing it directly into a visualization. If I were pulling it into a spreadsheet or visualization tool, the fractional format is preferable because it's easier to format in a client application. The other failures, like Q21 incorrectly averaging the list price, are legitimate failures. |