by pranabsarkar
57 days ago
Fixed eval: 80 tools, 200 queries, 4 model sizes. The +10pp came from "all tools" vs "tiered" on the 1.5B model. You're right about stability; I haven't run rotated/rephrased evals yet. The 89% baseline (when models knew where to look) suggests selection capability is fine, but I'd expect some regression with adversarial prompts.
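For reference, a rotated eval could be sketched roughly as below. This is a minimal sketch and not the harness behind the numbers above: `rotated_accuracy`, `select_tool`, and the `(query, expected_tool)` pair format are all hypothetical names chosen for illustration. It only covers rotating the tool order; rephrased queries would need a separate paraphrase set.

```python
import random
from typing import Callable, Sequence


def rotated_accuracy(
    select_tool: Callable[[str, Sequence[str]], str],  # hypothetical: wraps the model call
    tools: Sequence[str],
    queries: Sequence[tuple[str, str]],  # (query, expected tool name)
    n_rotations: int = 5,
    seed: int = 0,
) -> list[float]:
    """Return one accuracy score per shuffled tool ordering.

    A wide spread across orderings suggests the result depends on where
    a tool sits in the list rather than on selection ability.
    """
    if not queries:
        raise ValueError("queries must not be empty")
    rng = random.Random(seed)  # seeded so the orderings are reproducible
    scores = []
    for _ in range(n_rotations):
        order = list(tools)
        rng.shuffle(order)  # present the same tools in a new order
        hits = sum(select_tool(q, order) == expected for q, expected in queries)
        scores.append(hits / len(queries))
    return scores
```

Comparing the min and max of the returned scores against the fixed-order number would give a first read on stability.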