HN Mirror

	by pranabsarkar 105 days ago
	Fixed eval: 80 tools, 200 queries, 4 model sizes. The +10pp came from "all tools" vs "tiered" on the 1.5B model. You're right about stability; I haven't run rotated or rephrased evals yet. The 89% baseline (when models knew where to look) suggests selection capability is fine, but I'd expect some regression with adversarial prompts.
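
For what it's worth, here is a rough sketch of what a rotated-order check could look like. It is not the harness used for the numbers above: the tool names, queries, and both selector functions are made-up stand-ins for a real model call, and "rotated" is assumed here to mean cyclically shifting the order in which tools are listed.

```python
from typing import Callable, List, Sequence, Tuple

Selector = Callable[[str, Sequence[str]], str]

# Hypothetical toy data; a real run would use the full tool list and query set.
TOOLS = ["search", "calendar", "email", "weather"]
QUERIES: List[Tuple[str, str]] = [
    ("search the web for cats", "search"),
    ("search for flights", "search"),
    ("check weather tomorrow", "weather"),
    ("send email to bob", "email"),
]


def accuracy(select: Selector, tools: Sequence[str],
             queries: Sequence[Tuple[str, str]]) -> float:
    """Fraction of queries for which the selector picks the expected tool."""
    hits = sum(1 for query, expected in queries if select(query, tools) == expected)
    return hits / len(queries)


def rotated_accuracies(select: Selector, tools: Sequence[str],
                       queries: Sequence[Tuple[str, str]]) -> List[float]:
    """Accuracy for every cyclic rotation of the tool list."""
    tools = list(tools)
    return [accuracy(select, tools[k:] + tools[:k], queries)
            for k in range(len(tools))]


def spread_pp(accuracies: Sequence[float]) -> float:
    """Gap between best and worst rotation, in percentage points."""
    return (max(accuracies) - min(accuracies)) * 100


def keyword_selector(query: str, tools: Sequence[str]) -> str:
    """Stand-in for an order-insensitive model: picks the tool named in the query."""
    for tool in tools:
        if tool in query:
            return tool
    return tools[0]


def first_tool_selector(query: str, tools: Sequence[str]) -> str:
    """Stand-in for a position-biased model: always picks the first listed tool."""
    return tools[0]


if __name__ == "__main__":
    for name, selector in [("keyword", keyword_selector),
                           ("first-tool", first_tool_selector)]:
        accs = rotated_accuracies(selector, TOOLS, QUERIES)
        print(name, accs, f"spread={spread_pp(accs):.1f}pp")
```

A spread near zero across rotations would suggest the result is not an artifact of tool ordering; a large spread would mean a single-order number like the +10pp should be read with caution. A rephrased-query variant would follow the same shape, swapping the rotation loop for a set of paraphrases per query.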