| HN Mirror

	by aaronSong 239 days ago
	What I liked here is how unglamorous it is: tiny prompts, context only when needed, evaluate against real usage, and resist multi‑agent stuff until a single boring pipeline is stable. They also pair rule checks with model checks and expect some reward‑hacking. Curious how others keep evals from drifting in production.
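
	For anyone wondering what "pair rule checks with model checks" could look like, here is a minimal sketch, not the authors' actual setup. Every name in it (`EvalResult`, `rule_check`, `evaluate`, `stub_model_check`) is hypothetical, the 500-character limit is an arbitrary assumption, and the model check is a keyword stub standing in for a real LLM grader. The idea is that a cheap deterministic rule and a model-based grader both score the same output, and cases where they disagree get flagged for a human to look at, since that is one place reward-hacking might show up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    rule_pass: bool   # verdict of the deterministic rule check
    model_pass: bool  # verdict of the model-based check

    @property
    def passed(self) -> bool:
        # An output only counts as good if both checks accept it.
        return self.rule_pass and self.model_pass

    @property
    def disagreement(self) -> bool:
        # The two checks disagree: worth a manual look.
        return self.rule_pass != self.model_pass


def rule_check(output: str, max_chars: int = 500) -> bool:
    # Hypothetical rule: output must be non-empty and within a length limit.
    text = output.strip()
    return bool(text) and len(text) <= max_chars


def evaluate(output: str, model_check: Callable[[str], bool]) -> EvalResult:
    # Run both checks on the same output and keep the verdicts separate.
    return EvalResult(rule_pass=rule_check(output), model_pass=model_check(output))


def stub_model_check(output: str) -> bool:
    # Stand-in for an LLM grader (assumption): accepts anything mentioning "refund".
    return "refund" in output.lower()


if __name__ == "__main__":
    samples = ["Your refund was issued.", "x" * 600 + " refund", ""]
    for sample in samples:
        result = evaluate(sample, stub_model_check)
        print(len(sample), result.passed, result.disagreement)
```

	Keeping the two verdicts separate instead of collapsing them into one score is a design choice: the disagreement rate can then be tracked over time on real traffic as one rough signal that the eval is drifting.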