Hacker News
by aaronSong 239 days ago
What I liked here is how unglamorous it is: tiny prompts, context only when needed, evaluate against real usage, and resist multi‑agent stuff until a single boring pipeline is stable. They also pair rule checks with model checks and expect some reward‑hacking. Curious how others keep evals from drifting in production.
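
For anyone wondering what pairing rule checks with model checks might look like, here is a minimal sketch. It is not the article's code. Every name in it (`EvalResult`, `rule_check`, `stub_model_check`, the 500-character limit, the "TODO" and "refund" rules) is a made-up assumption for illustration. The idea is that a cheap deterministic check and a model-based grader both score each output, and disagreements get routed to a human, since that is roughly where reward hacking would tend to show up.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalResult:
    rule_pass: bool
    model_pass: bool

    @property
    def verdict(self) -> str:
        # Both checks agree: trust the result.
        if self.rule_pass and self.model_pass:
            return "pass"
        if not self.rule_pass and not self.model_pass:
            return "fail"
        # The checks disagree: send to a human instead of guessing.
        return "review"


def rule_check(output: str) -> bool:
    # Hypothetical deterministic rules: non-empty, bounded length,
    # and no leftover placeholder text.
    text = output.strip()
    return bool(text) and len(text) <= 500 and "TODO" not in text


def stub_model_check(output: str) -> bool:
    # Stand-in for an LLM grader. A real one would call a model;
    # this stub only checks that the answer mentions the topic.
    return "refund" in output.lower()


def evaluate(output: str, model_check: Callable[[str], bool]) -> EvalResult:
    # Run both checks independently so neither can mask the other.
    return EvalResult(rule_check(output), model_check(output))
```

Something like `evaluate(answer, stub_model_check).verdict` then yields "pass", "fail", or "review", and the "review" bucket, sampled from real usage, is one place to look when checking whether the evals have drifted.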