| HN Mirror

	by warwickmcintosh 80 days ago
	LLM-as-judge drifts in weird ways if you don't have ground truth to calibrate against. Good that you've got that built in. Would love to see eval stability tracking over time though: the same prompt on a different day sometimes gives different scores.
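
A rough sketch of what that stability tracking could look like, assuming judge scores are logged per day for a fixed set of items and a ground-truth score exists for each item. The function name `stability_report`, the data layout, and the `drift_threshold` default are all hypothetical, not taken from any particular eval tool.

```python
from statistics import mean, pstdev


def stability_report(runs, ground_truth, drift_threshold=0.5):
    """Summarize judge stability across repeated runs.

    runs: {day: {item_id: judge_score}}  (hypothetical logging format)
    ground_truth: {item_id: reference_score}
    drift_threshold: assumed cutoff on per-item standard deviation
    """
    # Calibration: mean absolute error against ground truth, per day.
    error_by_day = {}
    for day, scores in sorted(runs.items()):
        errors = [abs(s - ground_truth[i]) for i, s in scores.items()
                  if i in ground_truth]
        error_by_day[day] = mean(errors) if errors else None

    # Stability: how much each item's score moves between days.
    items = set()
    for scores in runs.values():
        items.update(scores)
    spread = {}
    for item in items:
        values = [scores[item] for scores in runs.values() if item in scores]
        spread[item] = pstdev(values) if len(values) > 1 else 0.0

    # Items whose score varies more than the threshold across days.
    unstable = sorted(i for i, sd in spread.items() if sd > drift_threshold)
    return error_by_day, spread, unstable


runs = {
    "day1": {"a": 4, "b": 2},
    "day2": {"a": 4, "b": 4},
}
ground_truth = {"a": 4, "b": 3}
error_by_day, spread, unstable = stability_report(runs, ground_truth)
print(error_by_day, unstable)
```

Here item "b" gets flagged: its mean error against ground truth is the same on both days, but its score swings between runs, which is the kind of same-prompt-different-day variation a per-day average alone would hide.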