HN Mirror

	by svnt 358 days ago
	It is interesting to think about how they are achieving these scores. The evals are rated by GPT-4.1. Beyond just overfitting to benchmarks, is it possible the models are internalizing how to manipulate the ratings model/agent? Is anyone manually auditing these performance tables?