| HN Mirror

	by BoorishBears 68 days ago
	Their benchmark is chock-full of things like that. It's deeply flawed and essentially rates how LLMs perform when you go out of your way to hold them entirely the wrong way.