| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by sdiupIGPWEfh 110 days ago
	I was anticipating that having AI write code to pass tests (human and/or AI written tests) would be worthwhile, but in practice, I've found that even models such as Opus 4.6 Thinking, High Effort simply "cheats", or rather, fails to generalize much too often. It's occurred to me that perhaps I need some amount of randomness in the tests to keep the models honest, but it feels wrong. We'll see.