| HN Mirror

	by GorbachevyChase 39 days ago
	Even given that, I think solving the problem would require a certain amount of personal agency and volition to drive useful experimentation, and then you still have the inescapable problem that a design process is never verifiably done; it is just a sense of taste that tells you when a product is good enough and it's time to stop working on it. I'm not sure this benchmark is even very interesting, because it requires a language model to do something that it really cannot do. Maybe it would be possible with a novel harness in an ensemble system, but I would never expect a pure language model run in a minimal harness to be able to do this.