by GorbachevyChase 39 days ago
Even given that, I think solving the problem would require a certain amount of personal agency and volition to drive useful experimentation. And then you still have the inescapable problem that a design process is never verifiably done; it is just a sense of taste that tells you when a product is good enough and it’s time to stop working on it.

I’m not sure this benchmark is even very interesting, because it requires a language model to do something that it really cannot do. Maybe it would be possible with a novel harness in an ensemble system, but I would never expect a pure language model run in a minimal harness to be able to do this.