by ursAxZA
172 days ago
If a model eventually scores perfectly on every benchmark yet turns out to be practically useless, what’s the next step? Benchmarks measure competence inside a predefined problem space,
but real scientific and engineering work isn’t bounded; it keeps changing underneath you. At some point we don’t just need a system that knows how to solve problems in theory;
we need one that can actually do something with that ability: making the coffee when we want coffee,
not just getting a perfect score on a coffee-theory exam.