|
|
|
|
|
by mikkelam
87 days ago
|
|
Why are we treating LLM evaluation like a vibe check rather than an engineering problem? Most "Model X > Model Y" takes on HN these days (and everywhere) seem based on an hour of unscientific manual prompting. Are we actually running rigorous, version-controlled evals, or just making architectural decisions based on whether a model nailed a regex on the first try this morning? |
|
It's evident in the fact that every time AI has historically met certain thresholds (chess-playing, the Turing Test, fluent language), we play with them a little more and find out there's something still lacking.