|
|
|
|
|
by the_snooze
67 days ago
|
|
Two things can be true at the same time: The technology has improved, and the technology in its current state still isn't fit for purpose. I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks: sports trivia, fixing recipes, explaining board game rules, etc. It works well like 95% of the time. That's fine for inconsequential things. But you'd have to be deeply irresponsible to accept that kind of error rate on things that actually matter. The most intellectually honest way to evaluate these things is how they behave now on real tasks. Not with some unfalsifiable appeal to the future of "oh, they'll fix it." |
|