Hacker News
by MountDoom 241 days ago
I remember people making the exact same argument about asking LLMs math questions back when they couldn't figure out the answer to 18 times 7. "They are text token predictors, they don't understand numbers, can we put this nonsense to rest."

The whole point of LLMs is that they do more than we suspected they could. And there is value in making them capable of handling a wider selection of tasks. When an LLM started to count the number of "r"s in "strawberry", OpenAI was taking a victory lap.

2 comments

They're better at maths now, but you still shouldn't ask them maths questions. Same as spelling - whether they improve or not doesn't matter if you want a specific, precise answer - it's the wrong tool and the better it does, the bigger the trap of it failing unexpectedly.

> When an LLM started to count the number of "r"s in "strawberry", OpenAI was taking a victory lap.

Were they? Or did they feel icky about spending way too much post-training time on such a specific and uninteresting skill?

It's not as specific of a skill as you would think. Being both aware of tokenizer limitations and capable of working around them is occasionally useful for real tasks.
What tasks would those be, that wouldn't be better served by using e.g. a Python script as a tool, possibly just as a component of the complete solution?
Off the top of my head: the user wants an LLM to help him solve a word puzzle. Think something a bit like Wordle, but less represented in its dataset.

For that, the LLM needs to be able to compare words character by character reliably. And to do that, it needs at least one of: be able to fully resolve the tokens to characters internally within one pass, know to emit the candidate words in a "1 character = 1 token" fashion and then compare that, or know that it should defer to tool calls and do that.

An LLM trained for better tokenization-awareness would be able to do that. The one that wasn't could fall into weird non-humanlike failures.
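
To make "compare words character by character" concrete, here is a minimal Python sketch of what the workaround amounts to. The helper names `spell_out` and `matching_positions` are made up for illustration, and the assumption that a space-separated spelling lands closer to one character per token depends on the tokenizer in question.

```python
def spell_out(word: str) -> str:
    """Return the word with a space between every character.

    Assumption: writing a word this way pushes most tokenizers toward
    roughly one character per token; this is not guaranteed.
    """
    return " ".join(word)


def matching_positions(a: str, b: str) -> list[int]:
    """Return the indices at which both words have the same character."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x == y]


print(spell_out("crane"))
print(matching_positions("crane", "crate"))
```

This is the comparison the model has to get right, whether it does it internally, by spelling the words out in its own output, or by handing it off to a tool.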

Surely there are algorithms that solve Wordles, and many other word puzzles, more effectively than LLMs? LLMs could still be in the loop for generating words: the LLM proposes words, a deterministic algorithm scores them according to the rules of the puzzle, or even augments the list by searching the adjacent word space; then at some point the LLM submits the guess.

Given that Wordle words are real words, I think this kind of loop could fare pretty well.
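
A rough Python sketch of the deterministic half of that loop, assuming standard Wordle-style feedback rules. `score_guess` and `filter_candidates` are hypothetical names, and the candidate list stands in for whatever words the LLM proposes.

```python
from collections import Counter


def score_guess(guess: str, answer: str) -> str:
    """Wordle-style feedback, one symbol per letter of the guess.

    G = right letter, right place; Y = letter occurs elsewhere; . = absent.
    """
    result = ["."] * len(guess)
    remaining = Counter()  # answer letters not matched exactly
    for i, (g, a) in enumerate(zip(guess, answer)):
        if g == a:
            result[i] = "G"
        else:
            remaining[a] += 1
    # Second pass: hand out Y marks, respecting duplicate letter counts.
    for i, g in enumerate(guess):
        if result[i] == "." and remaining[g] > 0:
            result[i] = "Y"
            remaining[g] -= 1
    return "".join(result)


def filter_candidates(candidates: list[str], guess: str, feedback: str) -> list[str]:
    """Keep only words that would have produced the observed feedback."""
    return [
        w for w in candidates
        if len(w) == len(guess) and score_guess(guess, w) == feedback
    ]


# Stand-in for LLM-proposed words; the puzzle answered "GGG.G" to "crane".
proposed = ["crate", "trace", "brake"]
print(filter_candidates(proposed, "crane", "GGG.G"))
```

The LLM never has to count letters here; it only proposes words and reads back which ones survive the filter.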

Your mistake is thinking that the user wants an algorithm that solves Wordles efficiently. Or that making and invoking a tool is always a more efficient solution.

As opposed to: the user is a 9 year old girl, and she has this puzzle in a smartphone game, and she can't figure out the answer, and the mom is busy, so she asks the AI, because the AI is never busy.

Now, for a single vaguely Wordle-like puzzle, how many tokens would it take to write and invoke a solver, and how many to just solve it - working around the tokenizer if necessary?

If you had a batch of 9000 puzzle questions, I can easily believe that writing and running a purpose-specific solver would be more compute efficient. But if we're dealing with 1 puzzle question, and we're already invoking an LLM to interpret the natural language instructions for it? Nah.