| HN Mirror

by HarHarVeryFunny 515 days ago

You're the one out of your depth ...

LLMs are taught to predict. Once they've seen enough training samples of words being spelled, they'll have learnt that in a spelling context the tokens comprising the word predict the tokens comprising the spelling.

Once they've learnt the letters predicted by each token, they'll be able to do this for any word (i.e. token sequence).

Of course, you could just try it for yourself - ask an LLM to break a non-dictionary nonsense word like "asdpotyg" into a letter sequence.
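A minimal sketch of the argument above, under stated assumptions: the vocabulary, the token IDs, and the `SPELLING` table are all hypothetical. The table stands in for what a model would have to learn from training examples, not for how a transformer actually stores it. It only shows that once each token's letters are known, spelling any token sequence reduces to concatenation.

```python
# Hypothetical toy vocabulary: the model only ever sees the integer IDs.
VOCAB = {0: "straw", 1: "berry", 2: "as", 3: "d", 4: "p", 5: "o", 6: "ty", 7: "g"}

# Stand-in for the learnt association "token ID -> letters it predicts".
# A real model would have to pick this up from spelling examples in its training data.
SPELLING = {token_id: list(text) for token_id, text in VOCAB.items()}


def spell(token_ids):
    """Spell a word given as a token sequence by concatenating per-token letters."""
    letters = []
    for token_id in token_ids:
        letters.extend(SPELLING[token_id])
    return letters


print(" ".join(spell([0, 1])))              # s t r a w b e r r y
print(" ".join(spell([2, 3, 4, 5, 6, 7])))  # a s d p o t y g
```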

2 comments

famouswaffles 514 days ago

Have you seen the Byte-latent Transformer paper?

It does away with sub-word tokenization but is still more or less a transformer (no working memory or internal iteration). The performance gains mostly seem modest (and not uniform; on some benchmarks it's a bit worse) ... until you hit anything to do with character-level manipulation, where it just stomps. 1.1% to 99% on CUTE - Spelling is a particularly egregious example.

I'm not sure what the problem is exactly, but clearly something about sub-word tokenization is giving these models a particularly hard time on these sorts of tasks.

https://arxiv.org/abs/2412.09871

HarHarVeryFunny 514 days ago

The CUTE benchmark is interesting, but the paper doesn't include enough examples of the actual prompts used and model outputs to be able to evaluate the results. Obviously transformers internally manipulate their input at token-level granularity, so to be successful at character-level manipulation they first need to generate the character-level token sequence, THEN do the manipulation. Prompting them to directly output a result without allowing them to first generate the character sequence would therefore guarantee bad performance, so it'd be important to see the details.
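To make the distinction concrete, here is a sketch of the two prompting styles being contrasted. The wording of both prompts and the function names are hypothetical; they are not taken from the CUTE paper, whose exact prompts are the open question here.

```python
def direct_prompt(word):
    """Ask for the manipulated result immediately, with no intermediate spelling."""
    return f'Reverse the letters of "{word}". Answer with the result only.'


def spell_first_prompt(word):
    """Ask for the letter sequence first, then the manipulation on that sequence."""
    return (
        f'Spell "{word}" one letter at a time, separated by spaces. '
        "Then reverse that letter sequence and give the result."
    )


# Print both variants for the same word so they can be compared side by side.
for build in (direct_prompt, spell_first_prompt):
    print(build("strawberry"))
```

Comparing model accuracy on the two variants would show how much of a benchmark gap comes from the prompt format rather than from the tokenization itself.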

https://arxiv.org/pdf/2409.15452

danielmarkbruce 515 days ago

> Once they've learnt the letters predicted by each token, they'll be able to do this for any word (i.e. token sequence).

They often fail at things like this, hence the strawberry example, because they can't break down a token and have no concept of what is inside it. There is a sort of sweet spot where it's really hard (like strawberry). The example you give above is so far from a real word that it gets tokenized into lots of tokens, i.e. it's almost character-level tokenization. There is also the fact that none of the mainstream chat apps blindly shove things into a model. They are almost certainly routing that to a split function.
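A rough illustration of the tokenization point, assuming a made-up vocabulary and greedy longest-match splitting. Real tokenizers such as BPE apply learned merges and will split these words differently, so treat the exact pieces as hypothetical.

```python
# Hypothetical vocabulary: a few multi-letter pieces plus every single letter.
VOCAB = {"straw", "berry", "as", "ty"} | set("abcdefghijklmnopqrstuvwxyz")


def tokenize(word):
    """Greedy longest-match split; a simplification, not real BPE."""
    tokens, i = [], 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking until one matches.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # Reached only when not even the single character is in the vocabulary.
            raise ValueError(f"no token covers {word[i]!r}")
    return tokens


print(tokenize("strawberry"))  # ['straw', 'berry']
print(tokenize("asdpotyg"))    # ['as', 'd', 'p', 'o', 'ty', 'g']
```

The common word collapses into two opaque pieces, while the nonsense word falls back to mostly single letters, which is why it is a much easier spelling test.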

HarHarVeryFunny 515 days ago

You're still not getting it ...

Why would an LLM need to "break down" tokens into letters to do spelling?! That is just not how they work - they work by PREDICTION. If you ask an LLM to break a word into a sequence of letters, it is NOT trying to break it into a sequence of letters - it is trying to do the only thing it was trained to do, which is to predict what tokens (based on the training samples) most likely follow such a request, something that it can easily learn given a few examples in the training set.

danielmarkbruce 515 days ago

The LLM can't, and that's what makes it relatively difficult. The tokenizer can.

Run it through your head with character level tokenization. Imagine the attention calculations. See how easy it would be? See how few samples would be required? It's a trivial thing when the tokenizer breaks everything down to characters.

Consider the amount and specificity of training data required to learn spelling 'games' using current tokenization schemes. Vocabularies of 100,000 plus tokens, many of which are close together in high dimensional space but spelled very differently. Then consider the various data sets which give phonetic information as a method to spell. They'd be tokenized in ways which confuse a model.

Look, maybe go build one. Your head will spin once you start dealing with the various types of training data and how different tokenization changes things. It screws spelling, math, code, technical biology material, financial material. I specifically build models for financial markets and it's an issue.

HarHarVeryFunny 514 days ago

> I specifically build models for financial markets and it's an issue.

Well, as you can verify for yourself, LLMs can spell just fine, even if you choose to believe that they are doing so by black magic or tool use rather than learnt prediction.

So, whatever problems you are having with your financial models aren't because they can't spell.

HarHarVeryFunny 514 days ago

You seem to think that predicting s t -> s t is easier than predicting st (single token) -> s t.

Of all the incredible things that LLMs can do, why do you imagine that something so basic is challenging to them?

In a trillion token training set, how few examples of spelling are you thinking there are?

Given all the specialized data that is deliberately added to training sets to boost performance in specific areas, are you assuming that it might not occur to them to add coverage of token spellings if it were needed?!

Why are you relying on what you believe to be true, rather than just firing up a bunch of models and trying it for yourself?

danielmarkbruce 514 days ago

> You seem to think that predicting s t -> s t is easier than predicting st (single token) -> s t.

Yes, it is significantly easier to train a model to do the first than the second across any real vocabulary. If you don't understand why, maybe go back to basics.

HarHarVeryFunny 514 days ago

No, because it still has to learn what to predict when "spelling" is called for. There's no magic just because the predicted token sequence is the same as the predicting one (+/- any quotes, commas, etc).

And ...

1) If the training data isn't there, it still won't learn it

2) Having to learn that the predictive signal is a multi-token pattern (s t) vs a single token one (st) isn't making things any simpler for the model.

Clearly you've decided to go on personal belief rather than actually testing for yourself, so the conversation is rather pointless.
