Hacker News new | ask | show | jobs
by danielmarkbruce 515 days ago
The LLM can't, thats what makes it relatively difficult. The tokenizer can.

Run it through your head with character level tokenization. Imagine the attention calculations. See how easy it would be? See how few samples would be required? It's a trivial thing when the tokenizer breaks everything down to characters.

Consider the amount and specificity of training data required to learn spelling 'games' using current tokenization schemes. Vocabularies of 100,000 plus tokens, many of which are close together in high dimensional space but spelled very differently. Then consider the various data sets which give phonetic information as a method to spell. They'd be tokenized in ways which confuse a model.

Look, maybe go build one. Your head will spin once you start dealing with the various types of training data and how different tokenization changes things. It screws spelling, math, code, technical biology material, financial material. I specifically build models for financial markets and it's an issue.

2 comments

> I specifically build models for financial markets and it's an issue.

Well, as you can verify for yourself, LLMs can spell just fine, even if you choose to believe that they are doing so by black magic or tool use rather than learnt prediction.

So, whatever problems you are having with your financial models isn't because they can't spell.

You seem to think that predicting s t -> s t is easier than predicting st (single token) -> s t.

Of all the incredible things that LLMs can do, why do you imagine that something so basic is challenging to them?

In a trillion token training set, how few examples of spelling are you thinking there are?

Given all the specialized data that is deliberately added to training sets to boost performance in specific areas, are you assuming that it might not occur to them to add coverage of token spellings if it was needed ?!

Why are you relying on what you believe to be true, rather than just firing up a bunch of models and trying it for yourself ?

> You seem to think that predicting s t -> s t is easier than predicting st (single token) -> s t.

Yes, it is significantly easier to train a model to do the first than the second across any real vocabulary. If you don't understand why, maybe go back to basics.

No, because it still has to learn what to predict when "spelling" is called for. There's no magic just because the predicted token sequence is the same as the predicting one (+/- any quotes, commas, etc).

And ...

1) If the training data isn't there, it still won't learn it

2) Having to learn that the predictive signal is a multi-token pattern (s t) vs a single token one (st) isn't making things any simpler for the model.

Clearly you've decided to go based on personal belief rather that actually testing for yourself, so the conversation is rather pointless.

Go try it. I've done it.

You are going to find for 1) with character level tokenization you don't need to have data for every token for it to learn. For current tokenization schemes you do, and it still goes haywire from time to time when tokens which are close in space are spelled very differently.

Just try it, actually training one yourself.

I don't doubt that training an LLM, and curating a training set, is a black art. Conventional wisdom was that up until a few years ago there were only a few dozen people in the world who knew all the tricks.

However, that is not what we were discussing.

You keep flip flopping on how you think these successfully trained frontier models are working and managing to predict the character level sequences represented by multi-character tokens ... one minute you say it's due to having learnt from an onerous amount of data, and the next you say they must be using a split function (if that's the silver bullet, then why are you not using one yourself, I wonder).

Near the top of this thread you opined that failure to count r's in strawberry is "Because they can't break down a token or have any concept of it". It's a bit like saying that birds can't fly because they don't know how to apply Bernoulli's principle. Wrong conclusion, irrelevant logic. At least now you seem to have progressed to (on occasion) admitting that they may learn to predict token -> character sequences given enough data.

If I happen into a few million dollars of spare cash, maybe I will try to train a frontier model, but frankly it seems a bit of an expensive way to verify that if done correctly it'd be able to spell "strawberry", even if using a penny-pinching tokenization scheme.

Nope, the right analogy is: "it's like saying a model will find it difficult to tell you what's inside a box because it can't see inside it". Shaking it, weighing it, measuring if it produces some magnetic field or whatever is what LLMs are currently doing, and often well.

The discussion was around the difficulty of doing it with current tokenization schemes v character level. No one said it was impossible. It's possible to train an LLM to do arithmetic with decent sized numbers - it's difficult to do it well.

You don't need to spend more than a few hundred dollars to train a model to figure something like this out. In fact, you don't need to spend any money at all. If you are willing to step through small model layer by layer, it obvious.