by mordymoop 1596 days ago
When you toss “2241 + 19873 =” into an applet that shows you the default tokenization scheme GPT-3 uses, you get this:

(224)(1)( +)( 198)(73)( =)

I’ve heard it remarked before that, while tokenization is obviously an unavoidable part of a model with an architecture like GPT, this is a very silly way of tokenizing number strings for the purposes of learning or doing arithmetic. Indeed, I think a lot of GPT-3’s puzzling edge-case performance can be ascribed to weird and unhelpful tokenizations. Just imagine if you were forced to learn arithmetic with a brain that automatically categorized “224” as a sort of distinct object, or, for that matter, broke 19873 down as ( 198)(73) rather than (19873) or (1)(9)(8)(7)(3) or anything practically useful.
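To make the split above concrete, here is a minimal sketch that reproduces it with a toy greedy longest-match tokenizer. This is not GPT-3's actual tokenizer (which uses byte-pair encoding with learned merges); `TOY_VOCAB`, `toy_tokenize`, and `digit_tokenize` are hypothetical names, and the vocabulary is hand-picked purely so the output mirrors the example.

```python
# Toy illustration only: a hand-picked vocabulary and greedy longest-match
# splitting. The real GPT-3 tokenizer is byte-pair encoding, not this.
TOY_VOCAB = {"224", "1", " +", " 198", "73", " ="}  # assumed, not the real vocabulary


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text by repeatedly taking the longest vocabulary entry that matches."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until one is in the vocabulary.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing matched: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens


def digit_tokenize(text):
    """Alternative scheme: every character is its own token."""
    return list(text)


print(toy_tokenize("2241 + 19873 ="))
print(digit_tokenize("19873"))
```

With this toy vocabulary the first call yields the same six pieces as the applet output, while the per-character scheme gives the (1)(9)(8)(7)(3) split.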

The thing is that we can, in a sense, learn better “tokenizations”: a 4-year-old learning to read sees letters, while a 40-year-old reading a novel “sees” whole words or even groups of words. The GPT architecture can’t change its tokenization scheme.

5 comments

When I do mental arithmetic my brain frequently tokenizes into digit pairs or triples if I can recognize pairs and triples that have specific properties.

"224" is actually a really nice object to recognize because it's 7 * 32, and if you can recognize other multiples of 32 it frequently gives you shortcuts. It's less useful for addition because you would need to get lucky and get a multiple of 32 (or 7) on both sides, but for multiplication and division it helps a lot.

Sure - I think we all learn tricks like that. But you learned that pattern of tokenization, it wasn't arbitrarily foisted on you.

What GPTs have to deal with is more like, you are fed an arithmetic problem via colored slips of paper, and you just have to remember that this particular shade of chartreuse means "224", which you happen to have memorized equals 7 * 32, etc., but then the next slip of paper is off-white which means "1", and now you have to mentally shift everything ...

The tokens in most GPT models are small like this, but they still 'learn tokenization' in a way very similar to what you just mentioned. It's part of the multi-head attention.

It learns what level of detail in the tokenization is needed for a given task. For example, if you're not interested in parsing the problem to actually do the computation, you don't pay attention to the finer tokenization. If you do need that level of detail, you use those finer groupings. Some of the difficulty a few years ago was in extending these models to handle longer contexts (or variable-length contexts that can grow very long), but that seems close to solved now too. So you're not exactly giving much insight with this observation.

I think that part of why the tokenization is a problem for math here is that it doesn't seem to be carrying overflow into the token to the left. Anyway, I haven't worked with GPT in enough detail to back that hunch with a deeper analysis, so take my comment with a couple of grains of salt.
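A rough sketch of why carrying across tokens is awkward with this split: each token ends at a different power of ten, so the chunks of the two operands don't line up column by column. The helper names `token_place_values` and `value_from_tokens` are made up for illustration, and this says nothing about what GPT does internally.

```python
def token_place_values(tokens):
    """Pair each digit token with the power of ten of its last digit."""
    pairs = []
    remaining = sum(len(t) for t in tokens)
    for t in tokens:
        remaining -= len(t)  # digits still to the right of this token
        pairs.append((t, remaining))
    return pairs


def value_from_tokens(tokens):
    """Rebuild the integer from its tokens and their place values."""
    return sum(int(t) * 10 ** e for t, e in token_place_values(tokens))


left = ["224", "1"]    # 2241 as split in the post
right = ["198", "73"]  # 19873 as split in the post (leading space dropped)

print(token_place_values(left))   # "224" ends at the tens place
print(token_place_values(right))  # "198" ends at the hundreds place
print(value_from_tokens(left) + value_from_tokens(right))
```

Here "224" has to be read as 224 tens while "198" has to be read as 198 hundreds, so adding the numbers token by token means re-slicing the digits before any carry can move to the left.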
Maybe this is a clue to which ones it succeeds on, and how it goes wrong when it does not.
Whoa, that explains why only 0.5% of the examples have an incorrect last digit.
It seems that we need another layer to tokenize according to context. I can see that breaking up a long number into 3 or 4 digits is the correct behaviour if we are dealing with phone numbers, but it'd be completely wrong if it's nearly anything else.
Many words have several semantic definitions depending on context. This is why the word "is" is a very good token to have in a vocabulary (as an example), since it can mean so much depending on what tokens came before and after it.

Numbers have very limited semantic value. "123816" only means that number, and it's used very rarely in comparison to basically any other word (and the higher the number, the less chance of use, statistically speaking).

So the question becomes: to what extent do you expand the vocabulary using only numbers? "1", "2", "3", ... "1000000" would probably be a huge waste of words in an AI vocabulary (about a million input nodes), yet still not very impressive arithmetically even with a 100% correct calculation rate. In comparison, a hand calculator from 30 years ago could do this with ease. It's not a question of being able to cleverly tokenize.

Calculations like this are an inherent flaw of vocabulary-based AI until the semantic meaning of number sequences is somehow taught to it. Basically it needs to understand that "12" and "1" + "2" have the same contextual meaning, something which is very rarely explained in anything but a 7-year-old's schoolbooks. The problem is the dataset.

Gwern noted that adding things like thousands separators and $ signs to the input makes GPT significantly better at math.
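For anyone who wants to try that, a minimal sketch of formatting a prompt with thousands separators using Python's built-in format spec. The function name `format_prompt` and the exact prompt layout are assumptions; the thread doesn't say what format Gwern actually used.

```python
def format_prompt(a, b, currency=False):
    """Build an addition prompt with thousands separators (layout is assumed)."""
    prefix = "$" if currency else ""
    # The "," format specifier inserts a comma every three digits.
    return f"{prefix}{a:,} + {prefix}{b:,} ="


print(format_prompt(2241, 19873))
print(format_prompt(2241, 19873, currency=True))
```

The separators force digit groups to break at fixed three-digit boundaries, which may be why the resulting tokens line up better, but that is a guess.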