


	by Havoc 757 days ago
	For things like this, where we have computationally cheap, well-understood, reliable tools available (i.e. a calculator), it seems better to train the model in tool use. I guess the techniques could perhaps be generalized, though?

3 comments

mike_hearn 757 days ago

Generalizable techniques are mostly the point of papers like this one, yes. What they show here is that apparently fundamental problems with transformer reasoning can be fixed by encoding data in a more sophisticated manner. This is exciting. I've been thinking for a long time that tokenization schemes are low-hanging fruit for improving coding LLM performance; this isn't exactly the same thing, but it's in the same general area. Smartness and reasoning ability with the current set of algorithmic techniques seem to have topped out around GPT-4 level, which implies that further leaps in mental abilities must come from improving things other than training set size.

For example, whilst replacing the need for a calculator isn't very important, one obvious research direction would be to explore adding extra embeddings to code inputs, perhaps ones computed by an IDE.


HarHarVeryFunny 757 days ago

It seems sub-word tokenization vs. character inputs is just a trade-off to gain computational efficiency, and obviously isn't how our brain works. We're not born with a fixed visual tokenization scheme; we learn to create our own groupings and object representations.

However, transformers seem to struggle a bit with accurately manipulating sequences, so going to character inputs and hoping for those to be aggregated into words/numbers/etc might cause more problems than it solves?

I have to wonder if these models would not be better off learning whole-word embeddings rather than tokens. You'd have thought they would learn embeddings that encode any useful relatedness (e.g. corresponding to common prefixes) between words. Perhaps numbers would be better off input as a sequence of individual digit embeddings.


mike_hearn 756 days ago

Yeah, a tiny vocab of characters doesn't work that well; it was tried very early on, and creating large vocabs of tokens was a big improvement. Which makes sense. A lot of tokens are full words, so the token->embedding phase can quickly look up an embedding in vector space that contains a lot of meaning, whereas an embedding of 'z' or whatever is going to be meaningless.


HarHarVeryFunny 755 days ago

I guess this extends to numbers split across multiple tokens too (especially in the somewhat odd way the OpenAI tokenizer does it). The model has to work really hard to learn what a given sequence of number chunks means (e.g. chunks '123' '45' vs '123' '4'). It somehow needs to realize that the embedding for '4' represents a single-digit number, but the embedding for '45' represents a two-digit number, and that this correspondingly changes the meaning of the preceding '123' token!
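
To make that context dependence concrete, here is a minimal sketch that works out which power of 10 each chunk is scaled by. The helper names `chunk_weights` and `value_of` are made up for illustration and have nothing to do with any real tokenizer; the only point is that the weight of '123' changes with the length of whatever follows it.

```python
def chunk_weights(chunks):
    """Pair each digit chunk with the power of 10 it is scaled by
    inside the full number (hypothetical helper, illustration only)."""
    weights = []
    digits_after = sum(len(c) for c in chunks)
    for chunk in chunks:
        # Digits remaining to the right of this chunk decide its scale.
        digits_after -= len(chunk)
        weights.append((chunk, 10 ** digits_after))
    return weights


def value_of(chunks):
    """Reassemble the integer that the chunks spell out."""
    return sum(int(chunk) * mult for chunk, mult in chunk_weights(chunks))


print(chunk_weights(["123", "45"]))  # '123' is scaled by 100
print(chunk_weights(["123", "4"]))   # '123' is scaled by 10
```

The same '123' token contributes 12300 in the first case and 1230 in the second, which is the ambiguity the model has to resolve from context.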

It would have been easier for the model to grok numbers if, similar to the proposed alternative, 1234 was tokenized as '1000' '200' '30' '4' for powers of 10 up to some reasonable limit (then maybe '1^' '2^' after this reasonable limit). This would let the model easily grok human-sized numbers and need to work harder to grok, say, 20-digit ones, just the same as we do. Some early curriculum training, while not necessary, could then help it to quickly learn which embeddings represent numbers which are d * 10^1 vs d * 10^2, etc.
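
A rough sketch of that place-value scheme, under a couple of assumptions that are not in the comment itself: zero digits are simply dropped, the number is a non-negative integer, and the '1^' '2^' overflow notation is left out because it isn't specified in enough detail. `place_value_tokens` is a hypothetical name.

```python
def place_value_tokens(n):
    """Split a non-negative integer into one token per non-zero digit,
    each token spelling out its own power of 10 (1234 -> 1000, 200, 30, 4).
    Assumption: zero digits produce no token at all."""
    digits = str(n)
    tokens = []
    for i, d in enumerate(digits):
        if d != "0":
            power = len(digits) - 1 - i
            tokens.append(d + "0" * power)
    # The number 0 itself would otherwise yield an empty list.
    return tokens or ["0"]


print(place_value_tokens(1234))  # ['1000', '200', '30', '4']
print(place_value_tokens(1204))  # ['1000', '200', '4']
```

With this layout every token means the same quantity wherever it appears, at the cost of nine vocabulary entries per power of 10.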


mike_hearn 754 days ago

That's sort of what this paper is doing. They add positional embeddings so the model can understand the positions of the digits inside the numbers better.
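
As a rough illustration of what digit-position information could look like, the sketch below tags each character token with its index inside the run of digits it belongs to. This is an assumption made for illustration, not the paper's actual implementation: the name `digit_position_ids`, the left-to-right counting, and the use of 0 for non-digit tokens are all made up here. In a model, ids like these could index a second embedding table whose vectors are added to the token embeddings.

```python
def digit_position_ids(tokens):
    """For single-character tokens, return each token's 1-based position
    within its run of consecutive digits, or 0 for non-digit tokens.
    Hypothetical helper; the numbering convention is an assumption."""
    ids = []
    pos = 0
    for tok in tokens:
        if len(tok) == 1 and tok in "0123456789":
            pos += 1   # next digit of the current number
        else:
            pos = 0    # any other token ends the current number
        ids.append(pos)
    return ids


tokens = list("12+345=")
print(list(zip(tokens, digit_position_ids(tokens))))
```

For the input above the ids come out as 1, 2, 0, 1, 2, 3, 0, so the first digit of each number always gets the same id no matter where the number sits in the sequence.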


0-_-0 757 days ago

To me, this finding shows how transformers don't generalise, since they need specialised embeddings to handle a problem.


HarHarVeryFunny 757 days ago

I think this is more a matter of how numbers are input and lack of specific training, including visual training.

For example, the number 12,345,678 is input to ChatGPT as the three tokens "123" "456" "78", which isn't the best place to start to learn that this is an 8-digit number with specific digit positions!

https://platform.openai.com/tokenizer
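
That split can be mimicked with a toy chunker that cuts a digit string into groups of up to three, left to right. This is a stand-in written for illustration, not the real OpenAI tokenizer, whose splits depend on its learned vocabulary; `chunk_digits` is a hypothetical helper.

```python
def chunk_digits(number, size=3):
    """Cut a digit string into left-to-right groups of `size` digits.
    Thousands separators are stripped first (toy model, illustration only)."""
    digits = str(number).replace(",", "")
    return [digits[i:i + size] for i in range(0, len(digits), size)]


print(chunk_digits("12,345,678"))  # ['123', '456', '78']
print(chunk_digits("1,234,567"))   # ['123', '456', '7']
```

Because the grouping starts from the left, the chunks do not line up with the thousands groups (12 | 345 | 678), so no chunk has a fixed place value.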

As a human child you learn about numbers largely visually: by pointing to units, tens, hundreds, etc., and by visually aligning them to add. Maybe a multi-modal model, if it was visually trained on chalkboard primary school math, would do better at learning the concept of position-based powers of 10.


Havoc 757 days ago

I'd say the key point here isn't that they "need" specialised embeddings, but rather that the embeddings improve things and the model can somewhat manage without them.

That's a far more surmountable problem. Maybe you need one model for biology and another for coding, etc., i.e. a broad split by domain. That's still weak AI, not truly general in the AGI sense, but it still seems like a good next step.


int_19h 756 days ago

The fact that transformers generalize is kinda evident from the fact that they can solve novel puzzles.


verticalscaler 757 days ago

Creating the universe in 100 lines of code is the ultimate code golf and we have all been nerd sniped.
