by mrfusion 1500 days ago
> Tokens are chunks of characters. For example, the word “alphabet” gets broken up into the tokens “alph” and “abet”.

I didn’t know that. Seems like it would confuse it during training. Anyone able to explain?

5 comments

If I recall correctly, it's similar to how fastText vectors work. For fastText, this means that the representation of a word depends to a certain extent on its morphemes (not really, but bear with me), so rare/inflected words can get a better representation thanks to their similarity with more frequent, similar-looking words (e.g. "unconstitutional" might never appear in the training data, but the system can approximate its meaning by composing that of "un", which it has seen in words such as "unbelievable", with the remaining subtokens, which come from the word "constitutional" that was present in the training set).

Not sure if the same thing happens here, tho
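
To make the fastText analogy above concrete, here is a minimal sketch of character n-gram extraction. The function name `char_ngrams` is made up for illustration, and the `<`/`>` boundary markers and the 3..6 length range are assumed to mirror fastText's usual defaults; this is not fastText's actual code, and it says nothing about how GPT-3 tokenizes.

```python
def char_ngrams(word, min_n=3, max_n=6):
    """Return the set of character n-grams of a word (hypothetical helper).

    The word is wrapped in '<' and '>' so that prefixes and suffixes
    get their own distinct n-grams (assumed fastText-style convention).
    """
    wrapped = f"<{word}>"
    grams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    return grams


# An unseen word shares many n-grams with words that were seen in training.
unseen = char_ngrams("unconstitutional")
shared_with_prefix_word = unseen & char_ngrams("unbelievable")
shared_with_stem_word = unseen & char_ngrams("constitutional")

print(sorted(shared_with_prefix_word))
print(len(shared_with_stem_word), "n-grams shared with 'constitutional'")
```

In a fastText-style model the vector of a word would be built from the vectors of these n-grams, which is why the overlap matters for rare words.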

I believe GPT-3 uses byte pair encoding, which allows it to do tokenization in a language-neutral manner:

https://en.wikipedia.org/wiki/Byte_pair_encoding
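
For anyone wondering what byte pair encoding does mechanically, here is a toy sketch of the merge-learning loop. It works on characters of a tiny made-up corpus, and the helper names (`merge_pair`, `learn_bpe`, `encode`) are hypothetical. The real GPT-2 tokenizer operates on bytes with a pre-tokenization step and a much larger merge table, so treat this only as an illustration of the idea.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    vocab = Counter(tuple(w) for w in words)  # word (as symbol tuple) -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Segment a word by applying the learned merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low", "low", "lower", "newest", "newest", "widest"]  # made-up toy data
merges = learn_bpe(corpus, num_merges=10)
print(encode("lowest", merges))  # an unseen word still gets segmented into known pieces
```

Nothing in the loop knows about any particular language: it only counts adjacent symbols, which is the sense in which the approach is language-neutral.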

Yeah it's BPE. OpenAI has a nice tool that allows you to play with the tokenizer https://beta.openai.com/tokenizer.
Additionally, the tokenizer vocabulary is unchanged from GPT-2.

You can use HuggingFace's GPT-2 tokenizer as well. (some of OpenAI's GPT-3 notebooks do just that).

I thought I read it uses word2vec?
The alternatives are learning at the character level (way more complex, and scales badly in memory/compute), or learning at the whole word level (needs an absurdly massive dictionary of words, and still can’t handle really rare/novel words). Breaking things into a set of subwords that allows you to encode any string solves lots of problems and is the relatively standard way to do things these days.
> The alternatives are learning at the character level (way more complex

No, BPEs are more complex: you have a whole additional layer of preprocessing, with all sorts of strange and counterintuitive downstream effects and brand new ways to screw up (fun quiz question: everyone knows that BPEs use '<|endoftext|>' tokens to denote document breaks; what does the string '<|endoftext|>' encode to?). BPEs are reliably one of the ways that OA API users screw up, especially when trying to work with longer completions or context windows.

But a character is a character.

> and scales badly in memory/compute)

Actually very competitive: https://arxiv.org/abs/2105.13626#google (Especially if you account for all the time and effort and subtle bugs caused by BPEs.)
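
As a rough illustration of what working without a learned tokenizer looks like, here is a sketch that treats raw UTF-8 bytes as the token ids. The helper names are hypothetical, and any real byte-level model would reserve extra ids for special tokens, which this sketch leaves out.

```python
def bytes_encode(text):
    """Map text to a list of integer ids in 0..255, one per UTF-8 byte."""
    return list(text.encode("utf-8"))


def bytes_decode(ids):
    """Invert bytes_encode; raises UnicodeDecodeError on invalid byte sequences."""
    return bytes(ids).decode("utf-8")


ids = bytes_encode("alphabet")
print(ids)                # one id per character for plain ASCII text
print(bytes_decode(ids))  # round-trips to the original string

# Non-ASCII characters take more than one byte, so sequences get longer.
print(len("naïve"), len(bytes_encode("naïve")))
```

There is no vocabulary file or merge table to get wrong here; the trade-off is that sequences are longer than their BPE equivalents.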

Judging from the abstract, it looks like that paper talks about compute tradeoffs, but do they address memory tradeoffs? Because the context length limitations for (standard) transformers are holding them back from a whole host of applications, and memory being quadratic in sequence length seems like a hell of a cost to going from BPE tokens to characters.
You were paying that price to begin with; the BPEs don't magically resolve the quadratic. BPEs only compress by maybe 3x, and the larger the context window, the worse use a Transformer makes of it, so the first 1024 or so characters are the most valuable (part of the problem is that document length drops off drastically in the training corpus). There are also many formulations of Transformer attention which change that quadratic (https://www.gwern.net/notes/Attention).
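
The back-of-the-envelope arithmetic behind this exchange can be sketched as follows. It assumes standard full self-attention (one score per pair of positions) and takes the "maybe 3x" compression figure from the comment above at face value. Heads, layers, and the linear-cost parts of the model are ignored, and the function name is made up.

```python
def attention_score_count(seq_len):
    """Number of pairwise attention scores for one head in one layer of full attention."""
    return seq_len * seq_len


bpe_tokens = 1024        # example context length in BPE tokens
chars_per_token = 3      # assumed compression ratio, per the comment above
char_positions = bpe_tokens * chars_per_token

ratio = attention_score_count(char_positions) / attention_score_count(bpe_tokens)
print(char_positions, ratio)
```

Under these assumptions, covering the same text at the character level costs about 3 squared = 9 times as many attention scores, so the quadratic term is there either way and characters only change the constant.
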
Humans also think about words in terms of subcomponents, languages make heavy use of prefixes and suffixes for example.
This is not the same. The masks are randomized and lossy. That said, there is potential for a transformer specially trained to segment prefixes/affixes/suffixes; it might augment some of its encoding abilities. See e.g. SpanBERT for a related example of the opportunity.
What do you mean with "lossy"? What information is being lost? Or do you just mean that there isn't necessarily a unique way to encode a given string?
I mean that information is being lost. See XLNet for the rhetoric: https://arxiv.org/abs/1906.08237 Or MPNet, which attempts to combine the best of both worlds information-wise but still finds that masked modeling is much less useful than autoregressive: https://www.microsoft.com/en-us/research/publication/mpnet-m...
This is masked token learning, which is used e.g. by BERT. This is obsolete, and alternatives such as XLNet are much superior, but there is too much inertia in the industry and newer large models are still built with the same lossy encoding.