by dist-epoch 1155 days ago
> The input string is tokenized into a sequence of token indices (integers)

How is this tokenization done? Sometimes a single word can be two tokens. My understanding is that the token indices are also learned, but by whom? The same transformer? Another neural network?

2 comments

Hugging Face has good guides on tokenization and tokenizer training. BPE (used by GPT, for example) and WordPiece (used by BERT, for example) are two commonly used methods: https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt
The tokenization is done by the tokenizer, which can be thought of as just a function that maps strings to sequences of integers before they reach the neural network. Tokenizers can be hand-specified or learned, but in either case this is typically done separately from training the model. Training a new tokenizer is also needed less often than training a model, unless you are dealing with an entirely new input type or language.
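
To make the "just a function" point concrete, here is a rough sketch of a hand-specified tokenizer. The vocabulary and the greedy longest-match rule are my own toy assumptions, not how any particular model does it, but it shows how one word can come out as two tokens:

```python
# Hypothetical toy vocabulary; real ones have tens of thousands of entries.
VOCAB = {"token": 0, "ization": 1, "is": 2, "fun": 3, " ": 4, "[UNK]": 5}

def tokenize(text: str) -> list[int]:
    """Map a string to token indices by greedy longest-match lookup."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest substring starting at i first, then shorter ones.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            # Nothing in the vocabulary matches here: emit the unknown token.
            ids.append(VOCAB["[UNK]"])
            i += 1
    return ids

print(tokenize("tokenization is fun"))  # [0, 1, 4, 2, 4, 3]
```

"tokenization" is not in the vocabulary, so it is split into "token" + "ization". The model only ever sees the integers.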

Tokenizers can be quite gnarly internally. https://huggingface.co/learn/nlp-course/chapter6/5?fw=pt is a good resource on BPE tokenization.
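
For a feel of the "learned" part, here is a minimal sketch of the core BPE training loop as I understand it: repeatedly count adjacent symbol pairs and merge the most frequent one. The function name `train_bpe` and the word-list input are illustrative assumptions; real implementations add pre-tokenization, byte-level handling, and special tokens on top.

```python
from collections import Counter

def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from a list of words."""
    # Each distinct word starts as a tuple of single characters, with its count.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

print(train_bpe(["ab", "ab", "ab", "cb"], 5))  # [('a', 'b'), ('c', 'b')]
```

No gradient descent is involved here. The merge list is learned from corpus statistics alone, which is why this step is separate from training the transformer.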