by chaxor
1154 days ago
I know we don't have access to the details at OpenAI, but it does seem like the BPE token size has changed significantly over time. There seems to be a push toward much larger tokens than the previous ~3-character tokens (at least judging by behavior).
https://huggingface.co/roberta-base/raw/main/merges.txt
(You have to scroll down a bit to get to the larger merges, and imagine the lines without the spaces, which is what a string would look like after a merge.)
Also see GPT-2:
https://huggingface.co/gpt2/raw/main/merges.txt
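To make "imagine the lines without the spaces" concrete, here is a minimal sketch that joins each merge pair into the piece it produces. The sample lines are only an illustration in the style of the start of such a file (in these byte-level vocabularies `Ġ` stands for a leading space); the function name `merged_pieces` is made up for this example, and you would paste in the real file contents to look at the longer merges further down.

```python
def merged_pieces(merges_text):
    """Return the piece created by each merge rule, in file order."""
    pieces = []
    for line in merges_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "#version: ..." header line.
        if not line or line.startswith("#version"):
            continue
        parts = line.split(" ")
        if len(parts) != 2:
            continue  # not a "left right" merge rule
        left, right = parts
        pieces.append(left + right)  # the string after the merge
    return pieces


# Illustrative sample only, not the full file.
sample = """#version: 0.2
Ġ t
Ġ a
h e
i n
Ġt he
"""

for piece in merged_pieces(sample):
    print(repr(piece), len(piece))
```

Sorting the result by length (or just looking at the tail of the list) shows how long the later merges get.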
I recently ran some statistics. Average number of pieces per token (sampled on fairly large data; all of these models use byte-level BPE, BBPE):
RoBERTa base (English): 1.08
RobBERT (Dutch): 1.21
roberta-base-ca-v2 (Catalan): 1.12
ukr-models/xlm-roberta-base-uk (Ukrainian): 1.68
In all these cases, the median token length in pieces was 1.
(Note: I am not disputing that newer OpenAI models use a larger vocab. I just want to show that older BBPE models didn't use 3-char pieces. Most tokens were a single piece.)
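For anyone who wants to reproduce this kind of number, here is a rough sketch of the pieces-per-token statistic. It assumes a "token" is a whitespace-separated word, which may not match how the figures above were computed, and `toy_tokenize` with its tiny vocab is a made-up stand-in: swap in the real tokenizer's word-to-pieces function to measure an actual model.

```python
from statistics import mean, median


def pieces_per_token(text, tokenize):
    """Mean and median number of subword pieces per whitespace token."""
    counts = [len(tokenize(word)) for word in text.split()]
    if not counts:
        raise ValueError("text contains no tokens")
    return mean(counts), median(counts)


# Hypothetical stand-in tokenizer: known words stay whole,
# everything else falls apart into single characters.
TOY_VOCAB = {"the", "cat", "sat", "on"}


def toy_tokenize(word):
    return [word] if word in TOY_VOCAB else list(word)


avg, med = pieces_per_token("the cat sat on tokenization", toy_tokenize)
print(f"mean={avg:.2f} median={med}")
```

A few rare words that shatter into many pieces pull the mean up while the median stays at 1, which is the same pattern as in the numbers above.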