by sillysaurusx
1168 days ago
Is that a 10% smaller vocab size, or 10% fewer tokens on average? I assume the latter, but total vocab size is also an interesting metric. The tokenization speedups in that repo are very impressive. Tokenization was the most annoying part of processing 190,000 books; I think it took a few days on a server with 96 cores. It's surprisingly hard to figure out the vocab size from that repo.
The vocab size itself is doubled. (~50k for GPT-2/3, ~100k for ChatGPT)
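To make the distinction between the two metrics concrete, here is a toy sketch in plain Python. It is not the tokenizer from that repo: the greedy longest-match scheme, the `tokenize` helper, and both vocabularies are made up for illustration. It only shows that a larger vocab can go hand in hand with fewer tokens for the same text.

```python
def tokenize(text, vocab):
    """Greedy longest-match tokenization (illustrative only).

    Falls back to single characters when no vocab entry matches.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: the large one is a superset of the small one.
small_vocab = {"th", "e ", "to", "ke", "n"}
large_vocab = small_vocab | {"the ", "token", "izer", " is", " fast"}

text = "the tokenizer is fast"
small_tokens = tokenize(text, small_vocab)
large_tokens = tokenize(text, large_vocab)

print("vocab sizes: ", len(small_vocab), len(large_vocab))
print("token counts:", len(small_tokens), len(large_tokens))
```

The larger vocab has twice as many entries here, yet it encodes the same string in fewer tokens, so "vocab size" and "tokens per text" can move in opposite directions.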