by skdotdan 1071 days ago
Not an expert in East Asian languages, but GPT tokenizers are generally byte-based, meaning that the basic unit for the merges is a single byte, not a character.
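To make "merges over bytes" concrete, here is a toy sketch of one byte-level BPE merge step. It is not GPT's actual tokenizer; the helper names `most_frequent_pair` and `merge`, the sample string, and the new token id 256 are all made up for illustration.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids (first one wins on ties)."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get)

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Start from raw UTF-8 bytes, so the base vocabulary is just the values 0..255.
text = "日本日本"
ids = list(text.encode("utf-8"))

# One merge step: the most frequent byte pair becomes a new token (id 256, hypothetical).
pair = most_frequent_pair(ids)
merged = merge(ids, pair, 256)

print(ids)
print(pair, merged)
```

Each of the two characters is three bytes in UTF-8, so the starting sequence is 12 ids long, and a single merge shortens it by folding a repeated byte pair into one new token.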

GPT uses BPE with a 50k or 100k token vocabulary, from what I understand. Given that a lot of that space is taken up by words and subwords, this is not nearly enough to cover the CJK character set.
There are only 256 possible byte values. Any CJK character can be produced by outputting these bytes in the right order. LLMs are capable of outputting multiple tokens in sequence, since even many ordinary words are multiple tokens each.
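The byte-sequence point can be checked with plain Python, assuming UTF-8 as the encoding. The sample string and variable names are arbitrary, and this only shows the encoding side, not how any particular model picks its tokens.

```python
# Two CJK characters, neither of which needs its own vocabulary entry.
text = "漢字"

# Encode to UTF-8: every element is an integer in the range 0..255.
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)

# Emitting the same bytes in the same order reconstructs the characters.
decoded = bytes(byte_tokens).decode("utf-8")
print(decoded)
```

Each character here takes three bytes, so a model with only byte-level tokens would need six steps to emit the string, while learned merges can shorten that for frequent characters.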