Hacker News
by weinzierl 1072 days ago
GPT uses BPE with a 50k or 100k token vocabulary, from what I understand. Given that a lot of that space is taken up by words and subwords, this is not nearly enough for the CJK character set.
1 comments

There are only 256 possible byte values, and a byte-level BPE vocabulary covers all of them. Any CJK character can be produced by outputting the bytes of its UTF-8 encoding in order. LLMs are already capable of outputting multiple tokens in sequence, since even many words are multiple tokens each.
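
To make the byte argument concrete, here is a rough sketch in Python. It is not GPT's actual tokenizer: the helper names `to_byte_tokens` and `from_byte_tokens` are made up for illustration, and it assumes the worst case where no merges apply, so every UTF-8 byte becomes its own token.

```python
# Sketch only: a hypothetical "pure byte fallback" tokenizer, not GPT's real BPE.
# It shows that 256 byte-valued tokens are enough to spell any CJK text.

def to_byte_tokens(text: str) -> list[int]:
    """Map text to one token id per UTF-8 byte (ids are in 0..255)."""
    return list(text.encode("utf-8"))

def from_byte_tokens(tokens: list[int]) -> str:
    """Reassemble the bytes in order and decode them back to text."""
    return bytes(tokens).decode("utf-8")

text = "日本語"
tokens = to_byte_tokens(text)

print(tokens)                    # each of these characters takes 3 bytes
print(from_byte_tokens(tokens))  # round-trips to the original string
```

A real BPE vocabulary would merge frequent byte sequences into single tokens, so common characters usually cost fewer tokens than this worst case. The point is only that nothing is unrepresentable.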