| HN Mirror



	by weinzierl 1119 days ago
	GPT uses BPE with a 50k or 100k token vocabulary, from what I understand. Given that a lot of that space is taken up by words and subwords, this is not nearly enough for the CJK character set.

1 comment

LoganDark 1119 days ago

There are only 256 possible byte values. Any CJK character can be produced by outputting the bytes of its encoding (e.g. UTF-8) in the right order. LLMs are capable of outputting multiple tokens in order; even many ordinary words are multiple tokens each.
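
To make the byte argument concrete, here is a rough sketch in Python. It is not GPT's actual tokenizer: it skips the BPE merges entirely and just treats every UTF-8 byte as its own token, which is the worst case a byte-level vocabulary falls back to. The function names `to_byte_tokens` and `from_byte_tokens` are made up for illustration.

```python
def to_byte_tokens(text: str) -> list[int]:
    # Treat each UTF-8 byte as one token ID in the range 0..255.
    # (Assumption: a byte-level vocabulary with no merges applied.)
    return list(text.encode("utf-8"))


def from_byte_tokens(tokens: list[int]) -> str:
    # Reassemble the bytes in order and decode them back into text.
    return bytes(tokens).decode("utf-8")


text = "漢字"
tokens = to_byte_tokens(text)

# Each of these characters takes three bytes in UTF-8,
# so two characters become six byte-level tokens.
print(tokens)
print(len(tokens))

# Emitting the same tokens in the same order recovers the characters.
print(from_byte_tokens(tokens))
```

So the 256 base tokens are enough to cover every character; a larger vocabulary only changes how many tokens a given character costs, not whether it can be represented.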
