| HN Mirror



	by skdotdan 1119 days ago
	Not an expert in East Asian languages, but GPT tokenizers are generally byte-based, meaning that the basic unit for the merges is a single byte, not a character.
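
	To illustrate what "byte-based" means here, a toy sketch of one BPE merge step over raw bytes. This is not the actual GPT tokenizer: the helper names `most_frequent_pair` and `merge`, the sample string, and the choice of 256 as the first new token id are all assumptions for illustration, and UTF-8 is assumed as the text encoding.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Return the adjacent pair of token ids that occurs most often."""
    pairs = Counter(zip(ids, ids[1:]))
    return pairs.most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

# Hypothetical sample text; the base units are bytes, not characters.
text = "日本日本語"
ids = list(text.encode("utf-8"))   # 5 characters become 15 byte-level ids

pair = most_frequent_pair(ids)     # a pair of byte values, each below 256
merged = merge(ids, pair, 256)     # 256 is the first id outside the byte range

print(len(ids), len(merged))
```

	Each merge adds one entry to the vocabulary on top of the 256 base bytes, so frequent byte sequences (including those that make up common CJK characters) can end up as single tokens, while rare ones stay split into several.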

1 comments

weinzierl 1119 days ago

GPT uses BPE with a 50k or 100k token vocabulary, from what I understand. Given that much of that space is taken up by words and subwords, this is not nearly enough for the CJK character set.


LoganDark 1119 days ago

There are only 256 possible byte values. CJK characters can be produced by outputting these bytes in a certain order. LLMs are already capable of outputting multiple tokens in sequence, since even many words are multiple tokens each.
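
A quick sketch of that point, assuming the text is encoded as UTF-8 (the sample string is just an example, and this shows the worst case where every byte is its own token):

```python
# Assumption: UTF-8 encoding; "漢字" is an arbitrary two-character sample.
text = "漢字"

# Each character becomes three bytes, and every byte is a value in 0..255.
byte_tokens = list(text.encode("utf-8"))
print(byte_tokens)

# Emitting the same bytes in the same order reproduces the characters.
restored = bytes(byte_tokens).decode("utf-8")
print(restored)
```

So a vocabulary that contains all 256 byte values can represent any CJK character, at the cost of more tokens per character when no merged token covers it.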
