| HN Mirror



	by belladoreai 1111 days ago
	You are almost correct, though it happens at the byte level, not the character level. Most characters are in the LLaMA tokenizer's vocabulary, but not all of them are. So if you use a character that was uncommon in the training material, the tokenizer falls back to byte-level tokens. A character in the ASCII range takes 1 byte in UTF-8 (and thus 1 byte-level token). However, some characters require more than 1 byte in UTF-8; those characters might end up with as many as 4 tokens.
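To make the byte counts concrete, here is a minimal Python sketch of what byte-level fallback would produce for a character that is missing from the vocabulary. The helper name `byte_fallback_tokens` is hypothetical, and the `<0xNN>` token spelling is an assumption based on SentencePiece-style byte fallback; this does not call the real LLaMA tokenizer.

```python
def byte_fallback_tokens(ch: str) -> list[str]:
    """Hypothetical helper: emit one byte-level token per UTF-8 byte of `ch`.

    Assumes the tokenizer spells byte tokens as <0xNN>; this is an
    illustration, not the actual LLaMA tokenizer implementation.
    """
    return [f"<0x{b:02X}>" for b in ch.encode("utf-8")]


# UTF-8 uses 1 to 4 bytes per code point, so the fallback costs 1 to 4 tokens.
for ch in ["a", "é", "€", "𝄞"]:
    tokens = byte_fallback_tokens(ch)
    print(repr(ch), len(tokens), tokens)
```

A real tokenizer would only take this path for characters absent from its vocabulary; the sketch applies it unconditionally to show the 1-to-4 range.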

1 comment

Majromax 1110 days ago

> However, some characters require more than 1 byte in UTF-8; those characters might end up with as many as 4 tokens.

This would seem to raise an interesting "prompt golf" challenge: find a reasonable-sounding prompt that causes the language model to generate invalid UTF-8 in its output.
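For a sense of what "invalid UTF-8" would look like here, the sketch below checks byte sequences that a model could emit if it stopped partway through a multi-byte character, or started with a continuation byte. The helper `is_valid_utf8` is a hypothetical name, and the example only uses Python's built-in decoder, not any model output.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


euro = "€".encode("utf-8")  # three bytes: E2 82 AC

print(is_valid_utf8(euro))      # complete three-byte sequence
print(is_valid_utf8(euro[:2]))  # truncated: the last byte is missing
print(is_valid_utf8(euro[1:]))  # starts with a continuation byte
```

Only the complete sequence is valid; either of the other two, emitted as byte-level tokens, would yield undecodable output.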
