| HN Mirror



	by yencabulator 7 days ago
	A tokenizer is, roughly, Huffman-coding sequences of input (bytes of English text, etc.) into shorter sequences (lists of tokens) as a performance optimization. As you said, it's not in any way intrinsic to the LLM, though it may be a very necessary optimization on today's hardware.
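
	To make the "shorter sequences" point concrete, here is a minimal sketch of a toy tokenizer. Everything in it is an assumption for illustration: the vocabulary is made up, and greedy longest-match is a stand-in, not how BPE or any production tokenizer actually picks its merges.

```python
# Hypothetical vocabulary, chosen only for this illustration.
VOCAB = ["the ", "token", "izer", " is", " an", " optim", "isation"]


def tokenize(text: str, vocab: list[str]) -> list[str]:
    """Greedy longest-match split; unknown input falls back to single characters."""
    pieces = sorted(vocab, key=len, reverse=True)  # try longest pieces first
    out = []
    i = 0
    while i < len(text):
        for piece in pieces:
            if text.startswith(piece, i):
                out.append(piece)
                i += len(piece)
                break
        else:
            # No vocabulary entry matched here: emit one character.
            out.append(text[i])
            i += 1
    return out


text = "the tokenizer is an optimisation"
tokens = tokenize(text, VOCAB)
n_bytes = len(text.encode("utf-8"))

print(tokens)
print(n_bytes, "bytes ->", len(tokens), "tokens")
```

	The model then runs once per token instead of once per byte, which is where the speedup comes from. The ratio you get from a toy vocabulary like this says nothing about real tokenizers.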


phire 7 days ago

I wouldn't use the word necessary.

IMO, we are probably talking about a 6x slowdown (for typical English). You would need to be absolutely stupid not to implement some kind of optimisation along these lines.

Slower and maybe a little dumber, but it would work.


kgwgk 7 days ago

Not sure about “dumber” - it may be better than SOTA models at identifying which days of the week contain the letter “d”.
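
For reference, the task is trivial once you can see individual characters, which is what a byte-level model gets for free. A quick sketch (the case-insensitive match is my assumption about what the question means):

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Character-level check: does the day name contain the letter "d"?
with_d = [day for day in DAYS if "d" in day.lower()]

print(with_d)
```

Every name ends in "day", so all seven qualify. A model that only sees opaque token IDs has to have memorised the spelling of each token to work that out.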


phire 7 days ago

True, it would be better at some tasks.

My thinking is that for most tasks, a byte-orientated LLM still needs something like the wide "single activation per word" formatting that the tokeniser mostly provides. It will likely waste its first and last few layers implementing a replacement tokeniser (and would probably do a much better job of it). It would also need to decode and encode Unicode at the same time.
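
To illustrate the Unicode part: assuming the byte stream is UTF-8 (my assumption, the usual choice), one character can span one to four bytes, so the model has to learn to group bytes on input and emit valid sequences on output. A small sketch with arbitrarily chosen sample characters:

```python
samples = ["a", "é", "€", "😀"]

byte_lengths = []
for ch in samples:
    raw = ch.encode("utf-8")          # what a byte-level model would see
    byte_lengths.append(len(raw))
    assert raw.decode("utf-8") == ch  # the round trip must be lossless
    print(repr(ch), list(raw))

print(byte_lengths)
```

On the output side, a single wrong byte can make the whole sequence undecodable, which is part of the extra work the model takes on.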

My estimate is that it might lose about 10% of its weights to these new tasks. Your 80B-parameter model becomes as smart as a 72B-parameter model: measurably dumber, but not drastically so.
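
The arithmetic behind that, as a sketch. The 10% figure is only my rough guess from above, not a measurement, and "effective parameters" is a loose notion:

```python
total_params_b = 80   # model size in billions of parameters
overhead_pct = 10     # guessed share of weights spent re-learning tokenisation

# Integer arithmetic avoids floating-point noise in the result.
effective_b = total_params_b * (100 - overhead_pct) // 100

print(f"{total_params_b}B total -> roughly {effective_b}B effective")
```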
