

	by yorwba 528 days ago
	Ideally you'd have a language model that can predict a good continuation after any byte. If an existing model can't do that because it's too reliant on a specific tokenization, you might nonetheless be able to fine-tune it until it can gracefully handle the unexpected tokenizations that result from splitting at a random byte.

1 comments

kevmo314 528 days ago

Such a model will always be less performant than one trained on tokens, since you're effectively switching to one byte per token. Solving this problem in code is much cheaper.


yorwba 528 days ago

I don't mean switching to one byte per token, but switching to training on the token distribution that results from cutting off the input at arbitrary bytes. The bytes per token should be basically unchanged, as only the end gets a bit shorter.
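A minimal sketch of what that sampling could look like, assuming a toy greedy longest-match tokenizer. `VOCAB`, `tokenize`, and `random_cut_example` are hypothetical names made up for illustration, not any real tokenizer's API; an actual fine-tuning setup would use the model's own tokenizer.

```python
import random

# Hypothetical toy vocabulary (assumption); a real tokenizer has far more entries.
VOCAB = [b"apple", b"app", b"le", b" pie", b" p", b"ie", b" "]


def tokenize(data: bytes) -> list[bytes]:
    """Greedy longest-match tokenization with a single-byte fallback."""
    tokens = []
    i = 0
    while i < len(data):
        # Pick the longest vocabulary entry that matches at position i,
        # or fall back to the single byte at i if nothing matches.
        match = max(
            (t for t in VOCAB if data.startswith(t, i)),
            key=len,
            default=data[i:i + 1],
        )
        tokens.append(match)
        i += len(match)
    return tokens


def random_cut_example(text: str, rng: random.Random) -> list[bytes]:
    """Cut the input at a random byte offset, then tokenize the prefix."""
    data = text.encode("utf-8")
    cut = rng.randint(1, len(data))  # inclusive on both ends
    return tokenize(data[:cut])


if __name__ == "__main__":
    print(tokenize(b"apple pie"))
    print(tokenize(b"apple pi"))
    print(random_cut_example("apple pie", random.Random(0)))
```

With this toy vocabulary, `apple pie` tokenizes to `apple`, ` pie`, while the cut `apple pi` tokenizes to `apple`, ` p`, `i`: only the tail of the sequence changes.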


kevmo314 528 days ago

Yeah, I've tried that approach. The model ends up needing to learn every combination of tokens. For example, the word "apple" now has six byte positions it can be split on, and the model suddenly needs to learn that all six will yield the same output attention state.

It ends up being O(max token length) more complex, so you need a proportionally larger model to accommodate it.
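To make the count concrete, here is a small sketch that enumerates the byte offsets at which a word can be cut. `cut_positions` is a hypothetical helper for illustration only; it counts both ends (cutting before the first byte and after the last), which is how a five-byte word gets six positions.

```python
def cut_positions(word: str) -> list[tuple[bytes, bytes]]:
    """Return every (prefix, remainder) pair from cutting at a byte offset."""
    data = word.encode("utf-8")
    # Offsets 0..len(data) inclusive, so a word of n bytes has n + 1 cuts.
    return [(data[:i], data[i:]) for i in range(len(data) + 1)]


if __name__ == "__main__":
    for prefix, rest in cut_positions("apple"):
        print(prefix, rest)
```

A token of n bytes contributes n + 1 such cuts, which is where the factor proportional to the maximum token length comes from.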


pizza 528 days ago

Seems like we should just gradually anneal tokens toward finer-grained single-character tokens over the course of training, then.


kevmo314 528 days ago

I believe that's similar to the idea behind https://github.com/facebookresearch/blt
