Hacker News
by kevmo314 519 days ago
Yeah, I've tried that approach. The model ends up needing to learn every combination of tokens. For example, the word "apple" now has six byte positions it can be split on, and the model suddenly needs to learn that all six will yield the same output attention state.

It ends up being O(max token length) times more complex, so you need a proportionally larger model to accommodate it.
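
To make the combinatorics a bit more concrete, here's a rough sketch of the blow-up. It's only an illustration, not anything from a real tokenizer: `segmentations` is a hypothetical helper, and it counts every full way to cut a word into pieces rather than the individual split positions mentioned above, so the numbers aren't directly comparable.

```python
from itertools import combinations


def segmentations(word: str) -> list[list[str]]:
    """Return every way to cut `word` into contiguous, non-empty pieces.

    Hypothetical helper for illustration only; it ignores any vocabulary
    or max-token-length limit a real tokenizer would impose.
    """
    n = len(word)
    results = []
    interior = range(1, n)  # cut points between adjacent characters
    for k in range(n):  # k = number of cuts, from 0 up to n - 1
        for cuts in combinations(interior, k):
            bounds = (0, *cuts, n)
            results.append([word[a:b] for a, b in zip(bounds, bounds[1:])])
    return results


if __name__ == "__main__":
    splits = segmentations("apple")
    print(len(splits))  # number of distinct segmentations
    for pieces in splits[:5]:
        print(pieces)
```

Every one of those segmentations spells the same word, so the model has to map all of them to the same internal representation. That redundancy is the extra capacity cost I was talking about.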

1 comment

Seems like we should just gradually anneal the tokens toward finer-grained, single-character tokens over the course of training, then.
I believe that's similar to the idea behind https://github.com/facebookresearch/blt