by yorwba
522 days ago
I don't mean switching to one byte per token. I mean training on the token distribution you get when the input is cut off at arbitrary bytes. Bytes per token should be basically unchanged, since only the end gets a bit shorter.

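To make "cutting off the input at arbitrary bytes" concrete, here is a minimal sketch. It assumes a toy greedy longest-match tokenizer with single-byte fallback; the vocabulary and the names `VOCAB`, `tokenize`, and `truncated_examples` are hypothetical and not taken from any real tokenizer. A real BPE tokenizer applies merges differently, so treat this only as an illustration of the idea.

```python
# Hypothetical toy vocabulary; any single byte is always allowed as a fallback.
VOCAB = {b"hello", b" world", b"hell", b" wor", b"he", b"ld"}
MAX_TOKEN_LEN = max(len(t) for t in VOCAB)


def tokenize(data: bytes) -> list[bytes]:
    """Greedy longest-match tokenization with single-byte fallback."""
    tokens = []
    i = 0
    while i < len(data):
        # Try the longest candidate first, down to a single byte.
        for length in range(min(MAX_TOKEN_LEN, len(data) - i), 0, -1):
            piece = data[i:i + length]
            if length == 1 or piece in VOCAB:
                tokens.append(piece)
                i += length
                break
    return tokens


def truncated_examples(data: bytes) -> list[list[bytes]]:
    """Tokenize every prefix obtained by cutting the input at each byte."""
    return [tokenize(data[:cut]) for cut in range(1, len(data) + 1)]


if __name__ == "__main__":
    text = b"hello world"
    print("full:", tokenize(text))
    for toks in truncated_examples(text):
        # Only the tail of the token sequence differs from the full tokenization.
        print(toks)
```

In this toy setup, every token that fits entirely before the cut is the same as in the full tokenization; only the tail changes, and it may break into shorter tokens that never appear at that position in untruncated text.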
The resulting distribution is O(max token length) more complex, so you need a proportionally larger model to accommodate it.