by famouswaffles
514 days ago
Have you seen the Byte Latent Transformer paper? It does away with sub-word tokenization but is otherwise still more or less a transformer (no working memory or internal iteration). Overall the performance gains look modest, and they aren't across the board: on some benchmarks it's a bit worse. That holds until you hit anything to do with character-level manipulation, where it just stomps. 1.1% to 99% on CUTE - Spelling is a particularly egregious example. I'm not sure what the problem is exactly, but clearly something about sub-word tokenization is giving these models a particularly hard time on these sorts of tasks. https://arxiv.org/abs/2412.09871
https://arxiv.org/pdf/2409.15452
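
To make the contrast concrete, here's a toy sketch of what a model "sees" under each scheme. The vocabulary and the greedy longest-match tokenizer are made up for illustration (`TOY_VOCAB` and `toy_tokenize` are hypothetical and don't correspond to any real tokenizer, or to how BLT groups bytes into patches). It only shows that sub-word IDs hide the letters while raw bytes expose them; it doesn't prove that this is the cause of the gap.

```python
# Toy illustration only: TOY_VOCAB and toy_tokenize are hypothetical,
# not the vocabulary or algorithm of any real tokenizer.
TOY_VOCAB = {"straw": 0, "berry": 1, "s": 2, "t": 3, "r": 4,
             "a": 5, "w": 6, "b": 7, "e": 8, "y": 9}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation of text into sub-word IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


def byte_ids(text):
    """Byte-level view: one integer per UTF-8 byte."""
    return list(text.encode("utf-8"))


word = "strawberry"
subword = toy_tokenize(word)  # two opaque IDs, letters not visible
raw = byte_ids(word)          # one ID per character for ASCII text

print(subword)
print(raw)
print(raw.count(ord("r")))    # counting a letter is a plain lookup on bytes
```

With sub-word IDs, the model has to have memorized which letters sit inside each token to answer a spelling or counting question. With bytes, that information is right there in the input sequence.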