|
|
|
|
|
by HarHarVeryFunny
757 days ago
|
|
It seems sub-word tokenization vs using character inputs is just a trade off to gain computational efficiency, and obviously isn't how our brain works. We're not born with a fixed visual tokenization scheme - we learn to create our own groupings and object representations. However, transformers seem to struggle a bit with accurately manipulating sequences, so going to character inputs and hoping for those to be aggregated into words/numbers/etc might cause more problems than it solves? I have to wonder if these models would not be better off learning whole-word embeddings rather than tokens. You'd have thought they would learn embeddings that encode any useful relatedness (e.g. corresponding to common prefixes) between words. Perhaps numbers would be better off input as a sequence of individual digit embeddings. |
|