by PaulHoule
1148 days ago
If a transformer has a good "place" to assign meanings to, I think it does a pretty good job of (1) discovering similar meanings in synonyms and (2) representing words differently based on context. The latter is a huge advance over word embeddings, which I thought were holding progress back instead of advancing it. You're right that what they are doing is morphological, not semantic, but it helps a lot. I would say that 日本語 ("Japanese Language") is a good token to apply embedding, attention, etc. to because it has a definite meaning to which the transformer can attach whatever syntax and semantics it learns in terms of activations. If BPE gives up and processes it as the UTF-8 bytes e6 97 a5 e6 9c ac e8 aa 9e, there is no clear meaning for any one of those tokens, and the model is going to have to work a lot harder.
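To make the byte fallback concrete, here is a minimal Python sketch using only the standard library. It is not any particular tokenizer's implementation; the variable names and the `decodes_alone` helper are made up for illustration. It just shows what a model sees when each UTF-8 byte of 日本語 becomes its own token.

```python
text = "日本語"  # "Japanese Language"

# Three characters, one Unicode code point each.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# UTF-8 spends three bytes on each of these characters, so a byte-level
# fallback turns one word into nine tokens.
utf8_bytes = text.encode("utf-8")
byte_tokens = [f"{b:02x}" for b in utf8_bytes]


def decodes_alone(b: int) -> bool:
    """Hypothetical helper: is this single byte valid UTF-8 by itself?"""
    try:
        bytes([b]).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# None of the nine bytes is a character on its own, which is why no single
# byte token carries a clear meaning.
standalone = [decodes_alone(b) for b in utf8_bytes]

print(code_points)
print(" ".join(byte_tokens))
print(any(standalone))
```

The byte sequence printed here is the same e6 97 a5 e6 9c ac e8 aa 9e as above: nine tokens, each one a fragment of a character rather than a unit of meaning.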
|
And yes, what they do helps on their two test tasks. I'm not disputing that. My objection is that there's no scholarship here.
There are so many thousands of knobs to twiddle in a model these days, and they went after one that's commonly regarded in the NLP community as the 'defect': the only part of the model that's not end-to-end trained along with the rest. Which would be great, if they acknowledged it! But there's no citation to any tokenization literature beyond BPE or SentencePiece. The literature review is as superficial as what you could find in a blog.
There are certainly byte-level or character-level tokenizers (think about CANINE or ByT5), and we can argue back and forth about their data-hungriness or slow inference. It would be nice to give more helpful units to a Transformer, so it doesn't have to learn syllables (or even characters) all on its own. Rebracketing/incorrect segmentation is a problem! And these authors have clued into that, but so have several hundred (or thousand?) researchers they don't cite.
What I'm having trouble with is the notion that this paper uncovered some exciting, revelatory fact about tokenization. Yes, "Japanese Language" would be a reasonable semantic unit! But these authors didn't discover that fact. Nobody's questioning whether 'good tokenization is better than bad tokenization'. Tokenization has seen ongoing attention in NLP forever.
These authors tried one variant, compared it against a library default option (and nothing else), evaluated on one task, put a bit of marketing around it, and called it a day. In the NLP course I used to TA, this wouldn't even qualify as a complete final project.