
By your first paragraph's argument, the semantics are in the Transformer, not the tokenizer.

And yes, what they do helps on their two test tasks. I'm not disputing that. My objection is that there's no scholarship here.

There are thousands of knobs to twiddle in a model these days, and they went after one that's commonly regarded in the NLP community as the 'defect': the only part of the model that's not trained end-to-end along with the rest. Which would be great, if they acknowledged it! But there's no citation to any tokenization literature beyond BPE or SentencePiece. The literature review is as superficial as what you'd find in a blog post.

There are certainly byte-level or character-level tokenizers (think about CANINE or ByT5), and we can argue back and forth about their data-hungriness or slow inference. It would be nice to give more helpful units to a Transformer, so it doesn't have to learn syllables (or even characters) all on its own. Rebracketing/incorrect segmentation is a problem! And these authors have clued into that, but so have several hundred (or thousand?) researchers they don't cite.
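To make the slow-inference point concrete, here's a rough sketch in plain Python (no tokenizer library; the sample string is just my own illustration, and this is not how CANINE or ByT5 are actually implemented). It only shows how many units a character-level versus a byte-level scheme would hand the Transformer for the same text:

```python
# Illustrative only: compare sequence lengths for naive character-level
# and byte-level "tokenization" of the same string.
text = "Japanese Language 日本語"  # hypothetical sample input

# Character-level: one unit per Unicode code point.
char_tokens = list(text)

# Byte-level: one unit per UTF-8 byte (values 0..255).
byte_tokens = list(text.encode("utf-8"))

print(len(char_tokens), len(byte_tokens))
```

The three kanji cost three bytes each in UTF-8, so the byte-level sequence is longer than the character-level one, and both are far longer than what a subword vocabulary would typically produce. That length is where the inference cost comes from.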

What I'm having trouble with is the notion that this paper uncovered some exciting, revelatory fact about tokenization. Yes, "Japanese Language" would be a reasonable semantic unit! But these authors didn't discover that fact. Nobody's questioning whether 'good tokenization is better than bad tokenization'. Tokenization has seen ongoing attention in NLP forever.

These authors tried one variant, compared it against a library default option (and nothing else), evaluated on one task, put a bit of marketing around it, and called it a day. In the NLP course I used to TA, this wouldn't even qualify as a complete final project.