by liliumregale
1145 days ago
I'm going to add a contrarian take here: this preprint is not a research paper. While it's nice to see that there is an improvement here on their one task, this is not "semantically" driven tokenization. It's morphologically driven. To be semantically driven, it would be reasonable to expect that synonyms would have similar representations. I got really excited from the title, and the content is a let-down.

The line of research here has been going on for 30+ years, from Michael Brent's work, to Linguistica, to Morfessor, and now several approaches to incorporate morphology into tokenizers. The stand-out example is [0]. This paper doesn't seem to acknowledge any of that intellectual legacy. It's not a _research_ paper.

I'm getting a bit tired of people putting their class projects or quick engineering projects on arXiv. I don't know why they're surfacing so high on HN either.

[0]: https://aclanthology.org/2021.acl-long.279/
You're right that what they are doing is morphological, not semantic, but it helps a lot. I would say that "Japanese Language" is a good token to apply embedding, attention, etc. to, because it has a definite meaning to which the transformer can attach whatever syntax and semantics it learns in terms of activations. If BPE gives up and processes it as UTF-8 bytes, there is no clear meaning for any one of those tokens, and the model is going to have to work a lot harder.
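To make the byte-fallback point concrete, here is a minimal sketch, not any real tokenizer's API. It assumes the word in question is 日本語 ("Japanese language") and that a tokenizer with no learned merge for it emits one token per UTF-8 byte. The helper name `byte_fallback_tokens` and the `<0x..>` token spelling are made up for illustration.

```python
def byte_fallback_tokens(text: str) -> list[str]:
    """Hypothetical byte-level fallback: emit one token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


# Assumed example word; each of its three characters takes 3 bytes in UTF-8.
word = "日本語"
tokens = byte_fallback_tokens(word)

print(len(word), "characters ->", len(tokens), "byte tokens")
print(tokens)
```

None of the individual byte tokens maps to a character on its own, let alone to the word, so the model would have to reassemble the meaning across several positions instead of reading it off a single embedding.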