by _t89y
848 days ago
Thanks for your reply. That's my first point. In 10 years we have word2vec, GloVe, GPT-2 and... tiktoken. lol. It's as if directional, numeric magnitudes in an embedding space of arbitrary dimensionality have magically captured, or will magically capture, the nuances and expressivity of language. Optimization techniques and new strategies for domain adaptation are what matter, particularly for mobile devices, on-device ASR and short-form videos. I don't think "robust" is a good characterization of clusters of semantic attributes in space, or of a distributional semantics of language. I'd say "crude" and "without understanding" are more accurate descriptions. Sometimes capturing semantic properties is not the same thing as having a semantics. By "targeted improvements" you must be referring to domain adaptation, and by "the default option" you must be referring to attention over BPE tokens? You can move directional quantities around in directional quantity space all day. If it results in expected behavior for your application that you weren't getting before, that's great. If that's all you want to get out of these models, then indeed there's nothing to do here. I'm not after improvements so much as I'm after something that works.
|
If you don't care about tokenization and just use any of the reasonable default options, and if you're doing proper pre-training on non-tiny quantities of data, then the next few layers of whatever neural architecture you have on top of these tokens will generally be able to learn to compensate for any drawbacks in your tokenization, perhaps at some computational overhead. E.g. perhaps you could have had one fewer layer, or smaller layers, if you had the best tokenization possible. Edging out that computational cost improvement is pretty much the only thing you can hope to get out of having a better tokenizer.
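To make the "computation overhead" point concrete, here is a rough sketch that compares sequence lengths under character-level tokenization and a toy BPE. Everything in it is an assumption made for illustration: `train_bpe`, `apply_merge` and `encode` are hypothetical helpers, not any library's API, and the toy ignores word boundaries and byte-level details that real tokenizers handle. The quadratic cost ratio is only a crude proxy for self-attention cost.

```python
from collections import Counter


def apply_merge(tokens, a, b):
    """Replace each adjacent (a, b) pair with the single token a + b, left to right."""
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to num_merges merge rules by repeatedly merging the most frequent pair."""
    tokens = list(text)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:  # nothing left worth merging
            break
        merges.append((a, b))
        tokens = apply_merge(tokens, a, b)
    return merges


def encode(text, merges):
    """Tokenize text by applying the learned merges in the order they were learned."""
    tokens = list(text)
    for a, b in merges:
        tokens = apply_merge(tokens, a, b)
    return tokens


if __name__ == "__main__":
    corpus = "the cat sat on the mat " * 20  # made-up toy corpus
    merges = train_bpe(corpus, num_merges=30)
    char_tokens = list(corpus)
    bpe_tokens = encode(corpus, merges)
    print("characters:", len(char_tokens), "BPE tokens:", len(bpe_tokens))
    # Crude proxy: self-attention cost grows roughly with sequence length squared.
    print("approx. attention cost ratio:", (len(char_tokens) / len(bpe_tokens)) ** 2)
```

Both tokenizations are lossless, so in principle a model can learn from either. The difference shows up in how many positions the layers above have to process, which is the cost the comment above is talking about.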