by nl
55 days ago
This is almost certainly wrong. Case-sensitive language models have been a thing since way before neural language models. I was using them with boosted tree models at least ten years ago, and even my Java NLP tool did this twenty years ago (damn!). There is no novelty there, of course: I based that on PG's "A Plan for Spam". See for example CountVectorizer: https://scikit-learn.org/stable/modules/generated/sklearn.fe...

The bitter lesson says you are much better off just adding more data and learning the tokenizer. It's not impossible that the new Opus tokenizer is based on something learnt during Mythos pre-training (maybe it is *the* learned Mythos tokenizer?), and it seems likely that the Mythos pre-training run is the most data ever trained on. Putting an inductive bias in your tokenizer just seems like a terrible idea.
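To make the CountVectorizer point concrete: as far as I know it exposes a `lowercase` parameter (default `True`), so switching it off gives you case-sensitive features. Below is a rough stdlib-only Python sketch of the same idea, not scikit-learn's actual code; `bag_of_words` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text, lowercase):
    """Count alphabetic word tokens, optionally folding case first.

    Hypothetical helper for illustration only; it mimics the effect of a
    lowercase on/off switch in a bag-of-words featurizer.
    """
    if lowercase:
        text = text.lower()
    return Counter(re.findall(r"[A-Za-z]+", text))


sample = "Apple shipped a new apple-picking robot. US sales beat us."

cased = bag_of_words(sample, lowercase=False)   # "Apple" and "apple" stay distinct
folded = bag_of_words(sample, lowercase=True)   # both collapse to "apple"

print(cased["Apple"], cased["apple"], folded["apple"])
print(cased["US"], cased["us"], folded["us"])
```

The cased version keeps "US" (the country) and "us" (the pronoun) as separate features, which is exactly the signal a downstream model such as a boosted tree classifier can pick up on without any special tokenizer design.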
|
This is similar to what the TokenMonster tokenizer does: https://github.com/alasdairforsythe/tokenmonster