by ubutler
775 days ago
I can think of two reasons why someone might reuse a tokeniser:

1. They want to continue pretraining an existing model instead of starting from scratch. That said, people might not know that you can fairly easily reuse model weights even when training with a new tokeniser (I've got a blog post on how to do that: https://umarbutler.com/how-to-reuse-model-weights-when-train... ).

2. It's convenient for end users. Tokenising and chunking really large corpora can take a long time, so it's nice that I can tokenise once with the GPT-2 tokeniser and then train a bunch of different models on that data without having to retokenise everything.
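To make the first point a bit more concrete, here is a rough sketch of one common way to carry embedding weights over to a new tokeniser: copy the row for every token the two vocabularies share, and initialise the rest with the mean of the old rows. This is not necessarily the method described in the linked post. The function name `transfer_embeddings`, the toy vocabularies and the mean-initialisation fallback are all assumptions for illustration, and it assumes the new vocabulary's IDs run contiguously from 0.

```python
def transfer_embeddings(old_vocab, old_emb, new_vocab):
    """Build an embedding table for new_vocab, reusing rows from old_emb.

    old_vocab / new_vocab: dicts mapping token string -> row index.
    old_emb: list of rows (lists of floats) indexed by old_vocab IDs.
    Assumes new_vocab IDs are contiguous, starting at 0 (illustrative only).
    """
    dim = len(old_emb[0])
    # Mean of all old rows: fallback for tokens the old tokeniser lacked.
    mean_row = [sum(row[j] for row in old_emb) / len(old_emb) for j in range(dim)]

    new_emb = [None] * len(new_vocab)
    reused = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Shared token: copy its trained embedding across.
            new_emb[new_id] = list(old_emb[old_id])
            reused += 1
        else:
            # Unseen token: start from the average old embedding.
            new_emb[new_id] = list(mean_row)
    return new_emb, reused


# Toy example: "the" and "cat" are shared, "mat" is new.
old_vocab = {"the": 0, "cat": 1, "sat": 2}
old_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
new_vocab = {"cat": 0, "the": 1, "mat": 2}

new_emb, reused = transfer_embeddings(old_vocab, old_emb, new_vocab)
print(f"reused {reused} of {len(new_vocab)} rows")
```

In a real model the same idea would apply to the input embedding matrix (and the output projection if it is tied), while the remaining transformer weights can be loaded unchanged because they don't depend on the vocabulary.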