

by ubutler 775 days ago

There are two reasons I can think of why someone might reuse a tokeniser:

1. They want to continue pretraining a model instead of starting from scratch. That said, people might not know that you can pretty easily reuse model weights even when training with a new tokeniser (I’ve got a blog post on how to do that: https://umarbutler.com/how-to-reuse-model-weights-when-train... ).

2. It’s convenient for end users. Tokenising and chunking really large corpora can take a long time, so it’s nice that I can tokenise once with the GPT-2 tokeniser and then train a bunch of different models on that data without having to retokenise everything.
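
To make the first point a bit more concrete, here is a minimal sketch of one common way to carry embedding weights over to a new tokeniser: copy the row for every token that exists in both vocabularies, and fall back to the mean of the old embeddings for tokens that are new. This is an assumption about how it could be done, not necessarily the method in the linked blog post. The function name `remap_embeddings` and the toy vocabularies are hypothetical, and the sketch assumes the new vocabulary's ids run contiguously from 0.

```python
def remap_embeddings(old_vocab, old_emb, new_vocab):
    """Build an embedding table for new_vocab, reusing rows of old_emb where possible.

    old_vocab / new_vocab: dicts mapping token string -> integer id (hypothetical format).
    old_emb: list of embedding rows, indexed by old id.
    Returns (new_emb, number_of_rows_reused).
    """
    dim = len(old_emb[0])
    # Fallback for unseen tokens: the mean of all old embedding rows.
    mean = [sum(row[d] for row in old_emb) / len(old_emb) for d in range(dim)]

    new_emb = [None] * len(new_vocab)
    reused = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            # Token exists in both vocabularies: copy its trained row.
            new_emb[new_id] = list(old_emb[old_id])
            reused += 1
        else:
            # Token is new: start from the mean embedding.
            new_emb[new_id] = list(mean)
    return new_emb, reused


# Toy example with 2-dimensional embeddings.
old_vocab = {"the": 0, "cat": 1, "sat": 2}
old_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
new_vocab = {"cat": 0, "dog": 1, "the": 2}

new_emb, reused = remap_embeddings(old_vocab, old_emb, new_vocab)
print(new_emb)  # [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
print(reused)   # 2
```

In a real model the rest of the weights (attention, MLPs, and so on) do not depend on the vocabulary, so under this approach only the input embeddings, and the output projection if it is tied or vocabulary-sized, would need remapping before continuing training.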