by sailingparrot 702 days ago
> Can you change the tokenizer?

Yes.

You can change it however you like, then look at section 3.2 of the paper [1] to see which hyperparameters were used during training, and finetune the model to work with your new tokenizer using, e.g., the FineWeb dataset [2].

You'll need to do only a fraction of the training you would have needed if you were training from scratch with your tokenizer of choice. The weights released by Meta give you a massive head start and cost savings.
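To make the "head start" concrete, here is a minimal sketch of one common way to carry weights over to a new vocabulary before finetuning: keep the trained embedding row for every token that exists in both vocabularies, and start new tokens from the mean of the old rows. This is an assumption on my part, not something the paper or Meta prescribes; `remap_embeddings` is a hypothetical helper and the vocabularies and vectors are toy data, not real Llama weights.

```python
from statistics import fmean


def remap_embeddings(old_vocab, old_emb, new_vocab):
    """Build an embedding table for new_vocab from an existing table.

    Tokens present in both vocabularies keep their trained row. Tokens that
    only exist in the new vocabulary start from the mean of all old rows.
    Assumes new_vocab ids are contiguous and start at 0.
    """
    dim = len(old_emb[0])
    # Column-wise mean of the old table, used for tokens with no old row.
    mean_row = [fmean(row[d] for row in old_emb) for d in range(dim)]

    new_emb = [None] * len(new_vocab)
    reused = 0
    for token, new_id in new_vocab.items():
        old_id = old_vocab.get(token)
        if old_id is not None:
            new_emb[new_id] = list(old_emb[old_id])  # copy the trained row
            reused += 1
        else:
            new_emb[new_id] = list(mean_row)  # neutral starting point
    return new_emb, reused


# Toy data for illustration only.
old_vocab = {"hello": 0, "world": 1, "!": 2}
old_emb = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
new_vocab = {"hello": 0, "wor": 1, "ld": 2, "!": 3}

new_emb, reused = remap_embeddings(old_vocab, old_emb, new_vocab)
print(reused, "of", len(new_vocab), "embedding rows reused")
```

Everything else in the network (attention, MLP weights) is kept as-is in this sketch; the finetuning run then only has to teach the model the new token boundaries rather than language from scratch.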

The fact that it's not trivial to do and out of reach of most consumers is not a matter of openness. That's just how ML is today.

[1]: https://scontent-sjc3-1.xx.fbcdn.net/v/t39.2365-6/452387774_...

[2]: https://huggingface.co/datasets/HuggingFaceFW/fineweb

1 comments

You can change the tokenizer and build another model, if you can come up with your own version of the rest of the source (e.g., the training set, RLHF, etc.). You can’t change the tokenizer for this model, because you don’t have all of its source.
There is nothing that requires you to train with the same training set, or to redo RLHF. You can train on FineWeb, and Llama 3.1 will learn to use your new tokenizer just fine.

There is zero doubt that you are better off finetuning that model to use your tokenizer than training from scratch. So what Meta gives you for free massively helps you build your model; that's OSS to me.