|
|
|
|
|
by Tostino
924 days ago
|
|
Try doing the same thing in your dataset, but don't actually add them as "special tokens", and just let them just be multiple tokens. Adding new tokens needs a ton of data to train what the token means. Reusing existing tokens, will allow you to easily teach that a sequence of tokens now has a new meaning after fine tuning. |
|
> Adding new tokens needs a ton of data to train what the token means.
But how much? 300M tokens is fine for a simple version of ChatML with ~4 tokens. Not for 15, at least in my case. How's this relationship scale?
Just trying to offer one datapoint for what doesn't work, with the hedge that I might have just had a bug