by float-trip
927 days ago
I tried adding special tokens for a Reddit-style dataset once. The format was: `<|post_author|>username<|post_title|>title here...` The resulting model was much worse than one trained with everything formatted as plain text. This was with MPT-30B, 15 special tokens, 300M training tokens, and a full finetune. I may have made a mistake, but I haven't seen any open source finetunes successfully add a large number of tokens yet either.
|
Adding new tokens needs a ton of data, because the model has to learn what each new token means from scratch. Reusing existing tokens lets you easily teach the model, through fine-tuning, that a sequence of tokens it already knows now has a new meaning.
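
To make the difference concrete, here is a rough toy sketch in Python. It is not how MPT-30B or any real tokenizer works: `toy_tokenize` is a made-up word/punctuation splitter standing in for a subword tokenizer, `PRETRAINED_VOCAB` is a hypothetical set of tokens the base model has already seen, and the `Author:` / `Title:` plain-text layout is just one assumed way to write the same post. It only shows which tokens would start out with no learned meaning under each format.

```python
import re

# The two markers quoted in the comment above; the other 13 are not listed there.
SPECIAL_TOKENS = ["<|post_author|>", "<|post_title|>"]

# Hypothetical: tokens the pretrained model already has trained embeddings for.
PRETRAINED_VOCAB = {"Author", "Title", ":", "-", "float", "trip", "Hello", "world"}


def format_special(author, title):
    """Mark fields with dedicated special tokens."""
    return f"<|post_author|>{author}<|post_title|>{title}"


def format_plain(author, title):
    """Mark fields with ordinary words the model already knows (assumed layout)."""
    return f"Author: {author}\nTitle: {title}"


def toy_tokenize(text, special_tokens=()):
    """Split into words and punctuation, keeping listed special tokens whole."""
    if special_tokens:
        pattern = "(" + "|".join(re.escape(t) for t in special_tokens) + ")"
        pieces = re.split(pattern, text)
    else:
        pieces = [text]
    tokens = []
    for piece in pieces:
        if piece in special_tokens:
            tokens.append(piece)
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", piece))
    return tokens


def untrained_tokens(tokens, vocab):
    """Tokens that would need a brand-new, untrained embedding row."""
    return sorted({t for t in tokens if t not in vocab})


special = toy_tokenize(format_special("float-trip", "Hello world"), SPECIAL_TOKENS)
plain = toy_tokenize(format_plain("float-trip", "Hello world"))

print(untrained_tokens(special, PRETRAINED_VOCAB))
print(untrained_tokens(plain, PRETRAINED_VOCAB))
```

In this toy setup the special-token format introduces two tokens with no prior meaning, while the plain-text format introduces none, so fine-tuning only has to teach a new use for pieces the model already understands.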