Hacker News new | ask | show | jobs
by float-trip 927 days ago
I tried adding special tokens for a reddit-style dataset once. The format was: `<|post_author|>username<|post_title|>title here...`

The resulting model was so much worse than just formatting everything plaintext. This was with MPT-30B, 15 special tokens, 300M training tokens, and a full finetune.

I may have made a mistake, but I haven't seen any open source finetunes successfully add a large number of tokens yet either.

2 comments

Try doing the same thing in your dataset, but don't actually add them as "special tokens", and just let them just be multiple tokens.

Adding new tokens needs a ton of data to train what the token means. Reusing existing tokens, will allow you to easily teach that a sequence of tokens now has a new meaning after fine tuning.

That's what I ended up doing (`[Author] username [Title] post title...`)

> Adding new tokens needs a ton of data to train what the token means.

But how much? 300M tokens is fine for a simple version of ChatML with ~4 tokens. Not for 15, at least in my case. How's this relationship scale?

Just trying to offer one datapoint for what doesn't work, with the hedge that I might have just had a bug

I don't know how many tokens are required to get good results, because I simply didn't mark mine as "special_tokens" due to the issues that I had read about. I got great results, whereas others who have tried special tokens got pretty poor results. I'm sure there is a magic number, but it's just not been worth it for me to explore that area yet.
I don't mean add special tokens, but make the vocab only the set of possible cards. each card is a token.

a simple input might be <cards you hold> 1 14 56</end><cards to pick> 5 64 2</end> -> predicted token is the draft pick.

Then train a transformer based network from scratch.