by tayo42 924 days ago
I wonder if you could use a smaller model or get better results if you treated each card as a token, gave the state of the draft as the input, and had the model predict the token of the card to pick. You would have to train from scratch with a custom tokenizer.
2 comments

I tried adding special tokens for a reddit-style dataset once. The format was: `<|post_author|>username<|post_title|>title here...`

The resulting model was much worse than one trained with everything formatted as plaintext. This was with MPT-30B, 15 special tokens, 300M training tokens, and a full finetune.

I may have made a mistake, but I haven't seen any open source finetunes successfully add a large number of tokens yet either.

Try doing the same thing in your dataset, but don't actually add them as "special tokens"; just let them be tokenized as multiple ordinary tokens.

Adding new tokens needs a ton of data to train what each token means. Reusing existing tokens lets you easily teach the model, through fine-tuning, that a sequence of tokens now has a new meaning.

That's what I ended up doing (`[Author] username [Title] post title...`)
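
A minimal sketch of that plain-text formatting, assuming each record is a list of (label, value) pairs. The function name `format_post` is hypothetical, and only the `[Author]` and `[Title]` labels come from the comment above; any further fields are left to whoever builds the dataset.

```python
def format_post(fields):
    """Render (label, value) pairs as plain text.

    The bracketed labels are ordinary text, so the existing tokenizer splits
    them into tokens it already knows; nothing is added to the vocabulary.
    """
    return " ".join(f"[{label}] {value.strip()}" for label, value in fields)


example = format_post([("Author", "username"), ("Title", "post title")])
print(example)  # [Author] username [Title] post title
```

Because the markers are just text, the same function works with any tokenizer and no embedding matrix needs to be resized.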

> Adding new tokens needs a ton of data to train what the token means.

But how much? 300M tokens is fine for a simple version of ChatML with ~4 tokens, but it was not enough for 15, at least in my case. How does this relationship scale?

Just trying to offer one datapoint for what doesn't work, with the hedge that I might have just had a bug.

I don't know how many tokens are required to get good results, because I simply didn't mark mine as "special_tokens" due to the issues that I had read about. I got great results, whereas others who have tried special tokens got pretty poor results. I'm sure there is a magic number, but it's just not been worth it for me to explore that area yet.

I don't mean adding special tokens; I mean making the vocab only the set of possible cards, so that each card is a token.

A simple input might be `<cards you hold> 1 14 56 </end> <cards to pick> 5 64 2 </end>`, and the predicted token is the draft pick.

Then train a transformer-based network from scratch.
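
A rough sketch of that encoding, assuming cards are numbered from 1 and the three markers get the first token ids. The helper names `build_vocab` and `encode_state`, the set size of 100, and the choice of card 64 as the label are all made up for illustration.

```python
def build_vocab(num_cards):
    """Map the marker strings and every card number to a token id."""
    markers = ["<cards you hold>", "<cards to pick>", "</end>"]
    vocab = {marker: i for i, marker in enumerate(markers)}
    for card in range(1, num_cards + 1):
        # Each card is exactly one token, placed after the markers.
        vocab[card] = len(markers) + card - 1
    return vocab


def encode_state(vocab, held, pack):
    """Encode the draft state as a flat list of token ids."""
    tokens = ["<cards you hold>", *held, "</end>",
              "<cards to pick>", *pack, "</end>"]
    return [vocab[t] for t in tokens]


vocab = build_vocab(num_cards=100)  # 100 is an arbitrary illustrative set size
input_ids = encode_state(vocab, held=[1, 14, 56], pack=[5, 64, 2])
target_id = vocab[64]  # hypothetical label: the card that was actually picked
print(input_ids)  # [0, 3, 16, 58, 2, 1, 7, 66, 4, 2]
print(target_id)  # 66
```

The model would then be trained to predict `target_id` as the next token after `input_ids`, with an output layer only as wide as this small vocabulary.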

I was thinking something fairly similar. You could probably do pretty well with a basic NN setup this way, no need for an LLM. It wouldn't work on "never seen before cards" and would probably make some absurd picks when it's wrong, but I'd bet you could get to 90% accuracy.
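
One way the input to such a basic network could look, as a sketch only: count vectors for the cards held and the cards on offer, concatenated into one feature vector. The helpers `count_vector`, `features`, and `pick` are hypothetical, and restricting the choice to cards in the pack is an assumption added here, not something from the thread.

```python
def count_vector(cards, num_cards):
    """Return a length-num_cards vector counting each card (numbered from 1)."""
    vec = [0.0] * num_cards
    for card in cards:
        vec[card - 1] += 1.0
    return vec


def features(held, pack, num_cards):
    """Concatenate the held-cards vector and the pack vector."""
    return count_vector(held, num_cards) + count_vector(pack, num_cards)


def pick(scores, pack):
    """Choose the highest-scoring card among those actually in the pack.

    `scores` holds one network output per card; cards outside the pack are
    ignored, so the pick is always a card that is on offer.
    """
    return max(pack, key=lambda card: scores[card - 1])


x = features(held=[1, 3], pack=[2], num_cards=4)
print(x)  # [1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 0.0]
print(pick([0.9, 0.1, 0.5, 0.2], pack=[2, 3]))  # 3
```

Because the vectors have one slot per known card, this representation has no way to express a card outside the training set, which matches the "never seen before cards" limitation mentioned above.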