Hacker News
by tlrobinson 1165 days ago
I think yes, but more precisely the tokens were chosen to optimize training on a dataset that's biased to English content.

I am curious how the token set affects the quality of responses, ignoring the factors related to token count mentioned in the post (cost, prompt expressivity, latency, etc.).

Is it always better for the token set to be "native" to the majority of the training dataset and prompts/completions, or is it possible there's some "intermediate representation" (in compiler terms) that would be better?

1 comment

I don't know what you mean by compiler terms, but basically, worse tokenizer = worse LM performance. A worse tokenizer means more tokens per sentence, so it takes more FLOPs to train on each sentence, on average. So given a fixed training budget, English essentially gets more "learning per token" than other languages.
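
To put rough numbers on that, here is a toy sketch in Python. Everything in it is an assumption made up for illustration: the tiny English-biased vocabulary, the byte-level fallback, the example sentences, and the budget figures. The "6 FLOPs per parameter per token" figure is only the common rule of thumb for training cost, not something from the post.

```python
# Toy illustration only: the vocabulary, sentences, and budget are all hypothetical.

# A made-up vocabulary that happens to cover common English pieces.
VOCAB = {"The", "the", " the", " cat", " sat", " on", " mat"}
MAX_PIECE = max(len(piece) for piece in VOCAB)


def count_tokens(text: str) -> int:
    """Greedy longest-match against VOCAB; any unmatched character
    falls back to one token per UTF-8 byte."""
    i, n = 0, 0
    while i < len(text):
        for size in range(min(MAX_PIECE, len(text) - i), 0, -1):
            if text[i:i + size] in VOCAB:
                i += size
                n += 1
                break
        else:
            # No vocabulary piece starts here: pay per byte.
            n += len(text[i].encode("utf-8"))
            i += 1
    return n


def sentences_per_budget(flop_budget: float, params: float,
                         tokens_per_sentence: int) -> float:
    """Sentences trainable under a fixed budget, assuming roughly
    6 FLOPs per parameter per training token (rule of thumb)."""
    flops_per_sentence = 6 * params * tokens_per_sentence
    return flop_budget / flops_per_sentence


english = "The cat sat on the mat"
german = "Die Katze saß auf der Matte"

for name, sentence in [("english", english), ("german", german)]:
    tokens = count_tokens(sentence)
    seen = sentences_per_budget(1e18, 1e9, tokens)
    print(f"{name}: {tokens} tokens, ~{seen:.3g} sentences per 1e18 FLOPs")
```

With this toy vocabulary the German sentence costs several times more tokens than the English one, so the same FLOP budget covers proportionally fewer German sentences. A real BPE tokenizer is far less lopsided, but the direction of the effect is the same.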