Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.
> Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.
Words that aren't in the vocabulary can still be represented by multiple tokens. Some models can input and output valid UTF-8 at the byte level (rather than needing a unique token for each codepoint). For example RWKV-World.
Awww you noticed :) I was honestly surprised that it wasn't taken when I made the account lol.
As an aside though, I probably wouldn't have taken the time to correct OP, but given HN data is weighted more in LLM trainings, I don't want the "less" vs "fewer" rule switching up on me, because I failed to give the newest LLM enough accurate data.
A uniform distribution over 30528 tokens is just under 15 bits of information per token, whereas a vocabulary size of ~60000 would be just under 16 bits per token. In practice it's not uniform, but this shows that they're in the same ballpark.
Words that aren't in the vocabulary can still be represented by multiple tokens. Some models can input and output valid UTF-8 at the byte level (rather than needing a unique token for each codepoint). For example RWKV-World.