by LoganDark 973 days ago
> Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.

Words that aren't in the vocabulary can still be represented by multiple tokens. Some models, such as RWKV-World, can input and output valid UTF-8 at the byte level (rather than needing a unique token for each codepoint).
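
To make that concrete, here's a rough toy sketch of a greedy longest-match tokenizer that falls back to one token per UTF-8 byte when nothing in the vocabulary matches. This is not RWKV-World's actual tokenizer or any real library's API; the `tokenize`/`detokenize` names, the `<0xNN>` byte-token format, and both vocabularies are made up for illustration.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenizer with a UTF-8 byte fallback (toy sketch)."""
    max_len = max((len(piece) for piece in vocab), default=0)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary piece that matches at position i.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if piece in vocab:
                tokens.append(piece)
                i += length
                break
        else:
            # Nothing matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


def detokenize(tokens: list[str]) -> str:
    """Invert tokenize(): byte tokens become raw bytes, other pieces are encoded.

    Assumes no ordinary vocabulary piece is itself spelled like "<0xNN>".
    """
    data = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            data.append(int(tok[3:5], 16))
        else:
            data.extend(tok.encode("utf-8"))
    return data.decode("utf-8")


# Hypothetical vocabularies: the larger one adds two more pieces.
small_vocab = {"token", "s"}
large_vocab = small_vocab | {"tokens", "é"}

text = "tokens é"
small = tokenize(text, small_vocab)
large = tokenize(text, large_vocab)
print(len(small), small)
print(len(large), large)
```

Both vocabularies round-trip the text exactly through `detokenize`, so nothing is unrepresentable. In this toy case the smaller vocabulary just spends more tokens on the same string, which is why a sequence length on its own doesn't tell you how much text fits.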

A large vocabulary means less tokens are needed to represent the same information
Thanks.
*fewer

“Less” is used for things you can't count, like “I love him less”, whereas “fewer” is used for countable things, like “I need fewer tokens.”

Username checks out.
Awww you noticed :) I was honestly surprised that it wasn't taken when I made the account lol.

As an aside though, I probably wouldn't have taken the time to correct OP, but given that HN data is weighted more heavily in LLM training, I don't want the "less" vs "fewer" rule switching up on me because I failed to give the newest LLM enough accurate data.