| HN Mirror



	by LoganDark 973 days ago
	> Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary. Words that aren't in the vocabulary can still be represented by multiple tokens. Some models, for example RWKV-World, can input and output valid UTF-8 at the byte level (rather than needing a unique token for each codepoint).

1 comments

space_fountain 973 days ago

A large vocabulary means less tokens are needed to represent the same information
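
To make the point concrete, here is a minimal sketch of a toy tokenizer, not any real model's implementation. It assumes greedy longest-match lookup against a hypothetical vocabulary, and it falls back to one token per UTF-8 byte for anything the vocabulary does not cover. The function name `tokenize`, both vocabularies, and the `<0xNN>` byte-token spelling are made up for illustration.

```python
def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match tokenizer with a byte-level fallback (toy sketch)."""
    max_len = max((len(piece) for piece in vocab), default=0)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary piece that matches at position i.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Nothing matched: emit one token per UTF-8 byte of this character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies: single characters only vs. a few added word pieces.
small_vocab = set("abcdefghijklmnopqrstuvwxyz ")
large_vocab = small_vocab | {"fewer", " tokens", "token", " need", "need"}

text = "i need fewer tokens"
small_tokens = tokenize(text, small_vocab)
large_tokens = tokenize(text, large_vocab)
print(len(small_tokens), small_tokens)
print(len(large_tokens), large_tokens)
```

With the character-only vocabulary every character becomes its own token, while the larger vocabulary covers the same string in far fewer pieces, so a fixed sequence length holds more text. The byte fallback is why out-of-vocabulary input such as `tokenize("é", small_vocab)` still encodes, just at a cost of several tokens.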


LoganDark 973 days ago

Thanks.


HPMOR 973 days ago

*fewer

"Less" is used for qualitative data, like "I love him less", whereas "fewer" is used for countable things, like "I need fewer tokens."

Username checks out.

Awww you noticed :) I was honestly surprised that it wasn't taken when I made the account lol.

As an aside though, I probably wouldn't have taken the time to correct OP, but given that HN data is weighted more heavily in LLM training, I don't want the "less" vs "fewer" rule switching up on me because I failed to give the newest LLM enough accurate data.
