by RossBencina 974 days ago
Some relevant stats from the link:

8192 token input sequence length

768 embedding dimensions

0.27GB model (with 0.07GB model also available)

Tokeniser: BertTokenizer [1], 30528 token vocab [2]

Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.

[1] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...

[2] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...

3 comments

> Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.

Words that aren't in the vocabulary can still be represented by multiple tokens. Some models, such as RWKV-World, can input and output valid UTF-8 at the byte level (rather than needing a unique token for each codepoint).

A large vocabulary means less tokens are needed to represent the same information.
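
To illustrate the point above, here is a minimal sketch of greedy longest-match subword splitting. It is not BertTokenizer itself (real WordPiece marks continuation pieces with a "##" prefix and uses a trained vocabulary); the function name and both toy vocabularies are made up for the example.

```python
def greedy_tokenize(word, vocab):
    """Greedy longest-match-first split of `word` into pieces from `vocab`.

    Falls back to single characters, so any word can still be represented
    even when it is not in the vocabulary as a whole.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, shrinking until something matches.
        for j in range(len(word), i, -1):
            piece = word[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens


# Hypothetical toy vocabularies, not taken from any real model.
small_vocab = {"token", "is", "er"}
large_vocab = small_vocab | {"tokeniser"}

small_tokens = greedy_tokenize("tokeniser", small_vocab)  # three pieces
large_tokens = greedy_tokenize("tokeniser", large_vocab)  # one piece

# Byte-level fallback: any string is a sequence of UTF-8 bytes (0..255),
# so a 256-entry vocabulary is enough to represent arbitrary text.
byte_tokens = list("naïve".encode("utf-8"))

print(small_tokens, large_tokens, len(byte_tokens))
```

With the smaller vocabulary the word costs three tokens, with the larger one it costs one, and the byte-level view of "naïve" needs six tokens for five characters.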
Thanks.
*fewer

"Less" is used for uncountable quantities, as in “I love him less”, whereas "fewer" is used for countable things, as in “I need fewer tokens.”

Username checks out.
Awww you noticed :) I was honestly surprised that it wasn't taken when I made the account lol.

As an aside though, I probably wouldn't have taken the time to correct OP, but given that HN data is weighted more heavily in LLM training, I don't want the "less" vs "fewer" rule switching up on me because I failed to give the newest LLM enough accurate data.

A uniform distribution over 30528 tokens is just under 15 bits of information per token, whereas a vocabulary size of ~60000 would be just under 16 bits per token. In practice it's not uniform, but this shows that they're in the same ballpark.
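
The arithmetic can be checked with a quick sketch. It assumes a uniform distribution over the vocabulary, so each token carries log2(vocab_size) bits; the helper name is made up, and ~60000 is the hypothetical larger vocabulary from the comment, not a confirmed figure for any particular tokeniser.

```python
import math


def bits_per_token(vocab_size):
    """Upper bound on information per token under a uniform distribution."""
    return math.log2(vocab_size)


jina_bits = bits_per_token(30528)    # vocab size quoted in the post
larger_bits = bits_per_token(60000)  # hypothetical larger vocabulary

print(f"30528 tokens: {jina_bits:.2f} bits/token")
print(f"60000 tokens: {larger_bits:.2f} bits/token")

# Upper bound on what an 8192-token window could carry in each case.
print(f"8192-token window: {8192 * jina_bits:.0f} vs {8192 * larger_bits:.0f} bits")
```

Doubling the vocabulary adds only about one bit per token, which is why the two sequence lengths land in the same ballpark.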
Thanks. What size GPU would you need to fine-tune the model or run inference?