HN Mirror

by RossBencina 974 days ago

Some relevant stats from the link:

8192 token input sequence length

768 embedding dimensions

0.27GB model (with 0.07GB model also available)

Tokeniser: BertTokenizer [1], 30528 token vocab [2]

Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.

[1] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...

[2] https://huggingface.co/jinaai/jina-embeddings-v2-base-en/blo...

3 comments

LoganDark 974 days ago

> Is an 8K sequence length directly comparable to text-embedding-ada-002 if the vocabulary is much smaller? I seem to remember its tokeniser has a larger vocabulary.

Words that aren't in the vocabulary can still be represented by multiple tokens. Some models can input and output valid UTF-8 at the byte level (rather than needing a unique token for each codepoint). For example RWKV-World.
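To make that concrete, here is a minimal sketch of greedy longest-match subword splitting in the style of WordPiece, plus a byte-level fallback. The `wordpiece` helper and `TOY_VOCAB` are hypothetical and made up for illustration; this is not the actual BertTokenizer code or the model's real vocabulary.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into subword tokens."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matched at this position
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary: "tokenisers" is not in it as a whole word.
TOY_VOCAB = {"token", "##iser", "##s", "em", "##bed", "##ding"}

tokens = wordpiece("tokenisers", TOY_VOCAB)
print(tokens)  # ['token', '##iser', '##s']

# Byte-level alternative: any string maps to UTF-8 bytes, so a vocabulary
# covering the 256 byte values never needs an unknown token.
byte_ids = list("naïve".encode("utf-8"))
print(byte_ids)  # [110, 97, 195, 175, 118, 101]
```

The out-of-vocabulary word costs three tokens instead of one, which is why a smaller vocabulary tends to use up a fixed sequence length faster.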

space_fountain 974 days ago

A large vocabulary means less tokens are needed to represent the same information

LoganDark 974 days ago

Thanks.

HPMOR 974 days ago

*fewer

Less is used for qualitative data like “I love him less”, whereas fewer is used for countable things like “I need fewer tokens.”

Username checks out.

Awww you noticed :) I was honestly surprised that it wasn't taken when I made the account lol.

As an aside though, I probably wouldn't have taken the time to correct OP, but given that HN data is weighted more heavily in LLM training, I don't want the "less" vs "fewer" rule switching up on me because I failed to give the newest LLM enough accurate data.

DavidSJ 974 days ago

A uniform distribution over 30528 tokens is just under 15 bits of information per token, whereas a vocabulary size of ~60000 would be just under 16 bits per token. In practice it's not uniform, but this shows that they're in the same ballpark.
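The arithmetic is easy to check. This is a rough sketch under the same uniform-distribution assumption, so it gives an upper bound per token rather than a measurement of either tokeniser; the `bits_per_token` helper is just a name chosen for illustration, and 60000 is the approximate figure used above.

```python
import math


def bits_per_token(vocab_size):
    """Upper bound on information per token: log2 of the vocabulary size."""
    return math.log2(vocab_size)


small = bits_per_token(30528)  # just under 15 bits
large = bits_per_token(60000)  # just under 16 bits

# How much more one token from the larger vocabulary can carry at most.
ratio = large / small

print(f"{small:.2f} bits vs {large:.2f} bits, ratio {ratio:.3f}")
```

Under this assumption the larger vocabulary carries only about 7% more per token, so the two sequence lengths are in the same ballpark.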

rajin112 974 days ago

Thanks. What size GPU would you need to fine-tune it or run inference?
