by luke-stanley 974 days ago
When I go to this leaderboard: https://huggingface.co/spaces/mteb/leaderboard and click on the "Classification" tab, I see "jina-embeddings-v2-base-en" at number 12, with an average score of 73.45. The highest-scoring model there is llmrails/ember-v1, with an average score of 75.99, but it only supports 512 tokens. So if you need to embed 8K tokens, I guess the Jina model is the best option. Do people need 8K tokens for embedding? Maybe not, but they might need more than 512 often enough. It could save a summary extraction step.
1 comments

A small context window means you cannot embed the whole document; you are embedding only part of it.

So if some information near the bottom depends on something near the top, your embedding could be entirely wrong.
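
To make that concrete, here is a rough sketch of what a 512-token limit does to a longer document. It is only an illustration: it treats whitespace-separated words as tokens (real models use subword tokenizers, so actual counts differ), and the function names, the sample document, and the overlap value are all made up for the example.

```python
def truncate_to_window(text: str, max_tokens: int) -> str:
    """Keep only the first max_tokens tokens, as a hard context limit would."""
    tokens = text.split()  # assumption: whitespace words stand in for real tokens
    return " ".join(tokens[:max_tokens])


def split_into_windows(text: str, max_tokens: int, overlap: int = 0) -> list[str]:
    """Split text into consecutive windows of at most max_tokens tokens."""
    step = max_tokens - overlap
    if max_tokens <= 0 or step <= 0:
        raise ValueError("max_tokens must be positive and larger than overlap")
    tokens = text.split()
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return windows


# Hypothetical document: a definition at the top, a dependent clause at the bottom.
doc = "ACME is the supplier. " + "filler " * 600 + "ACME must pay within 30 days."

truncated = truncate_to_window(doc, 512)
windows = split_into_windows(doc, 512, overlap=32)

print("30 days." in truncated)    # the closing clause is cut off entirely
print(len(windows))               # number of chunks needed to cover the document
print("supplier." in windows[-1]) # does the last chunk still see the definition?
```

With truncation, the clause at the bottom never reaches the model at all. With chunking, every token is covered, but the last chunk contains the payment clause without the definition it depends on, so each chunk's embedding is still computed without the other's context. A longer window avoids both problems for documents that fit inside it.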