| HN Mirror

by jncraton 1007 days ago

This is great to see. It looks like the size of the embedding vector is half the size of text-embedding-ada-002 (768 vs 1536) while providing competitive performance. This will save space in databases and make lookups somewhat faster.

For those unaware, if 512 tokens of context is sufficient for your use case, there are already many options that outperform text-embedding-ada-002 on common benchmarks:

https://huggingface.co/spaces/mteb/leaderboard

minimaxir 1007 days ago

The 768D embeddings, compared to OpenAI's 1536D embeddings, are actually a feature beyond just index size.

In my experience, OpenAI's embeddings are overspecified and do very poorly with cosine similarity out of the box, as they match syntax more than semantic meaning (which matters because cosine similarity is the usual metric for RAG). Ideally you'd want cosine similarities spread across the range [-1, 1] on a variety of data, but in my experience the results fall in [0.6, 0.8].
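
One way to check how wide the similarity range is on your own data is to compute every pairwise cosine similarity over a sample of embeddings and look at the minimum and maximum. The sketch below is illustrative only: the vectors are synthetic stand-ins rather than real OpenAI embeddings, the helper names `cosine_similarity` and `similarity_range` are hypothetical, and the shared offset is just one assumed mechanism that squeezes similarities into a narrow positive band.

```python
import math
import random


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similarity_range(vectors):
    """Return (min, max) cosine similarity over all distinct pairs."""
    sims = [
        cosine_similarity(vectors[i], vectors[j])
        for i in range(len(vectors))
        for j in range(i + 1, len(vectors))
    ]
    return min(sims), max(sims)


rng = random.Random(0)
dim, count = 64, 50

# Synthetic stand-in data: zero-centered noise in every dimension.
centered = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

# Same noise plus a constant offset shared by all vectors (an assumption
# made for illustration, not a claim about any particular model).
shifted = [[x + 1.5 for x in vec] for vec in centered]

lo_centered, hi_centered = similarity_range(centered)
lo_shifted, hi_shifted = similarity_range(shifted)
print(f"centered: [{lo_centered:.2f}, {hi_centered:.2f}]")
print(f"shifted:  [{lo_shifted:.2f}, {hi_shifted:.2f}]")
```

Running the same `similarity_range` call on a sample of real embeddings would show whether they occupy a similarly narrow band.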

TeMPOraL 1007 days ago

Unless I'm missing something, it should be possible to map out in advance which dimensions represent syntactic aspects, and then downweight or remove them for similarity comparisons. And that map should be a function of the model alone, i.e. fully reusable. Are there any efforts to map out the latent space of ada models like that?

karxxm 1007 days ago

You wrote "out of the box". Did you find a way to improve this?

teaearlgraycold 1007 days ago

You can do PCA or some other dimensionality reduction technique. That’ll reduce computation and improve signal/noise ratio when comparing vectors.

karxxm 1007 days ago

Unfortunately this is not feasible with a large number of words due to the quadratic scaling. But thanks for the response!

minimaxir 1007 days ago

Not sure what you mean by a large number of words. You can fit a PCA on millions of vectors relatively performantly; inference from it is then just a matmul.
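
As a rough illustration of "fit once, then project with a matmul", here is a minimal NumPy sketch. It assumes the embeddings fit in memory as an `(n, d)` array and uses random data as a stand-in for real embeddings; `fit_pca` and `transform` are hypothetical helper names, not any library's API. The fit works on the `d x d` covariance matrix, so its size depends on the embedding dimension rather than on the number of vectors.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit PCA on X (shape n x d) via the d x d covariance matrix.

    Returns the per-dimension mean and a d x n_components projection matrix.
    """
    mean = X.mean(axis=0)
    centered = X - mean
    # Covariance over features: shape (d, d), independent of n.
    cov = (centered.T @ centered) / (X.shape[0] - 1)
    # eigh handles symmetric matrices and returns eigenvalues ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order]


def transform(X, mean, components):
    """Project vectors onto the fitted components: a single matmul."""
    return (X - mean) @ components


rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 64))  # stand-in for real embeddings

mean, components = fit_pca(X, n_components=16)
Z = transform(X, mean, components)
print(Z.shape)
```

For data too large to hold in memory, the covariance matrix could in principle be accumulated over chunks; that variant is not shown here.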

karxxm 1004 days ago

Not true. You need a distance matrix (for classical PCA it's a covariance matrix), which scales quadratically with the number of points you want to compare. If you have 1 million vectors, with each pair contributing a float entry in the matrix, you end up with approximately (10^6)^2 / 2 unique values, which is roughly 2000 GB of memory at 4 bytes per float.
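
The memory figures in this exchange can be checked with a little arithmetic. The sketch below assumes 4-byte floats and uses hypothetical function names. A pairwise `n x n` matrix over one million points does come to roughly 2000 GB, whereas the covariance matrix that classical PCA decomposes is `d x d` in the embedding dimension (1536 for text-embedding-ada-002), so it does not grow with the number of vectors.

```python
def pairwise_matrix_bytes(n_points, bytes_per_value=4):
    """Bytes for the unique off-diagonal entries of an n x n symmetric matrix."""
    unique_pairs = n_points * (n_points - 1) // 2
    return unique_pairs * bytes_per_value


def covariance_matrix_bytes(n_dims, bytes_per_value=4):
    """Bytes for a dense d x d covariance matrix over the feature dimensions."""
    return n_dims * n_dims * bytes_per_value


n_points = 1_000_000  # number of vectors in the example above
n_dims = 1536         # text-embedding-ada-002 dimension

pairwise_gb = pairwise_matrix_bytes(n_points) / 1e9
covariance_mb = covariance_matrix_bytes(n_dims) / 1e6
print(f"pairwise matrix:   {pairwise_gb:.0f} GB")
print(f"covariance matrix: {covariance_mb:.1f} MB")
```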
