| HN Mirror

	by minimaxir 974 days ago
	The 768D-sized embeddings compared to OpenAI's 1536D embeddings are actually a feature beyond the smaller index size. In my experience, OpenAI's embeddings are overspecified and do very poorly with cosine similarity out of the box, as they match syntax more than semantic meaning (which is important, as that's the metric for RAG). Ideally you'd want cosine similarities spread across the range [-1, 1] on a variety of data, but in my experience the results fall in [0.6, 0.8].
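
For readers unfamiliar with the metric, here is a minimal sketch of cosine similarity and the [-1, 1] range mentioned above. The function name and the toy 3-D vectors are illustrative assumptions, not real embeddings from any model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings (illustrative values only).
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])   # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])       # unrelated directions
opposite = cosine_similarity([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])      # antiparallel vectors
```

Parallel, orthogonal, and antiparallel vectors land at the top, middle, and bottom of the range respectively; scores bunched into a narrow band such as [0.6, 0.8] leave little room to separate related from unrelated texts.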

2 comments

TeMPOraL 974 days ago

Unless I'm missing something, it should be possible to map out in advance which dimensions represent syntactic aspects, and then downweight or remove them for similarity comparisons. And that map should be a function of the model alone, i.e. fully reusable. Are there any efforts to map out the latent space of ada models like that?
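
A rough sketch of what such downweighting could look like, assuming a per-dimension weight vector were available. Whether such a map exists for any real model is exactly the open question here; the `weights` values and the toy 4-D vectors below are purely hypothetical.

```python
import math

def weighted_cosine(a, b, weights):
    """Cosine similarity after scaling each dimension by its weight."""
    wa = [w * x for w, x in zip(weights, a)]
    wb = [w * y for w, y in zip(weights, b)]
    dot = sum(x * y for x, y in zip(wa, wb))
    norm = math.sqrt(sum(x * x for x in wa)) * math.sqrt(sum(y * y for y in wb))
    return dot / norm

# Hypothetical setup: dimension 0 is assumed to carry a shared "syntactic" signal
# that dominates both vectors; the remaining dimensions differ.
a = [5.0, 1.0, 0.0, 0.0]
b = [5.0, 0.0, 1.0, 0.0]

plain = weighted_cosine(a, b, [1.0, 1.0, 1.0, 1.0])    # all dimensions count equally
masked = weighted_cosine(a, b, [0.0, 1.0, 1.0, 1.0])   # dimension 0 removed
```

With the dominant dimension removed, the two toy vectors go from near-identical to orthogonal, which is the effect the proposal is after.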

karxxm 974 days ago

You wrote "out of the box"; did you find a way to improve this?

teaearlgraycold 974 days ago

You can do PCA or some other dimensionality reduction technique. That’ll reduce computation and improve signal/noise ratio when comparing vectors.

karxxm 974 days ago

Unfortunately this is not feasible with a large amount of words due to the quadratic scaling. But thanks for the response!

minimaxir 973 days ago

Not sure what you mean by large amount of words. You can fit a PCA on millions of vectors relatively performantly, then inference from it is just a matmul.
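
A minimal sketch of that fit-then-matmul workflow, assuming NumPy and using random data as a stand-in for real embeddings. The sizes `n`, `d`, `k` and the helper `transform` are illustrative choices, not anything from a specific library's PCA API.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 10_000, 64, 8               # hypothetical sizes; real embeddings might use d = 768 or 1536
X = rng.normal(size=(n, d))           # random stand-in for an (n, d) embedding matrix

# Fit: center the data, then eigendecompose the d x d covariance of the features.
mean = X.mean(axis=0)
Xc = X - mean
cov = (Xc.T @ Xc) / (n - 1)           # shape (d, d); its size does not grow with n
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues come back in ascending order
components = eigvecs[:, ::-1][:, :k]  # top-k principal directions, shape (d, k)

def transform(vectors):
    """Project vectors onto the top-k components: one subtraction and one matmul."""
    return (vectors - mean) @ components

reduced = transform(X[:5])            # shape (5, k)
```

In this formulation the only matrix that gets decomposed is d x d, so the cost of the fit grows linearly with the number of vectors.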

karxxm 971 days ago

Not true. You need a distance matrix (for classical PCA it's a covariance matrix), which scales quadratically with the number of points you want to compare. If you have 1 million vectors, each creating a float entry in the matrix, you will end up with approximately (10^6)^2 / 2 unique values, which is roughly 2,000 GB of memory.
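
The arithmetic behind that figure can be checked with a short sketch. It assumes 4-byte (float32) entries, which the comment does not state. For comparison it also sizes a covariance matrix taken over the d features, which is d x d and is the form used in the textbook formulation of PCA; the function names are illustrative.

```python
BYTES_PER_FLOAT32 = 4  # assumed entry width; the comment only says "float"

def pairwise_matrix_bytes(n):
    """Bytes for the unique half of an n x n float32 matrix, approximated as n^2 / 2 entries."""
    return n * n // 2 * BYTES_PER_FLOAT32

def feature_covariance_bytes(d):
    """Bytes for a full d x d float32 covariance matrix over the features."""
    return d * d * BYTES_PER_FLOAT32

n = 1_000_000
pairwise_gb = pairwise_matrix_bytes(n) / 1e9          # point-by-point matrix, in GB
cov_1536_mb = feature_covariance_bytes(1536) / 1e6    # feature covariance for 1536-D vectors, in MB
```

Under these assumptions the point-by-point matrix comes to 2,000 GB, matching the comment, while the 1536 x 1536 feature covariance is under 10 MB regardless of how many vectors are used.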
