HN Mirror



	by kujaomega 2904 days ago
	I see a good explanation of the problem and a clear progression through the steps taken, but I see a problem with the approach. To find the most similar result, you have to compute the cosine similarity between the query and every embedding. With more than a billion embeddings of 1k dimensions each, that will take a lot of time. How would you solve this problem? By clustering the embeddings?
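To make the cost concrete, here is a minimal sketch of the exhaustive search described above, using only the standard library. The function names (`cosine_similarity`, `most_similar`, `multiply_adds_per_query`) are made up for illustration and are not from the article under discussion.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings):
    """Exhaustive search: one similarity computation per stored vector."""
    best_index, best_score = -1, -math.inf
    for i, vec in enumerate(embeddings):
        score = cosine_similarity(query, vec)
        if score > best_score:
            best_index, best_score = i, score
    return best_index, best_score


def multiply_adds_per_query(n_vectors, dims):
    """Dot-product work for one query: every dimension of every vector."""
    return n_vectors * dims
```

With a billion vectors of 1,000 dimensions, `multiply_adds_per_query(10**9, 1000)` comes to 10**12 multiply-adds for the dot products of a single query, which is why a scan over everything does not scale.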

2 comments

bunderbunder 2904 days ago

There are off-the-shelf libraries like ANNOY and nmslib that index the vectors in a way that allows fast (possibly approximate) nearest-neighbor searches.
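As a rough illustration of what "index the vectors" and "approximate" can mean, here is a toy sketch that buckets vectors by the signs of a few random projections and only scores the query against its own bucket. This is not the API or the internal algorithm of either library; the class `HyperplaneIndex` and its parameters are hypothetical, written only to show the speed-for-exactness trade.

```python
import math
import random
from collections import defaultdict


def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class HyperplaneIndex:
    """Toy approximate index: bucket vectors by the signs of random projections."""

    def __init__(self, dims, n_planes=8, seed=0):
        rng = random.Random(seed)
        # Each random hyperplane contributes one bit to a vector's signature.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dims)]
                       for _ in range(n_planes)]
        self.buckets = defaultdict(list)
        self.vectors = []

    def _signature(self, vec):
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0.0
                     for plane in self.planes)

    def add(self, vec):
        self.vectors.append(vec)
        self.buckets[self._signature(vec)].append(len(self.vectors) - 1)

    def query(self, vec):
        """Best (index, similarity) within the query's bucket, or None if it is empty."""
        best = None
        for i in self.buckets.get(self._signature(vec), []):
            score = _cosine(vec, self.vectors[i])
            if best is None or score > best[1]:
                best = (i, score)
        return best
```

Only the vectors sharing the query's bucket are scored, so a query touches a fraction of the data. The price is that a true nearest neighbor sitting in a different bucket is missed, which is the "possibly approximate" part.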

bglazer 2904 days ago

This is generally called the k-Nearest Neighbors problem. You should check out the various data structures for doing this, like the ball tree and kd tree.
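For anyone who has not seen one, here is a minimal kd tree sketch in plain Python, assuming Euclidean distance. The helper names `build_kdtree` and `nearest` are made up for this example. If the vectors are first scaled to unit length, ranking by Euclidean distance gives the same order as ranking by cosine similarity, since the squared distance equals 2 minus twice the cosine.

```python
import math


def build_kdtree(points, depth=0):
    """Recursively split the points on the median of one coordinate per level."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return (distance, point) for the stored point closest to the query."""
    if node is None:
        return best
    dist = math.dist(query, node["point"])
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Only cross the splitting plane if a closer point could lie beyond it.
    if abs(diff) < best[0]:
        best = nearest(far, query, best)
    return best
```

For example, with `tree = build_kdtree([(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)])`, calling `nearest(tree, (9, 2))` returns the point `(8, 1)`. The pruning step is what saves work, and it is known to prune less and less as the number of dimensions grows, so for 1k-dimensional embeddings treat this as an illustration of the data structure rather than a recommendation.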
