Hacker News
by kujaomega 2904 days ago
I see a good explanation of the problem and a good progression through the steps taken, but I see a problem with the approach. To get the most similar result, you have to compute the cosine similarity between the query and every embedding. With more than a billion embeddings of 1k dimensions each, that will take a lot of time. How would you solve this problem? By clustering the embeddings?
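To make the cost concrete, here is a minimal sketch of the exhaustive search described above, in plain Python. The names `cosine_similarity` and `brute_force_top_k` are made up for illustration and are not from the article. Every query touches every stored vector, so with a billion embeddings of 1k dimensions each that is on the order of 10^12 multiply-adds per query.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, embeddings, k=1):
    """Score the query against every embedding: O(N * d) work per query."""
    scored = ((cosine_similarity(query, vec), idx)
              for idx, vec in enumerate(embeddings))
    # Returns (similarity, index) pairs, best match first.
    return heapq.nlargest(k, scored)
```

This is fine for a few thousand vectors; the linear scan is the part that stops scaling.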
2 comments

There are off-the-shelf libraries like ANNOY and nmslib that index the vectors in a way that allows for fast (possibly approximate) nearest-neighbor searches.
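To give a feel for what "index the vectors" can mean, here is a toy sketch of one approximate technique, random-hyperplane locality-sensitive hashing. It does not use the API of either library mentioned above and is not meant to describe their internals. The class name `HyperplaneLSH` and its parameters are assumptions made for illustration. Vectors that fall on the same side of every random hyperplane share a bucket, and a query only scores the vectors in its own bucket.

```python
import math
import random
from collections import defaultdict


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class HyperplaneLSH:
    """Toy index: vectors with the same sign pattern share a bucket."""

    def __init__(self, dim, num_planes=8, seed=0):
        rng = random.Random(seed)
        # Each plane is a random direction; only the sign of the dot product is kept.
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(num_planes)]
        self.buckets = defaultdict(list)
        self.vectors = []

    def _signature(self, vec):
        return tuple(sum(p * x for p, x in zip(plane, vec)) >= 0.0
                     for plane in self.planes)

    def add(self, vec):
        idx = len(self.vectors)
        self.vectors.append(vec)
        self.buckets[self._signature(vec)].append(idx)
        return idx

    def query(self, vec, k=1):
        # Only the query's own bucket is scanned, so true neighbors that
        # landed in another bucket are missed: that is the approximation.
        candidates = self.buckets.get(self._signature(vec), [])
        scored = [(cosine_similarity(vec, self.vectors[i]), i)
                  for i in candidates]
        return sorted(scored, reverse=True)[:k]
```

More planes mean smaller buckets and faster queries but more missed neighbors. Real libraries manage that trade-off with far more care than this sketch does.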
This is generally called the k-Nearest Neighbors problem. You should check out the various data structures for doing this, like the ball tree and the k-d tree.
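For reference, a k-d tree nearest-neighbor lookup can be sketched in a few lines of plain Python. The function names and the dict-based node layout are choices made for this example only. It uses Euclidean distance; for cosine similarity you can normalize the vectors to unit length first, since for unit vectors the squared distance equals 2 - 2·cos.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_kd_tree(points, depth=0):
    """Recursively split the points on one axis per level, at the median."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kd_tree(points[:mid], depth + 1),
        "right": build_kd_tree(points[mid + 1:], depth + 1),
    }


def nearest(node, query, best=None):
    """Return the stored point closest to query (exact, Euclidean)."""
    if node is None:
        return best
    point, axis = node["point"], node["axis"]
    if best is None or squared_distance(query, point) < squared_distance(query, best):
        best = point
    diff = query[axis] - point[axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    # Search the far side only if the splitting plane is closer than the
    # best match so far; this pruning is where the speedup comes from.
    if diff * diff < squared_distance(query, best):
        best = nearest(far, query, best)
    return best
```

In low dimensions the pruning skips most of the tree. With embeddings in the 1k-dimension range it tends to prune very little, so exact trees drift back toward a full scan, which is why approximate indexes usually come up for this kind of data.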