| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by softwaredoug 970 days ago

I implemented this recently in C as a numpy extension[1], for. fun. Even had a vectorized solution going.

You'll get diminishing returns on recall pretty fast. There's actually a theorem that tracks this - Jordan-Lindenstrauss lemma[2] if you're interested.

As I mention in a talk I gave[3], it can work if you're going to rerank anyway. And whatever vector search thing isn't the main ranking signal. And you don't have that much data. It's also easy to update, as the hashes are non-parametric (they don't depend on the data). Constantly updating vector search systems is often neglected in benchmarks.

The lack of data-dependency, however is the main problem. Vector spaces are lumpy. You can see this in the distribution of human beings on the surface of the earth - postal codes and area codes vary from small to huge - random hashes, like a grid, wouldn't let you accurately map out the distribution of all the people or clump them close to their actual nearest neighbors. Manhattan is not rural Alaska... and more dimensions, more data points, the more ways it has to be lumpy.

Annoy, actually, builds on these hashes, by creating many trees of such hashes, and then finds a split in the left and right. Then in creates a forest of such trees. So its essentially a forest of random hash trees with data dependency.

Hope that helps.

1 - https://github.com/softwaredoug/np-sims

2 - https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...

3 - https://softwaredoug.com/blog/2023/09/05/vector-search-the-h...

1 comments

zX41ZdbW 970 days ago

I've implemented the same approach (non-parametric), and it worked well to speed up the search on my experiment with 384-dimensional embeddings of 30 million Hacker News comments.

I presented it in my talk, where you can see the speed comparison: https://www.youtube.com/watch?v=hGRNcftpqAk

link