Hacker News new | ask | show | jobs
by naillo 970 days ago
This is so overcomplicated. You don't need an actual tree that's just a mental model. All you need to do is generate N random hyperplanes (defined via it's normal i.e. generate N normalized vectors in the dimension of your choosing). Then you generate a hash via these by going normal by normal and checking if your query vector `q` is on the left or right side of the normal by checking the sign of the dot product. I.e. `hash = [sign(dot(q, n1)), sign(dot(q, n2)), ...]`.

So given a query you find that hash and you get all the vectors with the same hash (using your favorite hash to array data structure, i.e. a dict) and you can further refine the search within that bucket. It's like 10 rows of code.

6 comments

Annoy author here. What you're describing is Locality Sensitive Hashing (LSH) which is something I spent a lot of time trying back in 2009-2012 but never got it working. It has some elegant theoretical properties, but empirically it always has terrible performance. The reason I don't think it works well is that data often lies near a lower-dimensionality manifold that may have some sort of shape. LSH would "waste" most splits (because it doesn't understand the data distribution) but using trees (and finding splits such that the point set is partitioned well) ends up "discovering" the data distribution better.

(but HNSW is generally much better than this tree paritioning scheme)

The solution to that is balanced LSH, which ensures each hyperplane divides the data evenly by moving its offset, if I recall.
Faiss GPU author here. Vector binarization via LSH for dense vector, high-dimensional (say 30 - 2000ish dimensions) tends to have poor recall and quality of results. As the Annoy author implied here as well, there's a lot of literature on LSH but that's because it is easier to analyze mathematically and prove theorems regarding it, not that it produces good quality results compared to graph-based (HNSW etc) or different geometric cell probe-based (IVF etc) methods. LSH is not typically used in industrial scale settings for these reasons.

However, for extremely high dimensional sparse vectors (think tens of thousands to millions of dimensions), LSH does remain the only practical approximate technique to my knowledge. Online settings (where you need to insert vectors quickly into the database) may prefer LSH as well as the overhead is smaller than updating a graph or finding the IVF cell in which to place the vector.

Below ~30 dimensions, exact non-brute force methods (like k-D trees or binary space partitioning methods) work ok, but they have high overheads above that because your true nearest neighbor is still highly likely to lie on either side of any dividing hyperplane (it doesn't partition the dataset well; everything is usually "near" everything else in practice in high dimensions).

> All you need to do

It seems to me that your approach is making some assumptions about how the data is distributed that aren't necessarily guaranteed to hold in practice, and a tree has a chance of being robust to at least some of those. For example, whatever random hyperplane you choose has to be informed by your data. Suppose you make a particularly bad choice for one of those hyperplanes, with your approach you'll be stuck with that for all time and have a bit in your hash that's biased to heck, but in the tree approach you only suffer when you hash your way down to the lousy node, and all other paths will be unaffected.

Sounds like you're describing LSH (locality sensitive hashing).
I implemented this recently in C as a numpy extension[1], for. fun. Even had a vectorized solution going.

You'll get diminishing returns on recall pretty fast. There's actually a theorem that tracks this - Jordan-Lindenstrauss lemma[2] if you're interested.

As I mention in a talk I gave[3], it can work if you're going to rerank anyway. And whatever vector search thing isn't the main ranking signal. And you don't have that much data. It's also easy to update, as the hashes are non-parametric (they don't depend on the data). Constantly updating vector search systems is often neglected in benchmarks.

The lack of data-dependency, however is the main problem. Vector spaces are lumpy. You can see this in the distribution of human beings on the surface of the earth - postal codes and area codes vary from small to huge - random hashes, like a grid, wouldn't let you accurately map out the distribution of all the people or clump them close to their actual nearest neighbors. Manhattan is not rural Alaska... and more dimensions, more data points, the more ways it has to be lumpy.

Annoy, actually, builds on these hashes, by creating many trees of such hashes, and then finds a split in the left and right. Then in creates a forest of such trees. So its essentially a forest of random hash trees with data dependency.

Hope that helps.

1 - https://github.com/softwaredoug/np-sims

2 - https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_...

3 - https://softwaredoug.com/blog/2023/09/05/vector-search-the-h...

I've implemented the same approach (non-parametric), and it worked well to speed up the search on my experiment with 384-dimensional embeddings of 30 million Hacker News comments.

I presented it in my talk, where you can see the speed comparison: https://www.youtube.com/watch?v=hGRNcftpqAk

It really isn't. That's like saying bisection is overly complicated.