Hacker News new | ask | show | jobs
by arnoldoMuller 5318 days ago
VP trees are well known but not necessarily the best approach for large datasets or high dimensional objects. I am creating a startup that offers the fastest similarity search engine built so far (10X faster than LSH from MIT). See more here: http://simmachines.com
2 comments

You probably can't go into any details, but can you describe the rough approach R-01 is taking? Is it based on some existing paper or something completely new?
He has a white paper (understandably vague) on the site: http://www.simmachines.com/resources/whitePaper.pdf

It does sound interesting. These metric-space based methods usually make poor use of caches at all levels (random access). So while they may indeed access only <10% of objects in the index, they can nevertheless be slower than a simple, cache-happy linear scan (given a simple enough distance function).

I wonder how that last example -- 120M strings in RAM, 5-NN search -- compares against an optimized linear scan?

EDIT: at the end of the white paper there are also several references to the author's academic articles. Apparently the method is based on speeding up sequential scans by compression ("sketches"), so there :)

I don't understand how this is related to the article. The article has to do with searching spatial data structures, such as you would make for partitioning the space of a large scene in a graphics or mapping application (or game, etc.). Maybe I'm wrong and someone could enlighten me....
It doesn't have to be for spatial data, that's just the easiest way to get N-dimensional data.

Basically, every variable in your problem becomes a dimension. You could think of grocery shopping as a 10-dimensional problem involving cost, dietary restrictions, nutritional value (we'll say there are 7 for this example) and brand preference. You could potentially use a VP tree to search for all products that fall within a 10-D sphere that encodes your product tolerance.

3D spatial problems are just three dimensions. True you can do that with VP but you can do that with octtree too.

Imagine beyond 3D like clustering of friends or likes or recommendations or all those other N-dimensional data-points and you'll see the overlap.

Right. There are plenty of cases where you ONLY have the distance (or similarity) between two points and don't actually have an n-dimensional point in space.