
by Radim 5318 days ago

He has a white paper (understandably vague) on the site: http://www.simmachines.com/resources/whitePaper.pdf

It does sound interesting. Metric-space methods like this usually make poor use of caches at every level, because walking the index means random memory access. So while they may indeed touch fewer than 10% of the objects in the index, they can still be slower than a simple, cache-friendly linear scan (given a cheap enough distance function).
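For reference, this is roughly the baseline I have in mind: a minimal sketch of a brute-force k-NN linear scan. The names `hamming` and `linear_scan_knn` are mine, and the padded Hamming distance is only a stand-in for "a simple enough distance function", not anything taken from the white paper.

```python
import heapq

def hamming(a: str, b: str) -> int:
    """Positions where a and b differ, plus the length difference."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def linear_scan_knn(query, items, k=5, dist=hamming):
    """Return the k items closest to query, nearest first.

    One sequential pass over the data; heapq.nsmallest keeps only
    k candidates around instead of sorting everything.
    """
    return heapq.nsmallest(k, items, key=lambda s: dist(query, s))

if __name__ == "__main__":
    data = ["zzzz", "abce", "abcd", "xbcd", "abxx"]
    print(linear_scan_knn("abcd", data, k=2))
```

Pure Python obviously won't show the cache effect; the point is only the access pattern. Every object is visited, but strictly in order, which is what the hardware prefetcher likes.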

I wonder how that last example -- 120M strings in RAM, 5-NN search -- compares against an optimized linear scan?

EDIT: at the end of the white paper there are also several references to the author's academic articles. Apparently the method is based on speeding up sequential scans by compression ("sketches"), so there :)