
by BenoitP 3052 days ago

> KNN works fine on high-dimensional text. From something simple as Hamming distance on binary tokens, to euclidean distance on TFIDF, to cosine distance on 900-dimensional word vector aggregates.

> This is why you "fit" something like a K-D tree during training.

I would not choose a K-D tree for that. The curse of dimensionality makes K-D trees all but useless as the number of dimensions goes up: the number of partitions you have to inspect explodes, and the search degrades toward a linear scan.
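To make the partition explosion concrete, here is a minimal sketch of a K-D tree in plain Python that counts how many nodes an exact nearest-neighbour query has to inspect. Everything in it is my own illustration (the names `build`, `nearest`, `avg_visited`, the uniform random data, the sizes), not something from the parent comment or from any library.

```python
import random

def build(points, depth=0):
    """Build a k-d tree as nested tuples: (point, left, right, axis)."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build(points[:mid], depth + 1),
            build(points[mid + 1:], depth + 1),
            axis)

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(node, query, best, stats):
    """Exact 1-NN search. Returns (squared distance, point).
    stats["visited"] counts every node that gets inspected."""
    if node is None:
        return best
    point, left, right, axis = node
    stats["visited"] += 1
    d = sq_dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, best, stats)
    # Cross the splitting plane only if the far side could hold a closer point.
    if diff * diff < best[0]:
        best = nearest(far, query, best, stats)
    return best

def avg_visited(dim, n_points=1000, n_queries=20, seed=0):
    """Average number of nodes inspected per query on uniform random data
    (an assumed toy workload, chosen only for illustration)."""
    rng = random.Random(seed)
    pts = [tuple(rng.random() for _ in range(dim)) for _ in range(n_points)]
    tree = build(pts)
    total = 0
    for _ in range(n_queries):
        q = tuple(rng.random() for _ in range(dim))
        stats = {"visited": 0}
        nearest(tree, q, None, stats)
        total += stats["visited"]
    return total / n_queries
```

Calling `avg_visited(d)` for a few values of `d` (say 2, 8, 20) on the same number of points should show the count climbing from a small fraction of the tree toward most of it, because the pruning test on the splitting plane almost never succeeds once distances along a single axis are small compared with the distance to the best neighbour found so far.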

Locality-Sensitive Hashing tackles this explosion, but at the cost of recall. 80% recall is quite easy to reach, though; getting close to 100% becomes prohibitively expensive. That can be good enough for an approximate KNN.
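A rough sketch of that tradeoff, assuming the random-hyperplane flavour of LSH for cosine similarity (one of several LSH families, picked here because it matches the word-vector case). The function names, the Gaussian toy data, and the parameters `n_tables` and `n_bits` are all hypothetical choices of mine, and the recall you get from it says nothing about real text data.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / ((dot(a, a) ** 0.5) * (dot(b, b) ** 0.5))

def make_planes(dim, n_tables, n_bits, seed=0):
    """One set of n_bits random hyperplanes per hash table."""
    rng = random.Random(seed)
    return [[[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
            for _ in range(n_tables)]

def signature(vec, planes):
    """Hash = which side of each hyperplane the vector falls on."""
    return tuple(dot(vec, p) >= 0 for p in planes)

def build_index(vectors, planes_per_table):
    tables = []
    for planes in planes_per_table:
        buckets = {}
        for i, v in enumerate(vectors):
            buckets.setdefault(signature(v, planes), []).append(i)
        tables.append(buckets)
    return tables

def candidates(query, tables, planes_per_table):
    """Union of the buckets the query lands in, one bucket per table."""
    found = set()
    for buckets, planes in zip(tables, planes_per_table):
        found.update(buckets.get(signature(query, planes), []))
    return found

def recall_at_1(n_tables, n_bits=8, dim=32, n_points=500, n_queries=50, seed=0):
    """Fraction of queries whose true cosine nearest neighbour is a candidate."""
    rng = random.Random(seed + 1)
    data = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_points)]
    # Queries are noisy copies of data points, so each has a genuinely close neighbour.
    queries = [[x + rng.gauss(0, 0.3) for x in rng.choice(data)]
               for _ in range(n_queries)]
    planes = make_planes(dim, n_tables, n_bits, seed)
    tables = build_index(data, planes)
    hits = 0
    for q in queries:
        true_nn = max(range(n_points), key=lambda i: cosine(q, data[i]))
        if true_nn in candidates(q, tables, planes):
            hits += 1
    return hits / n_queries
```

Adding tables can only grow the candidate set, so `recall_at_1` never goes down as `n_tables` increases, but every extra table costs memory and another bucket lookup per query, and the candidate set you then have to rank exactly keeps growing. That is the knob you end up turning to buy the last few points of recall.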