Hacker News new | ask | show | jobs
by 0800 3053 days ago
> As said earlier, it is lazy learning algorithm and therefore requires no training prior to making real time predictions. This makes the KNN algorithm much faster than other algorithms that require training e.g SVM, linear regression, etc.

Linear regression is way faster than KNN as the dataset grows beyond toy data. Both training and especially testing. In practical applications, test speed often trumps train speed.

> The KNN algorithm doesn't work well with high dimensional data because with large number of dimensions, it becomes difficult for the algorithm to calculate distance in each dimension.

KNN works fine on high-dimensional text. From something simple as Hamming distance on binary tokens, to euclidean distance on TFIDF, to cosine distance on 900-dimensional word vector aggregates.

> There are only two parameters required to implement KNN i.e. the value of K and the distance function (e.g. Euclidean or Manhattan etc.)

Also implement distance weighing (you probably want to weigh the 1-th nearest neighbor label higher than the 5-th nearest neighbor label).

> The KNN algorithm has a high prediction cost for large datasets. This is because in large datasets the cost of calculating distance between new point and each existing point becomes higher.

This is why you "fit" something like a K-D tree during training.

> Finally, the KNN algorithm doesn't work well with categorical features since it is difficult to find the distance between dimensions with categorical features.

Hamming distance works fine on one-hot encoded categorical features. If not, embed/reduce the dimensionality. If not, use feature selection first and do KNN on top 10%-20% features. Remember that you don't have to use the same distance measure for every feature column.

> If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.

You can skip this step (and feature selection) by learning an additional weight for each feature to multiply with before distance measure. It is rare for each feature to contribute proportionately to the target.

2 comments

> KNN works fine on high-dimensional text. From something simple as Hamming distance on binary tokens, to euclidean distance on TFIDF, to cosine distance on 900-dimensional word vector aggregates.

> This is why you "fit" something like a K-D tree during training.

I would not choose a K-D tree for that. The curse of dimensionality makes K-D trees prohibitively useless as dimensions go up. The number of partitions you have to inspect explodes.

Locality Sensitive Hashing tackles this explosion, but with a tradeoff on recall power. 80% recall is quite easy to reach, though. Being near 100% will be prohibitively expensive. This could be good enough for an approximated KNN.

All of your claims call for citations. Its not that they are untrue. But they are not as sweepingly true as they might seem to an uninitiated reader. For example, you can try KD tree on a 50K dimension data set and judge for yourself.
I just tried this with 50k dimensions and 2 rows and it worked fine.