by hansvm
1174 days ago
And the biggest caveats IMO:

- Like any other probabilistic clustering algorithm that gets its speed boost from just ignoring large chunks of the data (like MiniBatch KMeans), this will not take outliers into account very well or very often.

- They're advertising this as not just a better KMedoids implementation, but a better drop-in replacement for KMeans. For generic clustering tasks, sure, fine, whatever, but the metric the new paper is trying to optimize is different from the one KMeans uses, so if KMeans has the right metric for a given task then you'll be switching to a (maybe) faster algorithm that just computes the wrong result.

The easiest example that comes to mind is a dataset of non-intersecting hollow spheres. KMeans will spit out (for some choice of NClusters) the sphere centers, and KMedoids will spit out sphere boundary points, which hurts performance for points on the far side of the sphere and potentially lets a classification jump from one sphere to another.

Both of those things are just qualities you may or may not want, so KMedoids may be better purely because it has those biases and KMeans doesn't, but it's not totally uncommon to just want cluster centers minimizing some error and not care how you get there or how explainable they are (the bolt vector quantization algorithm comes to mind), where KMedoids would just be the wrong choice.
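To make the hollow-sphere point concrete, here is a minimal sketch using a 2-D ring as a stand-in for one hollow sphere. It is not the paper's algorithm, just a brute-force centroid and medoid for a single cluster; the helper names (`ring_points`, `centroid`, `medoid`) are made up for illustration.

```python
import math

def ring_points(n, radius=1.0, center=(0.0, 0.0)):
    """Return n evenly spaced points on a circle (a 2-D 'hollow sphere')."""
    cx, cy = center
    return [
        (cx + radius * math.cos(2 * math.pi * i / n),
         cy + radius * math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]

def centroid(points):
    """KMeans-style center: the coordinate-wise mean, which need not be a data point."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def medoid(points):
    """KMedoids-style center: the data point with the smallest total distance to all points."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))

points = ring_points(60)
c = centroid(points)
m = medoid(points)

# Distance of each "center" from the true middle of the ring at (0, 0).
print("centroid distance from middle:", math.hypot(*c))
print("medoid distance from middle:  ", math.hypot(*m))
```

The centroid lands (up to floating-point error) in the empty middle of the ring, while the medoid has to be one of the ring points, so it sits a full radius away from the middle. By symmetry every ring point is an equally good medoid here, so which one gets picked is arbitrary.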
Just thought it might be worth pointing out that there are valid use cases for clustering where each cluster is represented by a real data point: consider pictures, where for inspectability it helps to see a real picture instead of some blurry interpolated mess.
Would you mind providing a reference for the bolt vector quantization algorithm? It sounds interesting.