Hacker News new | ask | show | jobs
by Scene_Cast2 894 days ago
Although K-means clustering is often the correct approach given time crunch and code complexity constraints, I don't like how it's hard to extend and how it's not principled. By not principled, I mean that it feels more like an algorithm (that happens to optimize) rather than an explicit optimization with an explicit loss function. And I found that in practice, modifying the distance function to anything more interesting doesn't work.
2 comments

K-means clustering is very well principled actually as an instance of the expectation maximization algorithm with "hard" cluster assignment. Turns out it's just good old maximum likelihood:

https://alliance.seas.upenn.edu/~cis520/dynamic/2022/wiki/in...

There are two issues I had in mind. One is that the link between argmin and the algorithm (k-means in this case) feels too "tied to the algorithm" and less explicit than in other algorithms.

The other is that in practice, you typically want to bring your true optimization objective as close as possible to what the algorithm is optimizing, and what k-means is optimizing for is usually pretty far removed. Even small tweaks (lets say, augmenting data with some sparse labels, or modifying the loss function weight based on some aspect of embedding values) are difficult to do with k-means.

Isn’t it also formally equivalent to a Gaussian mixture model?

https://timydaley.github.io/kmeans_gmm/gmm_vs_kmeans.html