Hacker News new | ask | show | jobs
by brianleb 886 days ago
So after re-reading your comment a few times, I am left with this thought: either you don't understand what k-means clustering is, or I don't understand what k-means clustering is. I wouldn't describe myself as a machine learning expert, but I have taken some grad level classes in statistics, analytics, and methods like this/related to this.

So my question is... could you elaborate?

3 comments

Not GP, but I understood their question as follows.

Assume you collect some kindergartners and top NBA players into a room and collect their heights. Now say you pass these to two hapless grad students and ask them to perform K-means clustering.

Suppose one of the grad students knew the composition of the people you measured and can guess these height should clump into 2 nice clusters. The other student who doesn't know the composition of the class - what should they guess K to be?

I understood the GP's comment to refer to the state of the second grad student. How useful is K-means clustering without knowing K in advance?

>I understood the GP's comment to refer to the state of the second grad student. How useful is K-means clustering without knowing K in advance?

There are several heuristics for this. Googling I see that the elbow method, average sillhouette method and gap statistic method is the most used.

I think you could play around with your own heuristics as well. Simple KDE plots showing the amount of peaks. Maybe, say the variance between clusters should be greater than the variance inside any cluster could maybe work. (Edit: this seems to be the main point of the average sillhouette method).

The first problem is picking k. The second problem is the definition of distance or, equivalently, the uniformity of the space. Naively using the Euclidean distance in an embedding where similarity is non-uniform leads to bad outcomes. This problem is solved by learning a uniform embedding, and this is much harder than running k-means.

k-means assumes these hard parts of the problem are taken care of and offers a trivial solution to the rest. Thanks for the help, I'll cluster it myself.

You have to choose the number of clusters, before using k-means.

Imagine that you have a dataset, where you think there are likely meaningful clusters, but you don't know how many, especially where it's many-dimensioned.

If you pick a k that is too small, you lump unrelated points together.

If k is too large, your meaningful clusters will be fragmented/overfitted.

There are some algorithms that try to estimate the number of clusters or try to find the k with the best fit to the data to make up for this.

Couldn’t you make some educated guesses and then stop when you arrive at a K that gives you meaningful clusters that are neither too high level nor too atomized.
Probably not the best in terms of efficiency.

Easier just to deliberately overshoot (with a too high k) and then merge any clusters with too much overlap.