by bagels 886 days ago
You have to choose the number of clusters before running k-means.

Imagine you have a dataset that you think contains meaningful clusters, but you don't know how many, especially when the data is high-dimensional.

If you pick a k that is too small, you lump unrelated points together.

If k is too large, your meaningful clusters will be fragmented/overfitted.

Some algorithms compensate for this by estimating the number of clusters, or by searching for the k that best fits the data.
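
A rough sketch of the "search for the k with the best fit" idea, assuming the mean silhouette score as the fit measure (my choice, not the only option). The toy data, the farthest-first seeding, and the candidate range 2..6 are all made up for illustration; real code would more likely use a library implementation.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-first seeding."""
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from every center chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        new = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new == labels:
            break  # assignments stopped changing
        labels = new
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

def silhouette(points, labels):
    """Mean silhouette score; needs at least two clusters. Higher is better."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)   # cohesion
        b = min(sum(math.dist(p, q) for q in other) / len(other)  # separation
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy data (assumption): three tight, well-separated blobs in 2D.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx, dy in offsets]

# Score each candidate k and keep the best one.
scores = {k: silhouette(points, kmeans(points, k)[0]) for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On data this clean the scan settles on k = 3. On real, high-dimensional data the scores are often much flatter, so treat the winner as a hint rather than an answer.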

1 comments

Couldn’t you make some educated guesses and then stop when you arrive at a k that gives you meaningful clusters that are neither too high-level nor too atomized?
Probably not the best in terms of efficiency.

It's easier to deliberately overshoot (with a k that is too high) and then merge any clusters that overlap too much.
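
A minimal sketch of the overshoot-then-merge approach, with some loud assumptions: "too much overlap" is approximated as "centroids closer than `merge_dist`", which is a hypothetical threshold you would have to tune, and the k-means here is a bare-bones version with deterministic seeding on made-up toy data.

```python
import math

def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-first seeding."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        new = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centers

def merge_close(labels, centers, merge_dist):
    """Merge clusters whose centroids lie within merge_dist (hypothetical threshold)."""
    parent = list(range(len(centers)))  # union-find over cluster ids

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) < merge_dist:
                parent[find(i)] = find(j)

    # Renumber the surviving groups as 0, 1, 2, ...
    roots = sorted({find(l) for l in labels})
    remap = {r: n for n, r in enumerate(roots)}
    return [remap[find(l)] for l in labels]

# Toy data (assumption): three tight, well-separated blobs in 2D.
offsets = [(0, 0), (1, 0), (0, 1), (1, 1), (0.5, 0.5)]
points = [(cx + dx, cy + dy)
          for cx, cy in [(0, 0), (10, 0), (0, 10)]
          for dx, dy in offsets]

labels, centers = kmeans(points, k=6)                 # deliberately too many clusters
merged = merge_close(labels, centers, merge_dist=3.0)
print(len(set(labels)), "->", len(set(merged)))
```

Centroid distance is a crude stand-in for overlap. It chains merges (A near B, B near C puts all three together), and it ignores cluster spread, so a fuller version would probably compare distance against each cluster's own scatter.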