| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by bagels 886 days ago

You have to choose the number of clusters, before using k-means.

Imagine that you have a dataset, where you think there are likely meaningful clusters, but you don't know how many, especially where it's many-dimensioned.

If you pick a k that is too small, you lump unrelated points together.

If k is too large, your meaningful clusters will be fragmented/overfitted.

There are some algorithms that try to estimate the number of clusters or try to find the k with the best fit to the data to make up for this.

1 comments

keenmaster 886 days ago

Couldn’t you make some educated guesses and then stop when you arrive at a K that gives you meaningful clusters that are neither too high level nor too atomized.

link

yarg 886 days ago

Probably not the best in terms of efficiency.

Easier just to deliberately overshoot (with a too high k) and then merge any clusters with too much overlap.

link