Cluster stability is a good heuristic that should be more well-known:
For a given k:
for n=30 or 100 or 300 trials:
subsample 80% of the points
cluster them
compute Fowlkes-Mallow score (available in sklearn) of the subset to the original, restricting only to the instances in the subset (otherwise you can't compute it)
output the average f-m score
This essentially measure how "stable" the clusters are. The Fowlkes-Mallow score decreases when instances pop over to other clusters in the subset.
If you do this and plot the average score versus k, you'll see a sharp dropoff at some point. That's the maximal plausible k.
edit: Here's code
def stability(Z, k):
kmeans = KMeans(n_clusters=k, n_init="auto")
kmeans.fit(Z)
scores = []
for i in range(100):
# Randomly select 80% of the data, with replacement
# TODO: without
idx = np.random.choice(Z.shape[0], int(Z.shape[0]*0.8))
kmeans2 = KMeans(n_clusters=k, n_init="auto")
kmeans2.fit(Z[idx])
# Compare the two clusterings
score = fowlkes_mallows_score(kmeans.labels_[idx], kmeans2.labels_)
scores.append(score)
scores = np.array(scores)
return np.mean(scores), np.std(scores)
For a given k:
This essentially measure how "stable" the clusters are. The Fowlkes-Mallow score decreases when instances pop over to other clusters in the subset.If you do this and plot the average score versus k, you'll see a sharp dropoff at some point. That's the maximal plausible k.
edit: Here's code