|
|
|
|
|
by bravura
894 days ago
|
|
Cluster stability is a good heuristic that should be more well-known: For a given k: for n=30 or 100 or 300 trials:
subsample 80% of the points
cluster them
compute Fowlkes-Mallow score (available in sklearn) of the subset to the original, restricting only to the instances in the subset (otherwise you can't compute it)
output the average f-m score
This essentially measure how "stable" the clusters are. The Fowlkes-Mallow score decreases when instances pop over to other clusters in the subset.If you do this and plot the average score versus k, you'll see a sharp dropoff at some point. That's the maximal plausible k. edit: Here's code def stability(Z, k):
kmeans = KMeans(n_clusters=k, n_init="auto")
kmeans.fit(Z)
scores = []
for i in range(100):
# Randomly select 80% of the data, with replacement
# TODO: without
idx = np.random.choice(Z.shape[0], int(Z.shape[0]*0.8))
kmeans2 = KMeans(n_clusters=k, n_init="auto")
kmeans2.fit(Z[idx])
# Compare the two clusterings
score = fowlkes_mallows_score(kmeans.labels_[idx], kmeans2.labels_)
scores.append(score)
scores = np.array(scores)
return np.mean(scores), np.std(scores)
|
|