| HN Mirror

Cluster stability is a good heuristic that should be more well-known:

For a given k:

  for n=30 or 100 or 300 trials:
    subsample 80% of the points
    cluster them
    compute Fowlkes-Mallow score (available in sklearn) of the subset to the original, restricting only to the instances in the subset (otherwise you can't compute it)
  output the average f-m score

This essentially measure how "stable" the clusters are. The Fowlkes-Mallow score decreases when instances pop over to other clusters in the subset.

If you do this and plot the average score versus k, you'll see a sharp dropoff at some point. That's the maximal plausible k.

edit: Here's code

  def stability(Z, k):
    kmeans = KMeans(n_clusters=k, n_init="auto")
    kmeans.fit(Z)
    scores = []
    for i in range(100):
        # Randomly select 80% of the data, with replacement
        # TODO: without
        idx = np.random.choice(Z.shape[0], int(Z.shape[0]*0.8))
        kmeans2 = KMeans(n_clusters=k, n_init="auto")
        kmeans2.fit(Z[idx])

        # Compare the two clusterings
        score = fowlkes_mallows_score(kmeans.labels_[idx], kmeans2.labels_)
        scores.append(score)
    scores = np.array(scores) 
    return np.mean(scores), np.std(scores)