Hacker News new | ask | show | jobs
by sonium 894 days ago
I did something similar (but not for documents) but I’m struggling with selecting the optimal number of clusters.
3 comments

Cluster stability is a good heuristic that should be more well-known:

For a given k:

  for n=30 or 100 or 300 trials:
    subsample 80% of the points
    cluster them
    compute Fowlkes-Mallow score (available in sklearn) of the subset to the original, restricting only to the instances in the subset (otherwise you can't compute it)
  output the average f-m score
This essentially measure how "stable" the clusters are. The Fowlkes-Mallow score decreases when instances pop over to other clusters in the subset.

If you do this and plot the average score versus k, you'll see a sharp dropoff at some point. That's the maximal plausible k.

edit: Here's code

  def stability(Z, k):
    kmeans = KMeans(n_clusters=k, n_init="auto")
    kmeans.fit(Z)
    scores = []
    for i in range(100):
        # Randomly select 80% of the data, with replacement
        # TODO: without
        idx = np.random.choice(Z.shape[0], int(Z.shape[0]*0.8))
        kmeans2 = KMeans(n_clusters=k, n_init="auto")
        kmeans2.fit(Z[idx])

        # Compare the two clusterings
        score = fowlkes_mallows_score(kmeans.labels_[idx], kmeans2.labels_)
        scores.append(score)
    scores = np.array(scores) 
    return np.mean(scores), np.std(scores)
A simple metric for that is the Silhouette

https://en.m.wikipedia.org/wiki/Silhouette_(clustering)

Another elegant method is the Calinsky-Harabasz Index

https://en.m.wikipedia.org/wiki/Calinski%E2%80%93Harabasz_in...

Checkout hdbscan