
by haxton 1020 days ago

Curious to know what value you've seen out of these clusters. In my experience, k-means clustering was very lackluster. Having to define the number of clusters up front was a big pain point too.

You almost certainly want a graph-like structure (overlapping communities rather than clusters).

But unsupervised clustering was almost entirely ineffective for every use case I had :/

2 comments

simonw 1020 days ago

I only got the clustering working this morning, so aside from playing around with it a bit I've not had any results that have convinced me it's a tool I should throw at lots of different problems.

I mainly like it as another example of the kind of things you can use embeddings for.

My implementation is very naive - it's just this:

    sklearn.cluster.MiniBatchKMeans(n_clusters=n, n_init="auto")

I imagine there are all kinds of improvements that could be made to this kind of thing.

I'd love to understand if there's a good way to automatically pick an interesting number of clusters, as opposed to picking a number at the start.

https://github.com/simonw/llm-cluster/blob/main/llm_cluster....


FreakLegion 1020 days ago

There are iterative methods for optimizing the number of clusters in k-means (silhouette and knee/elbow are common), but in practice I prefer density-based methods like HDBSCAN and OPTICS. There's a very basic visual comparison at https://scikit-learn.org/stable/auto_examples/cluster/plot_c....
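As a rough illustration of the silhouette approach mentioned above, the sketch below fits k-means for several values of k and keeps the one with the highest mean silhouette score. The helper name `best_k_by_silhouette`, the candidate range, and the synthetic 2-D data are assumptions. Real embeddings are much higher-dimensional and rarely this clean.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, k_values, seed=0):
    """Return (best_k, scores) where scores maps k -> mean silhouette score."""
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


# Synthetic data: four tight blobs at the corners of a square.
rng = np.random.default_rng(0)
corners = ((0, 0), (6, 0), (0, 6), (6, 6))
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in corners])

best_k, scores = best_k_by_silhouette(X, range(2, 8))
print(best_k, scores)
```

The silhouette score needs at least two clusters, so the scan starts at k=2. It also computes pairwise distances, which can get slow on large collections.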


stefanka 1019 days ago

You could also use a Bayesian version of k-means. It places a Dirichlet process prior over an infinite (in practice truncated) set of clusters, so the most probable number of clusters k is found automatically. I found one implementation here: https://github.com/vsmolyakov/DP_means
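To make the idea concrete, here is a small sketch of the DP-means variant of k-means as it is usually described: a point whose squared distance to every existing center exceeds a penalty `lam` opens a new cluster, so k is never specified. This is written from the general description of the algorithm, not taken from the linked repository. The function name, the `lam` value, and the toy data are assumptions.

```python
import numpy as np


def dp_means(X, lam, max_iter=100):
    """DP-means sketch: k-means where far-away points open new clusters.

    lam is a squared-distance threshold; larger values give fewer clusters.
    """
    X = np.asarray(X, dtype=float)
    centers = [X.mean(axis=0)]  # start with one cluster at the global mean
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        new_labels = np.empty(len(X), dtype=int)
        for i, x in enumerate(X):
            d2 = [float(np.sum((x - c) ** 2)) for c in centers]
            j = int(np.argmin(d2))
            if d2[j] > lam:
                # Too far from every center: this point seeds a new cluster.
                centers.append(x.copy())
                j = len(centers) - 1
            new_labels[i] = j
        # Drop clusters that ended up empty and relabel as 0..k-1.
        used = np.unique(new_labels)
        remap = {int(old): new for new, old in enumerate(used)}
        new_labels = np.array([remap[int(l)] for l in new_labels])
        centers = [X[new_labels == j].mean(axis=0) for j in range(len(used))]
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
    return np.array(centers), labels


rng = np.random.default_rng(0)
X = np.vstack(
    [rng.normal(loc=c, scale=0.3, size=(25, 2)) for c in ((0, 0), (10, 0), (0, 10))]
)
centers, labels = dp_means(X, lam=9.0)
print(len(centers))
```

The trade-off is that the choice of k is replaced by the choice of `lam`, which is a distance scale rather than a count and may be easier to reason about for some data.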

Alternatively, there is a Bayesian GMM in sklearn (sklearn.mixture.BayesianGaussianMixture). When you restrict it to diagonal covariance matrices, you should be fine in high dimensions.
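A sketch of what that might look like with scikit-learn's `BayesianGaussianMixture`, using a Dirichlet process weight prior and diagonal covariances. The upper bound of 10 components, the iteration limit, and the synthetic 16-dimensional data are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic stand-in for embeddings: three groups in 16 dimensions.
rng = np.random.default_rng(0)
X = np.vstack(
    [rng.normal(loc=c, scale=0.5, size=(100, 16)) for c in (-4.0, 0.0, 4.0)]
)

bgm = BayesianGaussianMixture(
    n_components=10,  # upper bound (truncation), not the final cluster count
    covariance_type="diag",  # one variance per dimension, cheap in high dimensions
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
)
labels = bgm.fit_predict(X)

# Components that no point is assigned to are effectively switched off.
effective_k = len(np.unique(labels))
print(effective_k, np.round(bgm.weights_, 3))
```

The number of components actually used depends on the data and on the concentration prior, so `effective_k` should be treated as a suggestion rather than a definitive answer.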


stefanka 1019 days ago

Having close centers might help with the labeling. Let me know if I can help.


nl 1020 days ago

Switch to using HDBSCAN. It's good.


haxton 1020 days ago

The elbow method is a good place to start for finding the number of clusters.

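For reference, a rough sketch of the elbow method: compute the k-means inertia (within-cluster sum of squares) for a range of k and pick the point where the curve bends. The bend detection below (largest gap under the straight line joining the first and last points, after rescaling both axes) is one simple heuristic among several. The helper names and the synthetic data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_curve(X, k_values, seed=0):
    """Within-cluster sum of squares for each k."""
    return [
        KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    ]


def elbow_index(values):
    """Index of the point farthest below the chord from first to last value."""
    y = np.asarray(values, dtype=float)
    x = np.arange(len(y), dtype=float)
    x_scaled = x / x[-1]
    y_scaled = (y - y[-1]) / (y[0] - y[-1])
    # After scaling, the chord runs from (0, 1) to (1, 0), i.e. y = 1 - x.
    return int(np.argmax((1.0 - x_scaled) - y_scaled))


rng = np.random.default_rng(0)
corners = ((0, 0), (6, 0), (0, 6), (6, 6))
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 2)) for c in corners])

k_values = list(range(1, 9))
inertias = inertia_curve(X, k_values)
chosen_k = k_values[elbow_index(inertias)]
print(chosen_k)
```

On real embeddings the curve is often smooth with no obvious bend, in which case plotting the inertias and judging by eye may be more honest than any automatic rule.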

simonw 1020 days ago

That's a useful hint, thanks. I fed it through GPT-4 and got some interesting leads: https://chat.openai.com/share/400f76ae-b53b-4d07-ac31-adcef2... and https://chat.openai.com/share/48650db8-5a29-49c5-84b2-574f53...


visarga 1020 days ago

Use bottom-up (agglomerative) clustering and you get the whole tree. See fclusterdata in scipy.

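A short sketch of the SciPy route. `linkage` builds the full merge tree once, and `fcluster` then cuts it at whatever number of clusters is wanted without re-clustering. `fclusterdata` wraps both steps in one call and returns only the flat labels. The Ward linkage method and the toy data are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, fclusterdata, linkage

rng = np.random.default_rng(0)
X = np.vstack(
    [rng.normal(loc=c, scale=0.3, size=(20, 2)) for c in ((0, 0), (8, 0), (0, 8))]
)

# Full tree: one row per merge, shape (n_samples - 1, 4).
Z = linkage(X, method="ward")

# Cut the same tree at different levels.
labels_3 = fcluster(Z, t=3, criterion="maxclust")
labels_2 = fcluster(Z, t=2, criterion="maxclust")

# One-shot variant: builds the tree internally and returns flat labels.
one_shot = fclusterdata(X, t=3, criterion="maxclust", method="ward")

print(Z.shape, len(set(labels_3)), len(set(labels_2)), len(set(one_shot)))
```

Because the tree is kept, different cluster counts can be explored after the fact. The cost is that `linkage` works from pairwise distances, so memory grows quadratically with the number of items.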