by minimaxir 893 days ago

I built a pipeline to automatically cluster and visualize large amounts of text documents in a completely unsupervised manner:

- Embed all the text documents.

- Project to 2D using UMAP which also creates its own emergent "clusters".

- Use k-means clustering with a high cluster count depending on dataset size.

- Feed the ChatGPT API ~10 examples from each cluster and ask it to provide a concise label for the cluster.

- Bonus: Use DBSCAN to identify arbitrary subclusters within each cluster.

It is extremely effective, and I have a theoretical implementation of a more practical use case: using said UMAP dimensionality reduction for better inference. There is evidence that current popular text embedding models (e.g. OpenAI ada, which outputs 1536D embeddings) are way too big for most use cases and could be giving poorly specified results for embedding similarity as a result, in addition to raising costs for the entire pipeline.
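The labeling step (sample ~10 documents per cluster, then ask for a concise label) can be sketched roughly as follows. This is only an illustration: the function names and the prompt wording are assumptions, not the author's actual code or prompt, and the API call itself is left out.

```python
import random
from collections import defaultdict

def sample_cluster_examples(docs, labels, n_examples=10, seed=0):
    """Group documents by cluster label and sample up to n_examples from each."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for doc, label in zip(docs, labels):
        by_cluster[label].append(doc)
    return {
        label: rng.sample(members, min(n_examples, len(members)))
        for label, members in by_cluster.items()
    }

def build_label_prompt(examples):
    """Format sampled documents into one labeling prompt (hypothetical wording)."""
    bullet_list = "\n".join(f"- {text}" for text in examples)
    return (
        "Here are example documents from one cluster:\n"
        f"{bullet_list}\n"
        "Provide a concise label (a few words) describing what they have in common."
    )

# Toy data standing in for real documents and their k-means assignments.
docs = ["app crashes on login", "crash after update",
        "love the dark mode", "dark theme is great"]
labels = [0, 0, 1, 1]
prompts = {
    cluster: build_label_prompt(examples)
    for cluster, examples in sample_cluster_examples(docs, labels).items()
}
```

Each value in `prompts` would then be sent to the chat model as one request per cluster.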

10 comments

pseudonom- 893 days ago

Funny, I did almost the exact same thing: https://github.com/colehaus/hammock-public. Though I project to 3D and then put them in an interactive 3D plot. The other fun little thing the interactive plotting enables is stepping through a variety of clustering granularities.

dleeftink 893 days ago

Thanks for sharing. I'd like to know what the (re)compute time might be when adding, say, another million documents using this pipeline. The cluster embedding approach in my view, while streamlined, still adds a (sometimes significant) time bump when high throughput is required.

I see some significant speedups can be achieved when discretising dimensions into buckets, and doing a simple frequency count of associated buckets -- leaving only highly related buckets per document. These 'signatures' can then be indexed LSH style and a graph constructed from documents with similar hashes.

When the input set is sufficiently large, this graph contains 'natural' clusters, without any UMAP or k-means parameter tuning required. When implemented in BQ, I achieve sub-minute performance for 5-10 million documents, from indexing to clustering.
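One possible reading of the bucket-signature idea, in plain Python rather than BigQuery. Everything here is an assumption made for illustration: the signature keeps each vector's `top_k` largest-magnitude dimensions with a coarse bucket id, documents sharing at least `min_shared` tokens are linked, and clusters are the connected components of that graph. The commenter's actual bucketing and frequency-count scheme is not specified and may differ.

```python
from collections import defaultdict
from itertools import combinations

def signature(vector, n_buckets=4, top_k=3, lo=-1.0, hi=1.0):
    """Keep the top_k largest-magnitude dimensions, each discretised to a bucket id."""
    width = (hi - lo) / n_buckets
    strongest = sorted(range(len(vector)), key=lambda d: abs(vector[d]), reverse=True)[:top_k]
    tokens = set()
    for d in strongest:
        clipped = min(max(vector[d], lo), hi)
        bucket = min(int((clipped - lo) / width), n_buckets - 1)
        tokens.add((d, bucket))
    return frozenset(tokens)

def cluster_by_signature(vectors, min_shared=2, **kwargs):
    """Link documents sharing >= min_shared tokens; return connected components."""
    sigs = [signature(v, **kwargs) for v in vectors]
    index = defaultdict(list)  # inverted index: token -> document ids
    for doc_id, sig in enumerate(sigs):
        for token in sig:
            index[token].append(doc_id)
    shared = defaultdict(int)  # (doc_a, doc_b) -> number of shared tokens
    for doc_ids in index.values():
        for a, b in combinations(doc_ids, 2):
            shared[(a, b)] += 1
    parent = list(range(len(vectors)))  # union-find over documents

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), count in shared.items():
        if count >= min_shared:
            parent[find(a)] = find(b)
    groups = defaultdict(list)
    for doc_id in range(len(vectors)):
        groups[find(doc_id)].append(doc_id)
    return sorted(groups.values())

vectors = [
    [0.9, 0.8, 0.0, 0.1], [0.95, 0.85, 0.05, 0.0],
    [-0.9, 0.0, 0.1, -0.8], [-0.85, 0.05, 0.0, -0.9],
]
groups = cluster_by_signature(vectors, min_shared=2, top_k=2)
```

The pair counting only touches documents that share a token, which is what keeps this cheaper than comparing all pairs when signatures are sparse.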

sshumaker 893 days ago

You can also look at BERTopic, which has this functionality as an open source library:

https://maartengr.github.io/BERTopic/index.html

sonium 893 days ago

I did something similar (but not for documents) but I’m struggling with selecting the optimal number of clusters.

bravura 893 days ago

Cluster stability is a good heuristic that should be better known:

For a given k:

  for n=30 or 100 or 300 trials:
    subsample 80% of the points
    cluster them
    compute the Fowlkes-Mallows score (available in sklearn) of the subset against the original clustering, restricted to the instances in the subset (otherwise you can't compute it)
  output the average f-m score

This essentially measures how "stable" the clusters are. The Fowlkes-Mallows score decreases when instances pop over to other clusters in the subset.
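To make the score concrete, here is a minimal pair-counting sketch of Fowlkes-Mallows: the number of point pairs grouped together in both clusterings, divided by the geometric mean of the number of pairs grouped together in each. It is O(n^2) and only meant to show the definition; in practice `sklearn.metrics.fowlkes_mallows_score` is the thing to use.

```python
from itertools import combinations
from math import sqrt

def fowlkes_mallows(labels_a, labels_b):
    """Pair-counting Fowlkes-Mallows score between two labelings of the same points."""
    together_both = together_a = together_b = 0
    for i, j in combinations(range(len(labels_a)), 2):
        in_a = labels_a[i] == labels_a[j]  # pair shares a cluster in clustering A
        in_b = labels_b[i] == labels_b[j]  # pair shares a cluster in clustering B
        together_a += in_a
        together_b += in_b
        together_both += in_a and in_b
    if together_a == 0 or together_b == 0:
        return 0.0
    return together_both / sqrt(together_a * together_b)
```

Because only pair co-membership is counted, the score does not care which label ids each clustering happens to assign, which is why two independent k-means runs can be compared directly.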

If you do this and plot the average score versus k, you'll see a sharp dropoff at some point. That's the maximal plausible k.

edit: Here's code

  import numpy as np
  from sklearn.cluster import KMeans
  from sklearn.metrics import fowlkes_mallows_score

  def stability(Z, k):
      # Reference clustering on the full dataset
      kmeans = KMeans(n_clusters=k, n_init="auto")
      kmeans.fit(Z)
      scores = []
      for i in range(100):
          # Randomly select 80% of the data, without replacement
          idx = np.random.choice(Z.shape[0], int(Z.shape[0]*0.8), replace=False)
          kmeans2 = KMeans(n_clusters=k, n_init="auto")
          kmeans2.fit(Z[idx])

          # Compare the two clusterings on the subsampled points only
          score = fowlkes_mallows_score(kmeans.labels_[idx], kmeans2.labels_)
          scores.append(score)
      scores = np.array(scores)
      return np.mean(scores), np.std(scores)
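Once a mean score per k is available, the "sharp dropoff" can be located programmatically. A small sketch, assuming "sharp dropoff" is taken to mean the largest single decrease between consecutive k values (that definition, the function name, and the scores in the example are all made up for illustration):

```python
def last_k_before_dropoff(mean_scores):
    """Return the k that immediately precedes the largest drop in mean stability."""
    ks = sorted(mean_scores)
    if len(ks) < 2:
        raise ValueError("need scores for at least two values of k")
    # Pair each drop (score at k minus score at the next k) with the k before it.
    drops = [(mean_scores[a] - mean_scores[b], a) for a, b in zip(ks, ks[1:])]
    _, k = max(drops)
    return k

# Made-up scores, purely to show the call shape.
example_scores = {2: 0.98, 3: 0.97, 4: 0.95, 5: 0.71, 6: 0.68}
best_k = last_k_before_dropoff(example_scores)
```

Eyeballing the plot is still worthwhile, since a noisy curve can have several comparable drops.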

rmellow 893 days ago

A simple metric for that is the Silhouette

https://en.m.wikipedia.org/wiki/Silhouette_(clustering)

Another elegant method is the Calinski-Harabasz Index

https://en.m.wikipedia.org/wiki/Calinski%E2%80%93Harabasz_in...
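For reference, a from-scratch sketch of the mean silhouette coefficient with Euclidean distance: for each point, a is the mean distance to the other members of its own cluster, b is the lowest mean distance to any other cluster, and the point's score is (b - a) / max(a, b). This brute-force version is only illustrative; scikit-learn ships `silhouette_score` and `calinski_harabasz_score` in `sklearn.metrics`.

```python
from math import dist

def silhouette(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # groups the nearby points together
bad = silhouette(points, [0, 1, 0, 1])   # splits the nearby points apart
```

To choose k, the score would be computed for each candidate clustering and the highest value preferred.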

jasonjmcghee 893 days ago

Check out HDBSCAN

bart_spoon 893 days ago

When doing DBSCAN on the subclusters, do you cluster on the 2-D projected space? Do you use the original 2-D projection you used prior to k-means, or does each subcluster get its own UMAP projection?

minimaxir 893 days ago

I DBSCAN in the 2D projected space.

These aren't visualized: I use identified clusters to look at manually to find trends.
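A rough sketch of running DBSCAN separately inside each k-means cluster on the 2D projection. The brute-force `dbscan` below is written out only to keep the example dependency-free; with real data one would call `sklearn.cluster.DBSCAN` instead. The `eps` and `min_samples` values and the function names are arbitrary assumptions.

```python
from collections import deque
from math import dist

def dbscan(points, eps, min_samples):
    """Brute-force DBSCAN; returns one label per point, with -1 meaning noise."""
    n = len(points)
    labels = [None] * n  # None = not visited yet

    def neighbors(i):
        # Includes the point itself, matching scikit-learn's min_samples convention.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels

def subclusters(points, cluster_labels, eps, min_samples):
    """Run DBSCAN inside each top-level cluster; returns {cluster: sub-labels}."""
    result = {}
    for c in sorted(set(cluster_labels)):
        members = [p for p, label in zip(points, cluster_labels) if label == c]
        result[c] = dbscan(members, eps, min_samples)
    return result

points_2d = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (20, 20)]
flat = dbscan(points_2d, eps=0.5, min_samples=3)
nested = subclusters(points_2d, [0, 0, 0, 0, 0, 0, 1], eps=0.5, min_samples=3)
```

Splitting by k-means cluster first keeps each DBSCAN run small, which fits the memory concern raised elsewhere in the thread.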

appplication 893 days ago

Is it possible to dbscan on the unprojected space or does that lead to poor effectiveness? Also what led you to choose dbscan vs another technique?

minimaxir 893 days ago

Poor effectiveness. (Again, another hint that working in high-dimensional space may not be ideal.)

I was not aware of a robust clustering technique that's better than, or as easy to use as, DBSCAN.

vslira 893 days ago

Any reason to pick DBSCAN instead of HDBSCAN*?

potatoman22 893 days ago

Interesting. What do you use the visualization for? Looking at trends in the documents?

minimaxir 893 days ago

Let's say you want to look at a large dataset of user-submitted reviews for your app. User reviews are written extremely idiosyncratically, so all traditional NLP methods will likely fail.

With the pipeline mentioned, it's much easier to look at cluster density to identify patterns and high-level trends.

jncfhnb 893 days ago

Why not just use DBSCAN though

minimaxir 893 days ago

You can use DBSCAN instead of k-means, but DBSCAN has a worst-case memory complexity of O(n^2), so things can get spicy with large datasets, which is why I opt to use it only for subclusters. k-means also fixes the number of clusters, which is good for visualization sanity.

https://scikit-learn.org/stable/modules/generated/sklearn.cl...

Xenoamorphous 893 days ago

Isn’t the embedding step much slower than clustering? How many documents are you dealing with?

For a news aggregator I worked on, I disregarded k-means because you have to know the number of clusters in advance, and I think it will cluster every document, which is bad for the actual outliers in a dataset.

Agglomerative clustering yielded the best results for us. HDBSCAN was promising but doing weird things with some docs.

whakim 893 days ago

The embedding step is certainly slower than clustering, but the memory requirements blow up pretty fast when you're doing density-based clustering on a dataset of even, say, 100k embeddings.
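As a back-of-the-envelope check on that blow-up, assume the worst case where a full n x n matrix of float64 pairwise distances is materialized (real implementations may use far less, depending on the neighbor search and eps):

```python
def dense_distance_matrix_bytes(n_points, bytes_per_value=8):
    """Bytes needed to hold a full n x n matrix of pairwise distances."""
    return n_points * n_points * bytes_per_value

for n in (10_000, 100_000, 1_000_000):
    gib = dense_distance_matrix_bytes(n) / 2**30
    print(f"{n:>9,} points -> {gib:,.1f} GiB")
```

At 100k points that is 8 x 10^10 bytes, roughly 75 GiB, and the figure grows by 100x for every 10x more points.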

mr_mitm 893 days ago

Which libraries are you using, in particular for the first step?

minimaxir 893 days ago

Embeddings is just SentenceTransformers: https://www.sbert.net/

I used the bge-large-en-v1.5 model (https://huggingface.co/BAAI/bge-large-en-v1.5) because I could, but the common all-MiniLM-L6-v2 model is sufficient. The trick is to batch generate the embeddings on a GPU, which SentenceTransformers mostly does by default.

Other libraries are the typical ones (umap for UMAP, scikit-learn for k-means/DBSCAN, chatgpt-python for ChatGPT interfacing, plotly for viz, pandas for some ETL). You don't need to use a bespoke AI/ML package for these workflows and they aren't too complicated.

refulgentis 893 days ago

It's just SentenceTransformers, but the wrong model is common because no one reads the SentenceTransformers docs. MiniLM-L6-V2 is for symmetric search (target document has the same wording as the source document); MiniLM-L6-V3 is for asymmetric search (target document is likely to contain material matching the query in the source document).

tomthe 893 days ago

Can you share your ChatGPT prompt, please? I do something similar at the moment and I am trying out BERTopic, but ChatGPT also seems worth a try.

mike_ivanov 893 days ago

Why 2D? (edit: just for the viz, or is there some other reason?)

minimaxir 893 days ago

Both the viz, and that the 2D UMAP projection is actually enough to get accurately delineated topics.

Hence why I think the typical embedding dimensionality is way way too high.

markisus 893 days ago

Do you think 1D could work? Maybe topic-space is some sort of tree-shaped structure where documents live in the thin strands.

minimaxir 893 days ago

1D could work on certain datasets but it wouldn't be ideal.

adammarples 893 days ago

Why not just embed directly to 2d? Does it give worse results than UMAP?

visarga 893 days ago

cluster naming was still an open problem pre-LLM