by Xenoamorphous 893 days ago
Isn’t the embedding step much slower than clustering? How many documents are you dealing with?

For a news aggregator I worked on, I ruled out k-means because you have to know the number of clusters in advance, and as far as I know it assigns every document to a cluster, which is bad for the actual outliers in a dataset.

Agglomerative clustering yielded the best results for us. HDBSCAN was promising but did weird things with some docs.
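To make the contrast with k-means concrete, here is a minimal sketch of agglomerative clustering that stops at a distance cutoff instead of a fixed cluster count. It is a toy, pure-Python, single-linkage version and not necessarily the setup we used: the `agglomerate` function, the `threshold` value, and the 2-D points are all made up for illustration, and real embeddings would more likely use cosine distance and a library implementation.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(points, threshold):
    """Single-linkage agglomerative clustering with a distance cutoff.

    Starts with one cluster per point and repeatedly merges the two
    closest clusters until the closest pair is farther apart than
    `threshold`. Returns a list of clusters, each a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None  # (distance, index of cluster i, index of cluster j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(
                    euclidean(points[a], points[b])
                    for a in clusters[i]
                    for b in clusters[j]
                )
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # nothing left that is close enough to merge
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters


# Hypothetical toy data: two tight groups and one far-away point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (20.0, 20.0)]
clusters = agglomerate(points, threshold=0.5)

# Points that never merged with anything can be treated as outliers.
outliers = [c[0] for c in clusters if len(c) == 1]
print(clusters, outliers)
```

The number of clusters falls out of the cutoff rather than being chosen up front, and a document that is far from everything else stays in its own singleton cluster instead of being forced into the nearest group. This naive loop is far too slow for real data; it is only meant to show the idea.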

1 comment

The embedding step is certainly slower than clustering, but the memory requirements blow up pretty fast when you're doing density-based clustering on a dataset of even, say, 100k embeddings.
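A back-of-the-envelope sketch of why memory blows up, assuming the clustering step ends up materializing a dense pairwise distance matrix. That is an assumption: whether a given HDBSCAN or DBSCAN implementation actually does this depends on the library and its settings. The `pairwise_matrix_gib` helper is hypothetical, written just for this estimate.

```python
def pairwise_matrix_gib(n, bytes_per_value=8, condensed=False):
    """Memory in GiB for a pairwise distance matrix over n items.

    condensed=True counts only the n*(n-1)/2 unique pairs (upper triangle);
    otherwise the full n*n matrix is counted. bytes_per_value=8 assumes
    float64 distances.
    """
    entries = n * (n - 1) // 2 if condensed else n * n
    return entries * bytes_per_value / 2**30


for n in (10_000, 100_000, 1_000_000):
    full = pairwise_matrix_gib(n)
    half = pairwise_matrix_gib(n, condensed=True)
    print(f"{n:>9,} embeddings: full {full:,.1f} GiB, condensed {half:,.1f} GiB")
```

Under that assumption, 100k embeddings already need roughly 75 GiB for a full float64 matrix (about half that in condensed form), and the cost grows quadratically: ten times more documents means a hundred times more memory. The embeddings themselves, by comparison, only grow linearly with the number of documents.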