| HN Mirror



	by minimaxir 893 days ago
	You can use DBSCAN instead of k-means, but DBSCAN has a worst-case memory complexity of O(n^2), so things can get spicy with large datasets, which is why I opt to use it only for subclusters. k-means also fixes the number of clusters, which is good for visualization sanity. https://scikit-learn.org/stable/modules/generated/sklearn.cl...
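	A rough sketch of the two-stage idea described above, using scikit-learn: k-means for a fixed number of top-level clusters, then DBSCAN run separately inside each one. This is not the commenter's actual code; the synthetic data, `n_clusters=3`, `eps=2.0`, and `min_samples=5` are all illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for document embeddings (assumption: real embeddings go here).
X, _ = make_blobs(n_samples=600, centers=6, n_features=8,
                  cluster_std=0.5, random_state=0)

# Step 1: k-means with a fixed number of top-level clusters.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
top_labels = kmeans.fit_predict(X)

# Step 2: DBSCAN inside each k-means cluster only, so the density-based
# step never has to look at the whole dataset at once.
sub_labels = np.full(len(X), -1)
for k in range(kmeans.n_clusters):
    idx = np.where(top_labels == k)[0]
    # eps and min_samples are illustrative guesses, not tuned values.
    sub_labels[idx] = DBSCAN(eps=2.0, min_samples=5).fit_predict(X[idx])

# A document is then identified by the pair (top_labels[i], sub_labels[i]);
# a sub-label of -1 means DBSCAN marked it as noise within its k-means cluster.
```

	Because DBSCAN only ever sees one k-means cluster at a time, its worst-case memory cost scales with the largest cluster rather than with the full dataset.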

1 comments

Xenoamorphous 893 days ago

Isn’t the embedding step much slower than clustering? How many documents are you dealing with?

For a news aggregator I worked on, I disregarded k-means because you have to know the number of clusters in advance, and I think it will cluster every document, which is bad for the actual outliers in a dataset.

Agglomerative clustering yielded the best results for us. HDBSCAN was promising but did weird things with some docs.
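One way agglomerative clustering can avoid a preset cluster count is a distance threshold, sketched below with scikit-learn's `AgglomerativeClustering`. This is only an illustration, not necessarily the setup used for the aggregator; the toy data and `distance_threshold=3.0` are assumptions.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Two tight groups plus one far-away point standing in for an outlier document.
group_a = rng.normal(loc=0.0, scale=0.1, size=(20, 4))
group_b = rng.normal(loc=5.0, scale=0.1, size=(20, 4))
outlier = np.full((1, 4), 50.0)
X = np.vstack([group_a, group_b, outlier])

# n_clusters=None plus a distance_threshold lets the data decide how many
# clusters come out; merging stops once linkage distance exceeds the threshold.
model = AgglomerativeClustering(n_clusters=None, distance_threshold=3.0)
labels = model.fit_predict(X)

# Cluster sizes: a far-away point should be left in a cluster by itself
# instead of being forced into one of the dense groups.
sizes = np.bincount(labels)
print(model.n_clusters_, sorted(sizes.tolist()))
```

Every point still receives a label, but an outlier tends to end up as a singleton cluster that can be filtered out afterwards by size.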


whakim 893 days ago

The embedding step is certainly slower than clustering, but the memory requirements blow up pretty fast when you're doing density-based clustering on a dataset of even, say, 100k embeddings.
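As a back-of-the-envelope check on that claim, here is the worst case where every pairwise distance has to be held in memory at once. Real implementations often avoid materializing the full matrix, so treat this as an upper bound under that assumption; the helper name is made up for illustration.

```python
def dense_pairwise_bytes(n: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n x n distance matrix (hypothetical helper)."""
    return n * n * bytes_per_value

n = 100_000  # the "100k embeddings" figure from the comment above

float64_bytes = dense_pairwise_bytes(n, 8)  # 8 bytes per float64 distance
float32_bytes = dense_pairwise_bytes(n, 4)  # 4 bytes per float32 distance

print(f"float64: {float64_bytes / 1e9:.0f} GB")  # float64: 80 GB
print(f"float32: {float32_bytes / 1e9:.0f} GB")  # float32: 40 GB
```

Doubling the number of embeddings quadruples these figures, which is why the cost shows up well before the dataset itself feels large.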
