by minimaxir
893 days ago
You can use DBSCAN instead of k-means, but DBSCAN has a worst-case memory complexity of O(n^2), so things can get spicy with large datasets, which is why I opt to use it only for subclusters. k-means also fixes the number of clusters, which is good for visualization sanity. https://scikit-learn.org/stable/modules/generated/sklearn.cl...
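A rough sketch of that two-stage idea, assuming scikit-learn and NumPy are installed. The toy 2-D points stand in for real embeddings, and the values for `eps`, `min_samples`, and `n_clusters` are illustrative guesses rather than settings from the actual project:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

rng = np.random.default_rng(42)

# Toy data (an assumption): 3 well-separated groups, each made of 2 tight sub-blobs.
centers = [(0, 0), (0, 4), (20, 0), (20, 4), (40, 0), (40, 4)]
X = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 2)) for c in centers])

# Stage 1: k-means with a fixed number of top-level clusters.
top = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Stage 2: run DBSCAN separately inside each k-means cluster, so each
# DBSCAN call only ever sees a fraction of the full dataset.
sub = np.full(len(X), -1)
for k in np.unique(top):
    mask = top == k
    sub[mask] = DBSCAN(eps=1.0, min_samples=5).fit_predict(X[mask])

# Each point now has a (top-level, subcluster) pair; a subcluster of -1 means noise.
for k in np.unique(top):
    found = sorted(set(sub[top == k].tolist()) - {-1})
    print(f"cluster {k}: subclusters {found}")
```

Subcluster labels restart at 0 inside every top-level cluster, so the pair `(top[i], sub[i])` is what identifies a subcluster, not `sub[i]` alone.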
|
For a news aggregator I worked on, I disregarded k-means because you have to know the number of clusters in advance, and I think it will cluster every document, which is bad for the actual outliers in a dataset.
Agglomerative clustering yielded the best results for us. HDBSCAN was promising but did weird things with some docs.
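To make the outlier point concrete, here is a small sketch, assuming scikit-learn and NumPy. The toy data, the thresholds, and the choice of average linkage are all assumptions for illustration, not what the aggregator actually used:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, DBSCAN, KMeans

rng = np.random.default_rng(0)

# Toy data (an assumption): two tight blobs plus one stray point between them.
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.1, size=(20, 2))
blob_b = rng.normal(loc=(10.0, 0.0), scale=0.1, size=(20, 2))
outlier = np.array([[5.0, 4.0]])
X = np.vstack([blob_a, blob_b, outlier])  # the outlier is the last row

# k-means: the cluster count is fixed up front and every point gets a label.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: points without enough close neighbors are labeled -1 (noise).
dbscan_labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)

# Agglomerative clustering with a distance threshold instead of a cluster
# count: merging stops once clusters are farther apart than the threshold.
agglo = AgglomerativeClustering(
    n_clusters=None, distance_threshold=3.0, linkage="average"
)
agglo_labels = agglo.fit_predict(X)

print("k-means label of outlier:", kmeans_labels[-1])
print("DBSCAN label of outlier:", dbscan_labels[-1])
print("agglomerative clusters found:", agglo.n_clusters_)
```

With a distance threshold, agglomerative clustering leaves the stray point in a cluster of its own, which you can then filter out by cluster size, while k-means folds it into one of the two requested clusters.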