by tom_b 2450 days ago
Glad to see the article mention dropping in the fastcluster package to replace the default R hclust. I would suggest using the parallelDist library as well instead of the standard dist.

I have recently been trying to get up to speed on clustering in general, and hierarchical clustering in particular. The current state of the art seems to be graph-based community detection (the Louvain method): for a set of samples with N features, you start from a k-nearest-neighbors graph and assign a weight to each edge.
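
A minimal sketch of that graph-building step, using only the Python standard library. The function name `knn_graph`, the brute-force neighbor search, and the Jaccard-overlap edge weight are all assumptions for illustration; the comment above does not say which weighting is used, and real code would likely use an approximate nearest-neighbor index instead.

```python
import math

def knn_graph(samples, k):
    """Return {(i, j): weight} for an undirected k-nearest-neighbor graph."""
    n = len(samples)
    neighbors = []
    for i in range(n):
        # Brute-force distances from sample i to every other sample.
        dists = sorted(
            (math.dist(samples[i], samples[j]), j) for j in range(n) if j != i
        )
        neighbors.append({j for _, j in dists[:k]})

    edges = {}
    for i in range(n):
        for j in neighbors[i]:
            a, b = min(i, j), max(i, j)
            # Assumed weighting: Jaccard overlap of the two neighbor sets,
            # counting each point as a member of its own neighborhood.
            na = neighbors[a] | {a}
            nb = neighbors[b] | {b}
            edges[(a, b)] = len(na & nb) / len(na | nb)
    return edges

# Toy data: two well-separated groups of three points each.
samples = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
graph = knn_graph(samples, k=2)
```

On this toy data the graph splits into two disconnected triangles, which is the kind of structure a community detection pass would then pick up.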


> Current state of the art seems to be using graph-based

Do you know how this compares to UMAP?

Nope. But I think UMAP would be a dimensionality reduction step whose output you feed into a graph-based community extraction algorithm? In our current code experiments, we have been starting with an approximate k-nearest-neighbor algorithm to build the graph. UMAP would, at first glance, replace that.

I think there is even a Louvain clustering method (really a graph-based community extraction method) built into several UMAP libraries floating around . . .

I strongly suspect that it's worse than UMAP or Ivis. Graph-based methods are great for some things, but not clustering.

We've found that graph-based community approaches have some really nice benefits on our bioinformatics data.

In particular, we have found that these approaches seem to preserve very small cluster structure "better" than traditional approaches. That is, we have a small group of cells that we know belong in their own cluster, and the graph-based community approaches keep these "small" groups separate from the other clusters nicely.

But we have also noticed (and had some feedback) that we wind up with final modularity scores that are very high: greater than or equal to 0.90 (on a scale from -0.5 to 1). Applied math folks in the graph algorithms world kind of seem to look at that and go "eh, that is so high you should probably just do PCA and move on . . . "
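
For reference, a minimal sketch of how a modularity score can be computed for a given partition, following the standard Newman definition for undirected weighted graphs. The function name, the edge-dictionary layout, and the toy graph (two triangles joined by one bridge edge) are illustrative assumptions, not code from this thread.

```python
def modularity(edges, community):
    """Newman modularity for an undirected weighted graph.

    edges: {(u, v): weight}, with each undirected edge listed once.
    community: {node: community label}.
    """
    total = sum(edges.values())   # m: total edge weight
    inside = {}                   # edge weight that stays within each community
    degree = {}                   # summed weighted degree of each community
    for (u, v), w in edges.items():
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0.0) + w
        degree[cv] = degree.get(cv, 0.0) + w
        if cu == cv:
            inside[cu] = inside.get(cu, 0.0) + w
    # Q = sum over communities of (inside / m) - (degree / 2m)^2
    return sum(
        inside.get(c, 0.0) / total - (d / (2 * total)) ** 2
        for c, d in degree.items()
    )

# Two triangles (0-1-2 and 3-4-5) joined by a single bridge edge 2-3.
edges = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0,
         (3, 4): 1.0, (3, 5): 1.0, (4, 5): 1.0,
         (2, 3): 1.0}
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
lumped = {node: "all" for node in range(6)}

q_split = modularity(edges, split)    # 5/14, about 0.357
q_lumped = modularity(edges, lumped)  # 0.0: one big community scores zero
```

Even this cleanly separated toy graph scores only about 0.36, which gives some sense of how unusual a score of 0.90 or more is.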

Especially given that you could (and people seem to) use UMAP as a precursor to Louvain methods, I'll probably be looking into UMAP to see how it goes. But our current computational bottleneck is the clustering itself (the Louvain community detection on the graph), so we'd like to whittle that runtime down as much as possible.