Hacker News
by boombard 4219 days ago
Nice article. You could consider maximum spanning trees [1] as a way to prune your correlation graph; they are very effective at suggesting the underlying structure or kinetics of a system. Just run a minimum spanning tree algorithm using the inverse of your correlation as the edge weight.

[1] http://en.wikipedia.org/wiki/Spanning_tree
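
A minimal sketch of that idea in plain Python, in case it helps. It assumes "inverse" means the distance 1 - |r| (other transforms such as 1/|r| would give the same tree), and the 4-variable correlation matrix is made up for illustration. Kruskal's minimum spanning tree algorithm on those distances yields a maximum spanning tree over the correlations.

```python
def maximum_spanning_tree(corr):
    """Return edges (i, j, r) of a maximum spanning tree over |corr|.

    Runs Kruskal's minimum spanning tree algorithm on the distance
    1 - |r|, which is one possible way to "invert" a correlation.
    """
    n = len(corr)
    parent = list(range(n))  # union-find forest, one set per variable

    def find(x):
        # Follow parent pointers to the set representative,
        # shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All undirected edges, shortest distance (strongest correlation) first.
    edges = sorted(
        (1.0 - abs(corr[i][j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )

    tree = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # skip edges that would close a cycle
            parent[ri] = rj
            tree.append((i, j, corr[i][j]))
    return tree


# Hypothetical correlation matrix, not taken from the article.
corr = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.3],
    [0.2, 0.8, 1.0, 0.7],
    [0.1, 0.3, 0.7, 1.0],
]
tree = maximum_spanning_tree(corr)
for i, j, r in tree:
    print(f"{i} -- {j}  (r = {r})")
```

With n variables the result always has n - 1 edges, so the full correlation graph collapses to a single backbone of the strongest links that still connects everything.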

Another approach is to use PCA on the adjacency matrix. This can generate interesting clusters based on the latent variables. At the risk of self-promotion, I co-authored a paper on this technique [2], which validated known pathways in a metabolic network.

[2] http://www.biomedcentral.com/1471-2105/13/197
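
A rough illustration of the PCA idea, not the method from the paper. This sketch treats each row of a made-up adjacency matrix as an observation, centers the columns, and finds only the first principal component by power iteration, so that it needs nothing beyond the standard library. The matrix, the function name and the iteration count are all assumptions for the example.

```python
def first_principal_component(matrix, iterations=200):
    """Return (component, scores) for the first principal component.

    Rows are treated as observations and columns as features. The
    leading eigenvector of the covariance matrix is found by power
    iteration.
    """
    n = len(matrix)
    m = len(matrix[0])

    # Center each column on its mean.
    means = [sum(row[k] for row in matrix) / n for k in range(m)]
    centered = [[row[k] - means[k] for k in range(m)] for row in matrix]

    # Sample covariance matrix of the columns.
    cov = [
        [sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(m)]
        for a in range(m)
    ]

    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [1.0] + [0.0] * (m - 1)
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]

    # Project every row (node) onto the component.
    scores = [sum(r[k] * v[k] for k in range(m)) for r in centered]
    return v, scores


# Hypothetical weighted adjacency matrix: nodes 0-2 and nodes 3-5 form
# two tight groups joined by one weak link between nodes 2 and 3.
adjacency = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.2, 0.0, 0.0],
    [0.0, 0.0, 0.2, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
component, scores = first_principal_component(adjacency)
for node, score in enumerate(scores):
    print(node, round(score, 3))
```

In this toy case the two groups land on opposite sides of zero along the first component. The overall sign of a principal component is arbitrary, so only the split matters, not which group comes out positive.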

Anyway this is a great field to explore, glad to see it getting traction on HN!

A maximum spanning tree might be misleading, as it's easy to interpret a missing edge as no correlation. When building a tree, weak correlations may be included out of necessity, while stronger ones that would create cycles are omitted.

If several dimensions are correlated just about equally strongly, you can get very different trees based on small random variation. There's no guarantee that all significant correlations are displayed, or that correlated dimensions are visually close to one another.
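
A small sketch of that instability, using made-up numbers. Three variables are pairwise correlated at roughly 0.5; shifting two of the values by 0.02 changes which edge gets dropped. The helper below is a hypothetical Kruskal-style routine that takes the strongest remaining edge first.

```python
def max_spanning_tree_edges(weights, n):
    """Kruskal's algorithm, taking the strongest remaining edge first.

    `weights` maps an edge (i, j) to its correlation; `n` is the
    number of variables. Returns the set of edges kept in the tree.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            x = parent[x]
        return x

    tree = set()
    for (i, j), r in sorted(weights.items(), key=lambda kv: -abs(kv[1])):
        ri, rj = find(i), find(j)
        if ri != rj:  # keep the edge only if it joins two components
            parent[ri] = rj
            tree.add((i, j))
    return tree


# Two hypothetical samples of the same system, differing only by noise.
sample_a = {(0, 1): 0.51, (0, 2): 0.50, (1, 2): 0.49}
sample_b = {(0, 1): 0.49, (0, 2): 0.50, (1, 2): 0.51}

tree_a = max_spanning_tree_edges(sample_a, 3)
tree_b = max_spanning_tree_edges(sample_b, 3)
print(sorted(tree_a))  # edge (1, 2) is dropped
print(sorted(tree_b))  # edge (0, 1) is dropped
```

In both samples the dropped edge has a correlation of 0.49, nearly as strong as the ones that were kept, yet it disappears from the picture entirely.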

I agree, it's not perfect - just a useful abstraction. The same goes for arbitrary correlation thresholds or a p<0.05 significance level: often you lose information but gain insight. From personal experience, I've seen MSTs map out underlying structures that validate the classical chemical kinetics of a system in a logical path, something that would not have been apparent with ordinary thresholding approaches.

Basically, IMO it's good to use all of these techniques together to get a good picture of your system. In the end the greatest limitation is the ability of human cognition to interpret the results, which frankly needs all the help it can get.

Thank you for the feedback. I prefer to use a graph instead of a tree because I want to spot clusters of relations.