Hacker News new | ask | show | jobs
by 1971genocide 3904 days ago
As a mathematical peasant who does modelling.

# If you have a small data set - you set notation.

# If you have 100s of larger data set - try to formulate your problem as a graph theory problem - Its easy to walk small graphs.

# If you have 10,000 data points - start to think of your problem in terms of matrix.

I know there is going to strong disagreement about my rule of thumb but I found it useful for many problems.

1 comments

Fair enough. Graph theory has a collection of quite well-known algorithms that run fast, which mostly come down to clever manipulations of depth-first and breadth-first search algorithms. These typically run in O(n+k), where n is the number of vertices and k is the number of edges. In large networks, it's still often the case that k is proportional to n (most people have less than 1000 facebook friends, so we can bound k by 1000n...), so this is still linear in n.

I tend to see lots of graph algorithms which run in O(n^3) time, which tends to be fine for medium-sized data sets, but yeah, breaks down when you get into the 10,000+ sized data sets.

And then there's 'matrix' methods you mention. These can often be tied to an area called algebraic graph theory, which looks at properties of matrices (like the adjacency matrix) to try to pull out information about the graph. Relaxing the graph problem and allowing the full range of matrix operations allows some fast linear algebraic methods to be used, though the final solution may need some massaging to make sense, and may not be perfectly optimal.

A great example of this is spectral clustering, which uses eigenvectors of a relative of the adjacency matrix to perform a clustering of the graph's nodes. The optimization that the clustering is solving is actually an (iirc) NP-hard problem; casting it as an eigenvalue problem gives an O(n^2) approximation of a solution, which then gets hammered into something we can live with...