Hacker News new | ask | show | jobs
by grph123dot 1177 days ago
Your explanation is crystal clear. I suppose it works well in practice, but is there any reason it works that well?
1 comments

Per the original paper, empirically it’s been found that neural network weights often have low intrinsic rank. It follows, then, that the change in the weights as you train also have low intrinsic rank, which means that you should be able represent them with a lower rank matrix.
Since we are in ELI5, it seems that the concept of low rank approximation is required to understand this method.

(1) https://en.wikipedia.org/wiki/Low-rank_approximation

Edited: By the way, it seems to me that there is an error in the wikipedia page because if the Low-rank approximation takes a larger rank then the bound of the error should decrease, and in this page the error increases.

>> that the change in the weights as you train also have low intrinsic rank

It seems that the initial matrix of weights has a low rank approximation A and this implies that the difference E = W - A is small, also it seems that PCA fails when E is sparse because PCA is designed to be optimum when the error is gaussian.

In terms of PCA, PCA is also quite expensive computationally. Additionally, you'd probably have to do SVD instead.

Since the weights are derived from gradient descent, yeah we don't really know what the distributions would be.

A random projection empirically works quite well for very high dimensions, and is of course very cheap computationally.

Does this mean the matrices are highly compressible?
kinda/yes. To translate to more intuitive concepts: the matrices don't contain much variance in as many degrees of freedom as they could.

Think of a point cloud of a piece of paper floating in the wind. It would be a 3xn list of points, but "really" it's a 2d piece of paper.

Just like I can rewrite the number 27 as 333 or 8+19 or (2^3)+(2^4)+3.. Given a single matrix one can find myriad ways to rewrite it as a sequence of matrices that have the same (or similar) numeric value, but with interesting or desirable properties. :D

My favorite example (which is used in signal processing) is to take your ugly matrix and rewrite it as a set of smaller matrices where most of the elements are zero, or a power of 2.

It turns out, computers can multiply by zeros and powers of two very fast