Per the original paper, empirically it’s been found that neural network weights often have low intrinsic rank. It follows, then, that the change in the weights as you train also have low intrinsic rank, which means that you should be able represent them with a lower rank matrix.
Edited: By the way, it seems to me that there is an error in the wikipedia page because if the Low-rank approximation takes a larger rank then the bound of the error should decrease, and in this page the error increases.
>> that the change in the weights as you train also have low intrinsic rank
It seems that the initial matrix of weights has a low rank approximation A and this implies that the difference E = W - A is small, also it seems that PCA fails when E is sparse because PCA is designed to be optimum when the error is gaussian.
kinda/yes. To translate to more intuitive concepts: the matrices don't contain much variance in as many degrees of freedom as they could.
Think of a point cloud of a piece of paper floating in the wind. It would be a 3xn list of points, but "really" it's a 2d piece of paper.
Just like I can rewrite the number 27 as 333 or 8+19 or (2^3)+(2^4)+3.. Given a single matrix one can find myriad ways to rewrite it as a sequence of matrices that have the same (or similar) numeric value, but with interesting or desirable properties. :D
My favorite example (which is used in signal processing) is to take your ugly matrix and rewrite it as a set of smaller matrices where most of the elements are zero, or a power of 2.
It turns out, computers can multiply by zeros and powers of two very fast