|
|
|
|
|
by PartiallyTyped
970 days ago
|
|
I mean… a 3 layer network is a Universal approximator… and you can very much do network distillation… it’s just that getting them wide enough to learn whatever we want them to isn’t computationally efficient. You end up with much larger matmuls which let’s say for simplicity exhibit cubic scaling in the dim. In contrast, you can stack layers and that comes much more computationally friendly because your matmuls are smaller. Of course you then need to compensate with residuals, initialisation, normalisation, and all that, but it’s a small price to pay for scaling much much better with compute. |
|
Yes that's exactly my point. A lookup table is a universal approximator. Good luck making AI with LUTs.
It's kind of like the halting problem or the no-free-lunch theorem. Interesting academic properties but they don't really have any practical significance and often confuse people into thinking that things like formal verification and lossless compression are impossible.