| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by MagicMoonlight 970 days ago
	If it was shallow then it wouldn’t take 25 years for a human brain to fully train. The fact that some parts of it need that much data mean they must be way up the hierarchy.

3 comments

GranularRecipe 970 days ago

The reason for deep learning is that shallow networks are very hard (or impossible) to train. In that sense, long time of training is evidence for shallow networks.

link

IshKebab 970 days ago

No it's because shallow networks can't express complex functions. If you think about it the shallowest network is pretty much a lookup table. They can theoretically model any function, but the number of parameters needed means in practice they can't. Deep networks can learn much more complex functions for the same number of parameters.

link

simiones 969 days ago

> They can theoretically model any function, but the number of parameters needed means in practice they can't.

Even theoretically, no they can't. They can theoretically model any continuos function.

Plus, even for continuous functions, the theorem only proves that, for any function, there exists some NN that approximates it to arbitrary precision. It is not known whether there is some base NN + finite training set that could be used to arrive at that target NN using some algorithm in a finite number of steps.

link

nightski 969 days ago

I'm not sure it is all that interesting of a distinction seeing as non-continuous functions can be approximated by continuous ones (basically the entire premise of a digital computer).

link

simiones 969 days ago

I don't think this is right at all. Digital computers express non-continuous functions, and they sometimes use those to approximate continuous functions.

For example, for a function f(x) defined on R with f(x) = -x if x < 0 and f(x) = 7+x if x >= 0, how would you approximate it by a continuous function g(x) with precision lower than, say, 1 (i.e. |f(x) - g(x)| < 1 for any x in R)?

And of course, there are functions with much worse discontinuities than this.

link

IshKebab 969 days ago

He was talking about the actual signals in chips being continuous I believe.

link

GranularRecipe 969 days ago

This is not only an issue for shallow networks. As far I know, both points apply to all feed-forward networks regardless of depth.

link

simiones 969 days ago

Yes, both apply regardless of depth (as long as it is finite, I imagine).

link

PartiallyTyped 970 days ago

I mean… a 3 layer network is a Universal approximator… and you can very much do network distillation… it’s just that getting them wide enough to learn whatever we want them to isn’t computationally efficient. You end up with much larger matmuls which let’s say for simplicity exhibit cubic scaling in the dim. In contrast, you can stack layers and that comes much more computationally friendly because your matmuls are smaller.

Of course you then need to compensate with residuals, initialisation, normalisation, and all that, but it’s a small price to pay for scaling much much better with compute.

link

IshKebab 970 days ago

> a Universal approximator

Yes that's exactly my point. A lookup table is a universal approximator. Good luck making AI with LUTs.

It's kind of like the halting problem or the no-free-lunch theorem. Interesting academic properties but they don't really have any practical significance and often confuse people into thinking that things like formal verification and lossless compression are impossible.

link

GranularRecipe 970 days ago

What the ratio for the number of parameters required to learn some complex function between a shallow network and a deep network (preferably as a function of the complexity)?

link

numpad0 970 days ago

Is that to say something along, flapping wings is overcomplicated and stupid?

link

gwern 969 days ago

That doesn't follow. Shallow networks can be harder to train than a deep one, which is one of the old arguments for why you should train a deep NN despite its many disadvantages (like latency - often a matter of life and death for biological organisms!). The depth allows easier learning.

This is why today, if you need a low-latency NN, which means a shallow one, often your best bet is to train a deep one first and then distill or prune it down into a shallow one. Because the deep one is so much easier, while training a shallow one from scratch without relying on depth may be an open research question and effectively impossible.

link

ralfd 969 days ago

> If it was shallow then it wouldn’t take 25 years for a human brain to fully train.

It doesn't. You can speak perfectly fine with children. And in fact some teenagers think they know everything.

link