Hacker News new | ask | show | jobs
by Immortal333 2193 days ago
"We show that this architecture gives rise to universal learning capabilities in the limit, with effective model capacity increasing as a function of network size in a manner comparable with deep ReLU networks."

What exactly this statement means?

4 comments

> "We show that this architecture"

They are demonstrating a new technique, Gated Linear Networks.

> gives rise to universal learning capabilities in the limit

they claim to show that with an unbounded amount of time and memory (network size / # params) this architecture can be used to learn/approximate any function

> with effective model capacity increasing as a function of network size

Model capacity here refers to the ability to memorize a mapping between inputs and outputs. They show that a network with more layers/weights will "memorize" more.

> in a manner comparable with deep ReLU networks

"Deep ReLU networks" are referring to commonly used modern deep neural network architectures. ReLU is a popular activation function: https://en.wikipedia.org/wiki/Rectifier_(neural_networks)

They mean that if you add parameters, the learning capability of their approach grows by a similar amount as if you would add the same number of parameters to a conv+ReLu network (the standard approach).

That "universal" is a weird claim in my opinion, but they mean that with enough parameters, this architecture can learn everything.

I was able to get the second part of the statement. but I haven't seen the use of "in limit" in a statement like this.

Yes, the universal approximation is a strong claim. NN has been proven to have universal approximation theoretically.

The result here is stronger, in the sense that typical NN universality results are statements with respect to just capacity (and not how you optimise them). Here, the result holds with respect to both capacity + a choice of suitable no regret online convex optimisation algorithm (e.g. online gradient descent). Of course, this is just one desirable property of a general purpose learning algorithm.
Do you know of any other kinds of universal function approximators that also have a good regret bound like this, or is yours the first one?

global convergence to any arbitrarily weird function at rate O(sqrt(T)) seems amazing, almost too good to be true, and I’m wondering what the catch is. Maybe it’s just a moderately nice property but not extraordinary? Maybe there are some horrible constants hiding in there?

1) It's definitely not the first. Other methods have universal guarantees of some form or other with well quantified rates of convergence, e.g. k-NN would be the best known example.

2) There are some restrictions on the class of density functions it can model, so arbitrarily weird is a bit strong, but the model class is very general.

3) The weights needed to model any function in this class although finite, can be arbitrarily large. The regret of a single neuron has a dependence on the diameter of the convex set your weights reside in, so there is a nasty constant of sorts in there, and this will also carry over when you analyse the regret of a complete network. With such a general statement, it's unavoidable sadly.

4) The universality result on its own is just a nice property. See it as the first stepping stone to a more meaningful analysis. What you really want is for the model class to grow as you add more neurons, using weights within a realistic range, and that the method performs well in practice on some problems people care about -- we provide empirical evidence that the capacity grows on par with deep relu networks with our capacity experiments, and show a bunch of results where the method works, but we don't have a theoretical characterisation of the class of density functions it can model well (i.e. if the function has some nice structural property, then a network of reasonable size is guaranteed to learn a good approximation quickly). Such a result would be extraordinary in my eyes. Because the network is composed of simple and well understood building blocks, I am optimistic that such an analysis will be possible in the future.

thank you very much for the detailed response!

much to chew on here, it really does seem like a very interesting class of models. from the papers it sounds like in practice clipping weights to a small set works okay, so the constant factors shouldn't be too bad.

i may have to sit down and try to implement these...

Universal approximation doesn't seem that strong a claim at all, you could use simple polynomials to achieve just that (to some extent RELUs are polynomials as well, provided you use 0/1 or integer weights, but that's probably besides the point).

It would be much weirder for a function with lots of parameters to not have universal approximation in some sense, as it would imply that you lose some degrees of freedom somewhere.

"In limit" is just shorthand for "as N tends to infinity" (not necessarily N, but you get the idea).
"in limit" here means as you approach the edge case of having an unlimited number of parameters.
Presumably (haven't read the paper yet) that their network provably becomes a universal function approximator in the limit of infinite size.

Reading... actually the proof seems to be in

https://arxiv.org/abs/1712.01897

As the network size increases, it can learn more complex functions. When the network gets bigger and bigger, it gets closer to being able to learn any arbitrary function.