by joe_the_user 1839 days ago
It seems like this can leave the reader with the wrong impression. Calculus really is "the mathematics of Newtonian physics". This is just "some mathematics that might help a bit in your intuitions of deep learning".

IE, Deep learning is fundamentally just about getting the mathematically simple but complex and multi-layered "neural networks" to do stuff: training them, testing them and deploying them. There are many intuitions about these things but there's no complete theory - some intuitions involve mathematical analogies and simplifications while others involve "folk knowledge" or large scale experiments. And that's not saying folks giving math about deep learning aren't proving real things. It's just that they aren't characterizing the whole or even a substantial part of such systems.

It's not surprising that a complex construct like a many-layered ReLU network can't be fully characterized or solved mathematically. You'd expect that of any arbitrarily complex algorithmic construct. Differential equations of many variables and arbitrary functions also can't have their solutions fully characterized.


As a PhD student who sort of burned out on this type of research, I agree that the complexity of Neural Networks as a mathematical construct makes them very difficult to analyze. This might also have to do with Deep learning theory being a subset of learning theory which is subject to "No Free Lunch" [1], which means that you always have to be very careful not to try to prove something that turns out to be impossible.

That being said, research on the Kernel regime is one of the very cool ideas, in my opinion, to gain traction in this field in the past few years. To summarize: "If you make a neural network wide enough, it gains the power to control its output on each individual input separately, and will begin to fit its training data perfectly". Of course, the real pleasure is in understanding all the mathematical details of this statement!
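A rough way to see the "wide enough networks fit their training data perfectly" claim is the sketch below. It is a simplification and not the full kernel-regime analysis: the hidden layer stays at its random initialization and only the output weights are fitted (a random-features model). The helper names (`solve`, `max_train_error`), the six sample points, and the tiny `ridge` term used in the narrow case are all assumptions made for illustration.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = (aug[r][n] - tail) / aug[r][r]
    return sol

def max_train_error(width, xs, ys, rng, ridge=1e-9):
    """Fit only the output layer of a random one-hidden-layer ReLU net.

    Returns the largest absolute error on the training points.
    """
    weights = [rng.gauss(0, 1) for _ in range(width)]
    biases = [rng.gauss(0, 1) for _ in range(width)]
    # phi[i][k] = activation of hidden unit k on training input i
    phi = [[max(0.0, w * x + b) for w, b in zip(weights, biases)] for x in xs]
    cols = [list(c) for c in zip(*phi)]
    n = len(xs)
    if width >= n:
        # Kernel view: solve (Phi Phi^T) alpha = y, then out = Phi^T alpha.
        gram = [[dot(phi[i], phi[j]) for j in range(n)] for i in range(n)]
        alpha = solve(gram, ys)
        out = [dot(alpha, col) for col in cols]
    else:
        # Too few units to interpolate: plain least squares
        # (tiny ridge only guards against dead ReLU units).
        ata = [[dot(cols[p], cols[q]) + (ridge if p == q else 0.0)
                for q in range(width)] for p in range(width)]
        atb = [dot(col, ys) for col in cols]
        out = solve(ata, atb)
    preds = [dot(out, row) for row in phi]
    return max(abs(p - y) for p, y in zip(preds, ys))

rng = random.Random(0)
xs = [-1.0, -0.6, -0.2, 0.2, 0.6, 1.0]
ys = [rng.uniform(-1, 1) for _ in xs]  # arbitrary labels, no structure
for width in (2, 200):
    print(width, max_train_error(width, xs, ys, rng))
```

With 200 hidden units the 6x6 Gram matrix is invertible in practice, so the model can set its output on each training input independently; with 2 units it cannot.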

[1] : https://en.wikipedia.org/wiki/No_free_lunch_theorem

I got my master's years ago so now I'm a strict amateur. That said, I don't think the "No free lunch theorem" is very "interesting". It's nearly tautological that no approximation method works for "any" function. The set of predictable/interesting/useful/"real-world" functions is going to have measure 0 compared to white noise so "any function" will basically look like white noise and can't be predicted. Approximating functions/sequences with vanishingly low Kolmogorov complexity is more interesting, impossible in general by Gödel's theorem but what's the case "on average"? (depends on the choice process and so ill-defined but defining it might be interesting). The kernel regime stuff looks interesting but I don't know its relation to wide networks.

Neural networks "tend to generalize well in the real world". That's a pretty fuzzy statement imo since "real world" is hardly defined but it's still what people experience and it's more useful to provide a more precise model where this works rather than a model where this doesn't work.

Also, there's good theory on deep networks as universal approximators, as well as theories of wide/shallow networks [1].

[1]: https://arxiv.org/abs/1901.02220

> Neural networks "tend to generalize well in the real world".

I've always interpreted that as "we've found an algorithm that could, given a foreseeable amount of computing power and maybe some tweaks, simulate human decision making".

It isn't so much that neural networks can approximate the real world as they can approximate human perception of the real world.

Well, I quote the statement to show how vague it is, among other things.

Neural networks are "universal approximators" in that they work as well as virtually any previous approximation method. So given a big snapshot of input data and human judgement on it, they can approximate that. They can also approximate a snapshot of some input-output pairs not produced by humans but having patterns (solutions to differential equations, for example).

So, they can approximate what humans do in a given domain. But there's no reason to think they're acting in the same way as humans and I'd say very few people seriously working on ML believe that.

This intuition is very dangerous and leads to huge misconceptions about deep neural nets.

Neural nets don't learn anything like us, and they don't reproduce our functions. We build on massive amounts of general symbolic knowledge, and can zero shot tasks (without explicit examples) easily.

Neural networks really should be seen as just giant random functions that you progressively modify in tiny ways until they fit your data. As parent says, we've just been lucky or good at constraining these functions in a way that they can only learn useful functions (i.e. convnets) or that they somehow learn these more quickly.

Humans certainly do not build on massive amounts of symbolic knowledge because we are absolutely terrible at symbolic knowledge. Reliably reasoning through a basic logical argument is a specialist skill. Even reviewing evidence before making decisions is uncommon; most humans operate on a look -> assess -> do model where the tricky bit is well approximated by a neural net. Which is why neural nets seem to be so good at real-world tasks.

It is completely plausible that when neural nets get scaled up to something approaching human-brain numbers of connections they will well approximate a human brain or be a few tweaks away. Obviously it won't be knowable until state of the art gets there, but there is no reason to think human intelligence is going to be complicated. It is one evolutionary step up from some pretty basic animals.

Maybe you’re talking about a different kind of symbolic knowledge than the OP. To give an example, humans can instantly tell whether an arbitrary sentence is grammatical or not, which is a deep kind of symbolic reasoning that computers absolutely cannot do right now. And humans can also get the semantic meaning.

Then again math is hard for us. So I think there are nuances.

Neural nets fundamentally cannot operate the same way a brain does, because they cannot create an abstract representation of a problem, and then gradually and deliberately manipulate that mental model until they develop a solution. They just don't work that way, with current structures. They basically apply a single pass of a very complex function to the data, and spit out a result.

That isn't a problem of scale, it's a problem of architecture. This is one of the reasons DeepMind decided to tackle StarCraft. It's very difficult to solve StarCraft without your AI having some ability to develop and then manipulate a mental model of the game, because that's what you need to construct and unfold original, non-linear strategies.

Neural nets generalise because they have to approximate the data at a lower resolution, it's not that they're constrained to only learn what is useful. They're lossy compressors, but they have a unique property that most lossy compressors don't have. They cannot learn all the properties of the input data - partly because they can't hold that much information - but uniquely because neurons cannot be modified in isolation. A change in one neuron changes the influence of every other neuron in that layer, on the next layer. So it's difficult to learn granular properties of specific examples, because the entire net is affected when you do that (and many granular properties that are learned, will be unlearned in subsequent examples). The deeper the net, the less able earlier layers are to extract granular information from the input. They have to extract very abstract information, and they will gradually converge on an abstraction strategy that works.

That's why residual blocks are interesting. They pass that low-level information to later blocks (which have an easier time processing the granular details) while also leveraging the ability of earlier blocks to extract abstract information. It allows you to extract and combine information at multiple levels of granularity (or abstraction).
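A minimal sketch of the skip connection being described, using the usual residual form y = x + F(x) with a single ReLU layer standing in for F. The function names and the 3x3 zero weight matrix are illustrative assumptions, not taken from any particular architecture.

```python
def matvec(matrix, vector):
    """Multiply a dense matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def relu(vector):
    return [max(0.0, v) for v in vector]

def plain_block(x, weights):
    """Ordinary layer: the input only survives through the learned transform."""
    return relu(matvec(weights, x))

def residual_block(x, weights):
    """Residual layer: the input is added back, so low-level detail passes through."""
    return [xi + fi for xi, fi in zip(x, relu(matvec(weights, x)))]

x = [1.0, -2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]

# With an uninformative (all-zero) transform the plain block destroys the input,
# while the residual block hands it to the next block unchanged.
print(plain_block(x, zero_weights))
print(residual_block(x, zero_weights))
```

The learned transform only has to model a correction on top of the identity, which is one way to read "pass low-level information to later blocks".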

Convnets are also invariant to certain transformations (e.g. translation, and to some degree scale), which I think is a better definition than "can only learn something useful." They're forced to learn information that is more general, which increases the usefulness of each bit, which means you get a higher density of usefulness per FLOP. But you also lose specific information in that process. What if location is meaningful? For example, audio spectrogram analysis can suffer from that property, because specific location on the Y axis is highly meaningful.
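The translation point can be checked on a toy 1D signal. In this sketch (the signal, kernel and function names are made up for illustration), a convolution layer is translation-equivariant: shifting the input shifts the feature map. Adding a global max pool makes the result translation-invariant, which is exactly where positional information is lost.

```python
def conv1d(signal, kernel):
    """'Valid' 1D cross-correlation, the operation a conv layer applies."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def shift_right(signal, steps):
    """Translate a signal to the right, padding with zeros on the left."""
    return [0] * steps + signal[:len(signal) - steps]

signal = [0, 0, 1, 2, 1, 0, 0, 0, 0, 0]
kernel = [1, 0, -1]
shifted = shift_right(signal, 3)

features = conv1d(signal, kernel)
features_of_shifted = conv1d(shifted, kernel)

# Equivariance: convolving the shifted input equals shifting the feature map.
print(features_of_shifted == shift_right(features, 3))

# Invariance after global max pooling: the pooled value ignores where the
# pattern occurred, so "where" can no longer be recovered downstream.
print(max(features) == max(features_of_shifted))
```

For a spectrogram, the same mechanism applied along the frequency axis would treat a pattern at a low frequency and the same pattern at a high frequency as identical after pooling.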

What I meant by "forced to learn something useful" is what you put more clearly: being forced to generalize.
NFL theorems aren’t an argument about noise, they’re an argument about the uncountability of real numbers. NFL states that, averaged over all problems, any optimization method performs as poorly as any other, or equivalently, _that if an optimization method does well on some problems, it must do equally poorly on some other problems_, and those others aren’t necessarily noise, they could be anything. The problem is you don’t know in advance which problems it is going to do poorly on. You hope it does poorly on noise or on problems that you don’t care about, but you can’t tell. That is a very different statement from what you’re saying, and it’s as non-trivial as Gödel’s and Turing’s statements on decidability.
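The "averaged over all problems" part can be verified by brute force on a toy case. This sketch uses the supervised-learning flavour of NFL on a four-point domain with binary labels; the three learners and all names are invented for illustration, and it only demonstrates the averaging claim on this finite example, not the general theorem.

```python
from itertools import product

DOMAIN = [0, 1, 2, 3]
TRAIN, TEST = DOMAIN[:2], DOMAIN[2:]  # learner sees labels on TRAIN only

def copy_first_label(train_labels):
    return [train_labels[0]] * len(TEST)

def always_zero(train_labels):
    return [0] * len(TEST)

def flip_last_label(train_labels):
    return [1 - train_labels[-1]] * len(TEST)

def average_off_training_accuracy(learner):
    """Mean accuracy on unseen points, averaged over every possible target."""
    targets = list(product([0, 1], repeat=len(DOMAIN)))  # all 16 labelings
    total = 0.0
    for target in targets:
        train_labels = [target[x] for x in TRAIN]
        predictions = learner(train_labels)
        hits = sum(p == target[x] for p, x in zip(predictions, TEST))
        total += hits / len(TEST)
    return total / len(targets)

for learner in (copy_first_label, always_zero, flip_last_label):
    print(learner.__name__, average_off_training_accuracy(learner))
```

Every learner scores exactly 0.5 on average: for each target where a rule guesses an unseen label correctly, there is another target with the same training labels where the same guess is wrong.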
It seems like it aims at giving somebody who would like to get started doing theoretical research in the field some pointers and basic insights. I don't think it does a particularly bad job at this, in particular given that it will be a book chapter? The target audience is probably people who have had some exposure to Functional Analysis and the like before.
There are a few works that try to put deep learning on some theoretical basis. I like this one, for example:

https://arxiv.org/abs/1703.00810

This goes beyond mere intuition, but it is also still very far from a “complete theory”.

I find it disappointing that so few people in deep learning work on the theoretical foundations.

What are some subfields of mathematics that you would say are crucial for gaining a proper understanding of all the things related to deep learning (e.g. let's say the paper you linked)? Even though the theory isn't complete, I'm sure a grounding in certain fields of mathematics will be helpful.
This is always difficult to answer, and it will probably be a mixture of many. However, I am currently following categorical approaches to machine learning. Category Theory is the area of mathematics that studies composable structures, i.e. like layers in a deep network. It is very abstract and was invented to solve problems in algebraic topology, but has been fruitful in other areas as well.
That you mention this illustrates that the situation today is "take whatever math-stuff you have, throw it against neural networks and see what you get". IE, I'm pretty sure not much progress has been made with category theory and neural networks - but you might be the first.

I've seen differential equations, Markov chains, differential geometry and other stuff. We might be in heady days before the "big breakthrough" is made. But these constructs might be inherently pathological (even then, non-pathological variants might be possible).

It’s good for paper publication, and for the public sector. Take an obscure area of math and throw it against the latest trend.

Whether it’s useful is an open question.

It can be useful for innovation in the aggregate.
Could you give some favourite references, some use of category theory in ML which gives good results compared to standard approaches?

Is there a group doing this in Zurich?

Dynamical systems and chaos theory (especially for neural networks), information theory (especially for the paper linked), probability theory (especially the more foundational and axiomatic work)
You can start from this. https://arxiv.org/abs/1603.04929
Of the many "understanding neural networks" papers this is one of the few valuable ones.
Agreed. Until we get to the point where there are theorems of the form, for example, "Given a problem satisfying conditions X, the optimal number of layers to minimize expected training time for data satisfying Y is Z", it is just stamp collecting.