Hacker News
by bckr 1913 days ago
Agreed. It's really more like playing with Lego. Take residual connections, for instance. The insight was that information wasn't traveling far enough when the networks got too deep. So they just... plugged the earlier layers into the later layers. And this has been a very important development.

Or things like batch norm. We don't know why it's important. People do math to try to explain what's going on, not so much to figure out where we should go next.

Related: Understanding is a poor substitute for convexity https://www.edge.org/conversation/nassim_nicholas_taleb-unde...

2 comments

> things like batch norm. We don't know why it's important

It is pretty well understood.

Posts like this really piss me off, because you can make anything sound small. So don't take the next few paragraphs as me attacking you, I'm just venting a general sentiment I've had for a while. (Looking back after typing it out, I might actually flesh it out and post it on my blog. I hope it's somewhat thought-stimulating for others as well.)

It's like someone saying "well, electrical engineering is really like playing with building blocks. Take Zener diodes for example, the problem was that you can have a lot of power in a circuit, but if there's a power spike it might break. So they just... plugged in a piece that breaks down by shunting the power spike into the ground, then resets. And this is now a major piece of electronics." Or "so, everyone is always going off about Arab mathematicians, but one 'big development' they did was to invent the zero - just make up a symbol where previously you'd leave a space. It's basically just a change of notation!".

Deep learning theory (statistical, information-theoretic and optimisation-wise) is our process of understanding how to design systems that adapt themselves to feedback, and how to encode tasks in them. Batchnorm was inspired by one thing (internal covariate shift), and that thing was plausible, but in systems as complex (not complicated, complex as in interactions) as universal function approximators, adding one thing can radically change things. As it turns out, batchnorm smooths the optimisation landscape, decouples parameter magnitude from direction, and improves signal propagation. How else would you have figured this out without having systems like neural networks with batchnorm already in place that you can study? And now there are lines of work emerging that do away with batchnorm, but have distilled its positive properties into smaller techniques (https://arxiv.org/pdf/2102.06171.pdf, Soham De gave a lecture at our lab recently).
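
To make the "decouples parameter magnitude and directions" point concrete, here is a minimal sketch in plain Python. It is not from any of the papers; the helper names (batch_norm, linear), the toy data and the factor of 10 are all made up for illustration, and eps is set to 0 so the invariance is exact up to floating-point rounding (real implementations use a small eps, which makes it approximate).

```python
import math
import random

def batch_norm(values, eps=0.0):
    """Normalize a batch of scalars to zero mean and unit variance.

    Illustrative only: no learned scale/shift, and eps defaults to 0.
    """
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def linear(batch, weights):
    """One output unit: a weighted sum of each input row."""
    return [sum(w * x for w, x in zip(weights, row)) for row in batch]

random.seed(0)
# Hypothetical toy batch: 8 examples with 4 features each.
batch = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
weights = [0.5, -1.2, 0.3, 2.0]
scaled = [10.0 * w for w in weights]  # same direction, 10x the magnitude

out = batch_norm(linear(batch, weights))
out_scaled = batch_norm(linear(batch, scaled))

# Largest difference between the two normalized outputs.
max_diff = max(abs(a - b) for a, b in zip(out, out_scaled))
print(max_diff)
```

The normalized output only depends on the direction of the weight vector, not its length, which is one way to read the "decoupling" claim.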

Same thing with skip connections: Jürgen Schmidhuber will rightfully point out we've had highway networks since his heyday, but details matter. Before you do it, it is really not intuitive that skip connections will be beneficial in such a complex system, because until we had them and started studying them in complex systems, the perspectives they are now studied under (as learning small adjustments to a signal, as an ensemble of shallow learners, and others) had not been developed.
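
A rough sketch of the "small adjustments to a signal" view, again in plain Python. Everything here is an assumption picked for illustration: the helper names, the 50-layer depth, the linear layers without nonlinearities and the tiny Gaussian weights (std 0.01). It is a toy, not how any real network is initialised.

```python
import random

def layer(x, weights):
    """A tiny linear layer: each output is a weighted sum of the inputs."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def residual_block(x, weights):
    """Skip connection: output = input + the layer's adjustment."""
    return [xi + fi for xi, fi in zip(x, layer(x, weights))]

def norm(v):
    """Euclidean length of a vector."""
    return sum(a * a for a in v) ** 0.5

random.seed(0)
dim, depth = 4, 50
# Hypothetical stack of small random weight matrices, one per layer.
stack = [[[random.gauss(0, 0.01) for _ in range(dim)] for _ in range(dim)]
         for _ in range(depth)]

x = [1.0, -2.0, 0.5, 3.0]
plain, skip = x, x
for w in stack:
    plain = layer(plain, w)          # signal has to pass through every layer
    skip = residual_block(skip, w)   # signal is carried along, layer only nudges it

print(norm(x), norm(plain), norm(skip))
```

In this toy the plain stack shrinks the signal to practically nothing after 50 layers, while the residual stack stays close to the input because each block only adds a small correction on top of it.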

And how would you? Without having them working really well, you'd have to start thinking about them from first principles in the giant design space of nonlinear, nonconvex functions, without being able to prove anything because we don't have the mathematical formalism yet.

Deep learning theory and nonconvex optimisation right now are a new physics born out of the marriage of information theory, computer science and computer engineering (and not surprisingly in a ménage à trois, a lot of groundwork was laid by the French and other weird Europeans /joke). We have a bunch of theory nerds trying to explain what we see in elegant and concise mathematical frameworks and trying to come up with testable predictions, and a bunch of experimentalists actually coming up with ways of testing it, gluing together the bits of understanding we have with soft knowledge to make the learning engine go brr, and giving feedback to the theorists on what held up, what didn't work predictably and what didn't go according to predictions. And people mouth off about the empirical nature of things.

Well, I ask: How else would you figure this stuff out? I think there is a cult of genius at play here, where if you don't start with category theory and platonic ideal conceptions of reality and derive your model without any experiment, you are somehow lesser.

Well, despite what people like to sell, disruption is a lie, everything is incremental, and without having the hackers make things work in clunky ways, the theorists would circle jerk themselves in creative dead ends because of a lack of stimulus.

And as always, there are a lot of people who make themselves sound smarter by affecting superiority and disdain for this scientific process, while in the background nerds deepen our understanding of the universe.