by radq 1089 days ago
We were missing two architectural patterns that were needed to get deeper nets to converge: residual nets [1], which solved gradient propagation, and batch normalization [2], which solved initialization.

[1] Residual nets (2015): https://arxiv.org/abs/1512.03385

[2] Batch normalization (2015): https://arxiv.org/abs/1502.03167
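
To make the two ideas concrete, here is a minimal sketch in plain Python. It is not the method from either paper: the scalar "layers", the tanh nonlinearity, the weight of 0.5, and the function names `batch_norm` and `input_gradient` are all assumptions chosen for illustration. It compares the gradient reaching the input of a 50-layer stack with and without skip connections, and normalizes a small batch to zero mean and unit variance.

```python
import math

def batch_norm(batch, eps=1e-5):
    """Normalize a list of scalars to zero mean and unit variance.

    Simplified: no learned scale/shift and no running statistics.
    """
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

def input_gradient(x, weight, depth, residual):
    """Return d(output)/d(input) for `depth` stacked scalar layers.

    Plain layer:    h -> weight * tanh(h)
    Residual layer: h -> h + weight * tanh(h)
    """
    grad = 1.0
    h = x
    for _ in range(depth):
        # Derivative of weight * tanh(h) with respect to h.
        local = weight * (1.0 - math.tanh(h) ** 2)
        if residual:
            grad *= 1.0 + local  # the skip path contributes the constant 1
            h = h + weight * math.tanh(h)
        else:
            grad *= local
            h = weight * math.tanh(h)
    return grad

# Toy values (assumed): input 0.5, weight 0.5, 50 layers.
plain_grad = input_gradient(0.5, weight=0.5, depth=50, residual=False)
residual_grad = input_gradient(0.5, weight=0.5, depth=50, residual=True)
normalized = batch_norm([10.0, 20.0, 30.0, 40.0])

print(plain_grad, residual_grad, normalized)
```

In this toy setup every plain layer multiplies the gradient by at most 0.5, so it shrinks geometrically with depth, while every residual layer multiplies it by at least 1.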


Also quasi-linear activation functions (which prevent vanishing gradients), tons of regularisation (e.g. convolutions) and more adaptive gradient descent (faster convergence). I still met people in the early 2010s who tried to make neural networks work using only a few dozen units. Academia is pretty slow. What people also forget is that libraries like PyTorch or TensorFlow simply didn't exist. I wrote my own neural network stacks, complete with backpropagation, from scratch in C++ back then.
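
A rough illustration of the vanishing-gradient point, as a sketch only: it multiplies the activation derivatives along a 20-layer chain and ignores the weights entirely. The depth, the pre-activation value of 1.0, and the helper names are assumptions picked for the example.

```python
import math

def sigmoid_grad(x):
    """Derivative of the logistic sigmoid; never larger than 0.25."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def chained_grad(local_grad, pre_activations):
    """Multiply the activation derivatives along a chain of layers."""
    grad = 1.0
    for z in pre_activations:
        grad *= local_grad(z)
    return grad

# Hypothetical setup: the same positive pre-activation at each of 20 layers.
pre_activations = [1.0] * 20
sigmoid_chain = chained_grad(sigmoid_grad, pre_activations)
relu_chain = chained_grad(relu_grad, pre_activations)

print(sigmoid_chain, relu_chain)
```

The sigmoid chain shrinks by a factor of about 0.2 per layer here, while the ReLU chain passes the gradient through unchanged as long as the units stay active.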
LeCun et al (1989) had backprop working for digit recognition.

LeCun, Bottou, et al (2002) in "Efficient Backprop" described techniques for improving backprop algorithms.

Rosenblatt had a working perceptron for classifying images in the 1950s (!). And yet it took 60 years before the theory and compute power had developed enough for all of this to be interesting outside of small, purely academic experiments.
Handwriting recognition on checks (LeCun et al 1989) wasn't really a small, purely academic experiment.
And yet classical OCR techniques continued to dominate. Nothing happened in the industry on that front for over 20 years. That's as academic as it gets.
Yes, but the tweet is talking about single layer networks!
AlexNet predated that though.