by woopwoop 1514 days ago
Honestly at this point it kind of is magic. These things are knocking out astonishing novel tasks every month, but the state of our knowledge is "why does sgd even work lol". There is no coherent theory.

> "why does sgd even work lol"

I find this hand a little overplayed.

It depends on the degree of fidelity we demand of the answer and how deep we want to go questioning the layers of answers. However, if one is happy with a LOL CATS fidelity, which suffices in many cases, we do have a good enough understanding of SGD -- change the parameters slightly in the direction that makes the system work a little bit better, rinse and repeat.

No one would be astonished that using such a system leads to better parameter settings than one's starting point, or at least not significantly worse.

It's only when we ask more questions, and deeper questions, that we get to "we do not understand why SGD works so astonishingly well".
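For anyone new to this, the "change the parameters slightly, rinse and repeat" loop can be sketched roughly as below. This is only a toy: the linear model, the squared-error loss, the function name `sgd_linear_regression`, and every hyperparameter are assumptions picked for illustration, not anything specific from this thread.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=20, batch_size=8, seed=0):
    """Minibatch SGD on mean squared error for a linear model y ~ X @ w."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                               # starting point
    losses = [np.mean((X @ w - y) ** 2)]          # full-data loss before training
    for _ in range(epochs):
        order = rng.permutation(n)                # reshuffle the data every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            grad = 2.0 * X[idx].T @ residual / len(idx)  # gradient on this minibatch only
            w -= lr * grad                        # small step in the downhill direction
        losses.append(np.mean((X @ w - y) ** 2))  # full-data loss after each epoch
    return w, losses

# Synthetic data (assumed): a noisy linear target in 5 dimensions.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=200)

w, losses = sgd_linear_regression(X, y)
print(f"initial loss {losses[0]:.3f}, final loss {losses[-1]:.3f}")
```

On a convex toy like this, ending up better than the starting point is unsurprising, which is the LOL CATS level of understanding. The open questions are about what happens when the loss is non-convex and the model can fit anything.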

Yeah, I didn't mean to imply that "Why does SGD result in lower training loss than the initial weights?" is an open question. But I don't think even lolcatz would call that a sufficient explanation. After all, if the only criterion is "improves on initial training loss", you could just try random weights and pick the best one. The non-convexity already makes SGD pretty mysterious, and that is without even getting into the generalization performance, which seems to imply that SGD is somehow implicitly regularizing.

With over-parameterized neural networks, the problem essentially becomes convex and even linear [1], and in many contexts provably converges to a global minimum [2], [3].

The question then becomes: why does this generalize [4], given that the classical theory of Vapnik and others [5] becomes vacuous, no longer guaranteeing lack of over-fitting?

This is less well understood, although there is recent theoretical work here too.

[1] Lee et al (2019). Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. https://proceedings.neurips.cc/paper/2019/hash/0d1a9651497a3...

[2] Allen-Zhu et al (2019). A convergence theory for deep learning via over-parameterization. https://proceedings.mlr.press/v97/allen-zhu19a.html

[3] Du et al (2019). Gradient Descent Finds Global Minima of Deep Neural Networks. http://proceedings.mlr.press/v97/du19c.html

[4] Zhang et al (2016). Understanding deep learning requires rethinking generalization. https://arxiv.org/abs/1611.03530

[5] Vapnik (1999). The nature of statistical learning theory.

I don't disagree, except perhaps about the lolcatz's demand for rigour. "Improve with small and simple steps till you can't" is not a bad idea after all.

BTW your randomized algorithm with a minor tweak is surprisingly (unbelievably) effective -- randomize the weights of the hidden layers, do a gradient descent on just the final layer. Note the loss is even convex in the last layer weights if a matching/canonical activation function is used. In fact you don't even have to try different random choices, but of course that would help. The random kitchen sinks line of results is a more recent heir to this line of work.

I suspect you already know this, and also that the noise in SGD does indeed regularize; the way it does so for convex functions has been well understood since the 70s. So I am leaving this tidbit for others who are new to this area.
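A rough sketch of the "random hidden layer, train only the final layer" trick, for the curious. Everything concrete here is an assumption made for illustration: the two-ring toy data, the ReLU hidden layer with 200 units, the added constant feature, and the helper names `random_features` and `train_last_layer`. The output layer uses a sigmoid with logistic loss, which is a matching (canonical) pair, so the problem is convex in the trained weights.

```python
import numpy as np

def random_features(X, W, b):
    """Fixed random hidden layer: ReLU(X @ W + b), scaled by 1/sqrt(width).
    W and b are drawn once and never trained."""
    return np.maximum(X @ W + b, 0.0) / np.sqrt(W.shape[1])

def train_last_layer(H, y, lr=0.5, steps=5000):
    """Plain gradient descent on mean logistic loss over the output weights only.
    With a sigmoid output and labels in {0, 1} this loss is convex in v."""
    v = np.zeros(H.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(H @ v)))   # predicted probabilities
        v -= lr * H.T @ (p - y) / len(y)     # gradient of the mean logistic loss
    return v

rng = np.random.default_rng(0)

# Toy data (assumed): two concentric rings, not linearly separable in the raw inputs.
n = 400
radius = np.where(np.arange(n) < n // 2, 1.0, 2.5)
angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
X = np.c_[radius * np.cos(angle), radius * np.sin(angle)] + 0.1 * rng.normal(size=(n, 2))
y = (np.arange(n) >= n // 2).astype(float)   # inner ring = 0, outer ring = 1

# One random draw of hidden weights, plus a constant column acting as the output bias.
W = rng.normal(size=(2, 200))
b = rng.normal(size=200)
H = np.c_[random_features(X, W, b), np.ones(n)]

v = train_last_layer(H, y)
acc = np.mean(((H @ v) > 0) == (y == 1))
print(f"training accuracy with a frozen random hidden layer: {acc:.2f}")
```

Only `v` is ever updated. Trying several random draws of `W` and `b` and keeping the best would be the "pick the best random weights" variant, but a single draw is often enough on easy problems like this one.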

Why are there so few local minima, you mean?

I think it’d have to be related to the huge number of dimensions it works on. But I have no idea how I’d even begin to prove that.

It's not even certain that they are few. What's rather unsettling is that with these local moves of SGD the parameters settle on a good enough local minimum, in spite of the fact that we know that many local minima exist that have zero or near zero training loss. There are glimmers of insight here and there, but the thing is yet to be fully understood.

> Honestly at this point it kind of is magic.

How much of that magic is smoke and mirrors? For example, the First Tech Challenge (from FIRST Robotics) used TensorFlow to train a model to tell a white sphere from a golden cube using a mobile phone's on-board camera.

The first time I saw it, it did seem pretty magical. Then in testing I realized it was basically a glorified color sensor.

I think these things make for great and astonishing demos but don't hold up to their promise. Happy to hear real-world examples that I can look into though.

Even if it were practically useless (which it is not, although the practical applications are less impressive than the research achievements at this point), it would be magical. Deep learning has dominated ImageNet for a decade now, for example. One reason this is magical is that the SOTA models are extremely overparameterized. There exist weights that perform perfectly on the training data but give random answers on the test data [0]. But in practice these degenerate weights are not found during SGD. What's going on there? As far as I know there is no satisfying explanation.

[0] https://arxiv.org/abs/1611.03530
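A toy version of this puzzle shows up even in linear regression, where it can at least be checked by hand. The sketch below is an assumed setup (20 training points, 200 parameters, Gaussian data, all variable names made up here) and says nothing directly about deep networks. It builds two weight vectors that both fit the training data essentially exactly, one of which is terrible on fresh data, and checks which one gradient descent started from zero ends up at.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 200                          # far more parameters than training points
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d) / np.sqrt(d)
y = X @ w_true

# Plain gradient descent on squared error, started from zero.
w_gd = np.zeros(d)
lr = 0.05
for _ in range(2000):
    w_gd -= lr * X.T @ (X @ w_gd - y) / n

# The minimum-norm interpolating solution, computed directly.
X_pinv = np.linalg.pinv(X)
w_min = X_pinv @ y

# A "degenerate" interpolator: add a large vector from the null space of X.
# It changes nothing on the training set but a lot everywhere else.
z = rng.normal(size=d)
z_null = z - X_pinv @ (X @ z)           # component of z that X cannot see
w_bad = w_min + 10.0 * z_null

def mse(w, A, t):
    return np.mean((A @ w - t) ** 2)

X_test = rng.normal(size=(1000, d))
y_test = X_test @ w_true

print("train MSE  gd:", mse(w_gd, X, y), " bad:", mse(w_bad, X, y))
print("test  MSE  gd:", mse(w_gd, X_test, y_test), " bad:", mse(w_bad, X_test, y_test))
print("gd matches min-norm solution:", np.allclose(w_gd, w_min, atol=1e-6))
```

Here the bias toward the small-norm solution comes purely from the optimizer and its starting point, since nothing in the loss asks for it. Whether something analogous explains why SGD avoids the degenerate weights in deep networks is exactly the part that is not settled.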

If you look at these “degenerate” parameterizations, they’re clearly islands in the sea of weight parameter space. It’s clear that what you’re searching for is not a “minimum” per se but an amorphous fuzzy blobby manifold. Think of it like sculpting a specific 3D shape out of clay. Sure there are exact moves to sculpt the shape, but if you’re just gently forming the clay you can get very close to the final form but still have some rough edges.

As for a formal analysis, I just can’t imagine there existing a formal analysis of ML that can describe the distinctly qualitative aspects of it. It’s like coming up with physics equations to explain art.

I mentored an FTC team that was using the vision system this year, and my overall impression was that the TensorFlow model was absolute garbage and probably performed worse than a simple "identify blobs by color" algorithm would have.

The vision model was tolerably decent at tracking incremental updates to object positioning, but for some reason would take 2+ seconds to notice that a valid object was now in view (which is quite a lot, in the context of a 30s autonomous period), and frequently identified the back walls of the game field as giant cubes.

There's a big difference between a glorified color sensor and a well-trained deep learning library (I can say this with authority because I hired an intern at Google to help build one of those detectors). It's still not magic, but a well-trained network is robust and generalizable in a way that a color sensor cannot be.

It depends on the angle from which people approach the problem. In my current field of cancer biology / drug response, people often don't know the features as well as they know everyday features such as natural images or natural text. In that setting, understanding the feature space / biological systems is more important than understanding the models themselves. The models are (if I may say, merely) a tool to search and narrow down the factors. After that, scientists can design experiments to further interrogate the complex system itself. As the ML model grows bigger, the interrogative space also grows. Depending on the goal, it may not be necessary to have a fully interpretable model as long as the features themselves help advance the understanding of the complex biological system.

No, neural networks are stagnant on most key NLP tasks. While there have been some advances in cool tasks, the tasks needed for NLU are potently wintered.