by hyperion2010 3261 days ago
My own view of this, having spent some time in visual neuroscience, is that if you really want vision that is robust to these kinds of issues, you have to build a geometric representation of the world first and then learn/map categories from that. Trying to jump from a matrix to a label without an intervening topological/geometric model of the world (having two eyes and/or the ability to move can help with this) is asking for trouble, because we think we are recapitulating biology when in fact we are doing nothing of the sort (as these adversarial examples reveal beautifully).

Some disjointed thoughts.

> we think we are recapitulating biology when in fact we are doing nothing of the sort (as these adversarial examples reveal beautifully).

I'm not sure I'd go so far. There's a pretty long list of optical illusions: seeing motion where there clearly is none, not comparing distances correctly and, most relevant here, seeing a face in things that merely look like one. Here are a few selected famous examples: http://brainden.com/face-illusions.htm

Some of those immediately make my brain flag up "FACE". It's only looking in more detail that I see what else is there, but my visual system is clearly being tricked, as would billions of other completely independently grown visual systems. How much better could we do this, and with more subtlety, if we could analyse the whole brain like we can neural networks and target a specific brain?

There's an old experiment showing that a kitten raised without ever seeing horizontal lines will be unable to see them after a certain age, so we know that biological systems struggle with limited visual input.

I'd also say we're doing matrix -> label conversions ourselves, too, unless we're born with a special geometric model. Deep learning also does things in layers, so there's no direct matrix -> label learning happening straight away; that should come much later, after the system has learned to create a higher-level representation of the input.

On a less contrarian side, I wonder how well these things would work if we were to show the networks videos of... everything. Years and years of video. Don't try and add labels yet, but can we add a constraint that we expect the representation to only change slowly? Two very similar frames should not result in the high-level interpretation changing drastically.
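To make the "change slowly" constraint concrete, here's a rough sketch of one way it could be written down as a penalty on per-frame representations. The function name and the toy embeddings are made up for illustration; this isn't taken from any particular paper or library.

```python
# Sketch of a "slowness" penalty on frame representations.
# All names here are illustrative assumptions, not an existing API.

def slowness_penalty(embeddings):
    """Mean squared change between consecutive frame embeddings.

    `embeddings` is a list of equal-length lists of floats, one per video
    frame, in temporal order. Lower values mean the high-level
    representation changes more slowly from frame to frame.
    """
    if len(embeddings) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(embeddings, embeddings[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(embeddings) - 1)


# Toy data: one representation that drifts gently, one that flips every frame.
smooth = [[0.0, 1.0], [0.1, 1.0], [0.2, 1.0]]
jumpy = [[0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

print(slowness_penalty(smooth))
print(slowness_penalty(jumpy))
```

One caveat: on its own this penalty is minimized by a representation that never changes at all, so it would presumably have to be paired with some other term that forces the representation to carry information about the input.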

I think we should refrain from saying we are recapitulating biology until we have reached the point where the machine systems tend to succeed AND fail in the SAME ways that the biological systems do.

We tried that; the reason deep nets are popular is that they outperform geometric (or other problem-specific) models. This might be because they implicitly develop such representations somewhere along the way, or because such a representation is not really necessary for visual classification.

Additionally, introducing ancillary modules is not without cost: you might gain robustness to some kinds of adversarial inputs at the expense of becoming vulnerable to others. There are plenty of ways to fool biological visual systems: cf. magic-eye posters, optical illusions, or the various exploits described in the paper "What the Frog's Eye Tells the Frog's Brain" by Lettvin, Maturana, McCulloch and Pitts.

> outperform

I think that remains to be seen, at least in the general case, since we haven't yet agreed on a measure of performance. The debate around adversarial examples can be interpreted as arguing over the proper measure of performance, although so far it is doing so somewhat implicitly: afaik nobody has formalized a measure of robustness to adversarial examples, and the debate has progressed more by case studies (which is fine, since research into NN robustness is still at quite an early stage, and case studies can help illustrate issues). I think it can fairly be said that neural nets perform well on the ImageNet benchmark and similar measures of performance. But whether those are good measures of performance, or whether some kind of metric that weights robustness more heavily should be used (and what methods would perform well on that), is the subject of current research, like this research.
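As a toy illustration of how the choice of measure matters, here's a sketch comparing plain accuracy with one candidate "robust accuracy" for a linear classifier. The assumptions are mine, not from the thread: labels are +1/-1, the adversary may shift every input coordinate by at most `eps`, and the weights and data points are invented. For a linear model that worst case has a closed form, since the margin can drop by at most `eps` times the L1 norm of the weights.

```python
# Toy comparison of two performance measures for a linear classifier.
# The classifier, data and eps below are made-up illustrative values.

def margin(w, b, x, y):
    """Signed margin of a linear classifier; y is +1 or -1."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b)


def clean_accuracy(w, b, data):
    """Fraction of unperturbed points classified correctly."""
    return sum(margin(w, b, x, y) > 0 for x, y in data) / len(data)


def robust_accuracy(w, b, data, eps):
    """Fraction of points that stay correct under ANY perturbation whose
    per-coordinate size is at most eps (an L-infinity ball).

    For a linear model the worst such perturbation lowers the margin by
    exactly eps * sum(|w_i|), so no search over perturbations is needed.
    """
    worst_drop = eps * sum(abs(wi) for wi in w)
    return sum(margin(w, b, x, y) - worst_drop > 0 for x, y in data) / len(data)


w, b = [1.0, -1.0], 0.0
data = [
    ([2.0, 0.0], 1),    # far from the decision boundary
    ([0.25, 0.0], 1),   # correct, but close to the boundary
    ([0.0, 2.0], -1),
    ([0.0, 0.25], -1),
]

print(clean_accuracy(w, b, data))
print(robust_accuracy(w, b, data, eps=0.25))
```

The same classifier looks perfect under the first measure and much worse under the second, which is the kind of disagreement the debate is about. For deep nets no such closed form is available, so this only illustrates the idea and isn't a proposal for how to measure them.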

Have we tried using DL to construct 3D space, using either video or 3D images? That seems closer to what humans do than just using 2D images.