by gxigzigxigxi 2777 days ago
To be fair to the computers, that’s about the speed we train image classifiers to run at, if not faster. It seems likely that if we were willing to tolerate a second or two of latency we might be able to devise architectures that are able to see through some of the images that confuse current architectures.

The problem is not that we're not willing to tolerate latency. The problem is that the model of how neural networks are trained is completely different from how humans learn to see. When a neural net is trained, it is shown a static image, weights are tweaked until the output is correct, and then it is shown a completely different static image and the process is repeated. Neural net learning is iterative, discrete, and supervised. Human visual learning, by contrast, is continuous and largely unsupervised. We don't see snapshots, we see continuously varying images. Furthermore, we actively interact with the world by manipulating objects and shifting our gaze, and that information is also incorporated into our visual learning. Finally, humans have very advanced feature detectors built into our brains by evolution. We don't learn to see cats, we have cat-detectors built into our brains by our DNA, which learned to detect cats because that was a useful skill in our ancestral environment, when cats were a lot bigger and could eat us. We do learn that the thing that our cat-detectors detect is called a "cat", but we don't "learn" what a cat (or a human) looks like. That's built into our brain wiring. (There are some things whose appearance we do have to learn, like cars, which obviously didn't exist in the ancestral environment. That's why all humans can tell the difference between a cat and a dog, but not everyone can identify whether something is a Honda or a Toyota.)

The point is: the process that humans go through when they learn is completely different than the process that contemporary neural nets go through. No one has yet come up with a theory that combines all of the features of human learning into an implementable algorithm. It will surely happen eventually, but there are at least a few more conceptual breakthroughs that will need to happen. Minor tweaks to back-propagation won't do it.
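
For concreteness, the training procedure described above (show one static image, nudge the weights, move on to an unrelated image) can be sketched roughly as follows. This is only a toy under stated assumptions: a single-layer logistic model stands in for a real network, each image gets one small gradient step rather than being tweaked "until the output is correct", and the names train_per_sample and predict are made up for illustration.

```python
import numpy as np

def train_per_sample(images, labels, lr=0.1, epochs=50, seed=0):
    """Per-sample SGD on a one-layer logistic model (hypothetical toy).

    images: (n, d) array, each row one flattened static image
    labels: (n,) array of 0/1 targets (the supervision signal)
    """
    rng = np.random.default_rng(seed)
    w = np.zeros(images.shape[1])
    b = 0.0
    for _ in range(epochs):
        # Images arrive one at a time, in an order unrelated to their content.
        for i in rng.permutation(len(images)):
            x, y = images[i], labels[i]
            p = 1.0 / (1.0 + np.exp(-(w @ x + b)))  # forward pass
            grad = p - y                            # d(cross-entropy)/d(logit)
            w -= lr * grad * x                      # nudge weights toward the label
            b -= lr * grad
    return w, b

def predict(w, b, images):
    """Threshold the logit at zero to get a 0/1 class."""
    return (images @ w + b > 0).astype(int)

# Toy data (an assumption): 4-pixel "images", class 1 is bright on the left.
toy_images = np.array([[1.0, 1.0, 0.0, 0.0],
                       [0.9, 0.8, 0.1, 0.0],
                       [0.0, 0.0, 1.0, 1.0],
                       [0.1, 0.0, 0.9, 1.0]])
toy_labels = np.array([1, 1, 0, 0])
w, b = train_per_sample(toy_images, toy_labels)
```

Note what the loop never sees: no frame follows from the previous one, nothing is manipulated, and every example comes with a label.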

Do you have a citation for humans having classifiers for things like cats from birth?
I think the GP didn't mean to imply that humans have an innate cat representation from birth.

It makes more sense to interpret the comment as saying that humans don't learn an internal image representation. Humans do learn representations of bridges, aircraft, cats, etc. But those are built on top of an image processing/representation system that we are born with, perhaps analogous to raster graphics.

Edit: Maybe I'm misreading the comment. What's definitely built in at birth is things like edge and orientation detectors. A zebra detector would be a surprise.

> I think the GP didn't mean to imply that humans have an innate cat representation from birth.

Actually, that is what I meant, though I don't have a reference for cat detectors per se. But there is ample evidence for innate feature detectors of comparable complexity (e.g. human facial expressions), even if the actual target is something other than cats.

I don't know about cats, but the ability of newborns to identify human faces is well documented. The details don't really matter. Whether it's cats or something else, we humans come with some very sophisticated feature-detectors hard-wired in.
There are obviously classes of objects that we don't have hard-wired detectors for. The face is one of the few that I've heard claimed as wired from birth.

But I think "feature detectors" are exactly what the earlier comment was referring to, e.g. a Gabor-wavelet-style decomposition of the retinal image. Deep learning systems have to learn those; we're born with them.
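
To make "Gabor-wavelet-style decomposition" concrete: a Gabor filter is a sinusoid windowed by a Gaussian, and a bank of them at several orientations measures how much oriented structure an image has in each direction. A minimal NumPy sketch, with the caveat that the function names, kernel size, wavelength, and sigma are arbitrary choices for illustration and not a claim about what the retina or V1 actually computes:

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Real (cosine-phase) Gabor kernel; size should be odd."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates so the sinusoid varies along direction theta.
    x_t = x * np.cos(theta) + y * np.sin(theta)
    envelope = np.exp(-(x**2 + y**2) / (2.0 * sigma**2))
    carrier = np.cos(2.0 * np.pi * x_t / wavelength)
    kernel = envelope * carrier
    return kernel - kernel.mean()  # zero mean: flat regions give no response

def gabor_energies(image, thetas, size=9, wavelength=4.0, sigma=2.0):
    """Total squared filter response of the image at each orientation."""
    # All size-by-size patches of the image ("valid" positions only).
    windows = np.lib.stride_tricks.sliding_window_view(image, (size, size))
    energies = []
    for theta in thetas:
        k = gabor_kernel(size, wavelength, theta, sigma)
        resp = np.einsum("ijkl,kl->ij", windows, k)  # correlate patch with kernel
        energies.append(float(np.sum(resp**2)))
    return energies

# Test image (an assumption): stripes whose intensity varies along x only.
stripes = np.tile(np.cos(2.0 * np.pi * np.arange(32) / 4.0), (32, 1))
energies = gabor_energies(stripes, thetas=[0.0, np.pi / 2])
```

On the striped test image, the filter whose sinusoid runs along x responds far more strongly than the one rotated by 90 degrees. A convolutional net has to discover filters like these from data; the claim in this thread is that we get something comparable for free.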

> Gabor-wavelet-style decomposition of the retinal image

Well, that's one theory. But I think it will turn out to be a lot more complex than that. One thing that I haven't seen anyone pay much attention to is feature detectors in the time domain, which we clearly have. We notice movement as a fundamental feature. Our movement detectors can actually be triggered by static images [1]. One of the ways we distinguish dogs from cats (I believe) is by the way they move. It would be a very interesting experiment to use CGI to make a dog move like a cat and vice versa and see how those are perceived.

[1] http://www.psy.ritsumei.ac.jp/~akitaoka/ICP2016.html

It's not just the time. Computers don't have higher reasoning that can cross-check the results against expectations from a more complex world model. By just flashing images, I'd guess you only activate some of the earlier parts of the human visual processing machinery. So the comparison is quite fair to the computers, since their NNs also do only primitive pattern checking without reasoning.