by notahacker 2854 days ago
A human sees a 3D scene with objects, creatures, texture and lighting (and evaluates the scene based on these concepts and how they relate to each other, even if they have never seen green fields, sheep, dry stone walls or fog before).

The computer generally sees a set of pixel values, and takes plenty of training to distinguish between "sheep" and "field the same shade of green as usually found in images containing sheep", because it doesn't have an innate concept of animate objects and habitats, how they relate to each other and which is more important. Whilst the computer's busy seeing a white pile of stones as a false positive for the presence of sheep, the human's looking at the way the stones are piled as possible evidence of human activity, and noting that the droppings in the foreground might mean sheep were here recently.

(Of course, it's not entirely impossible for computer vision systems to deal with higher levels of abstraction: autonomous vehicles model the world in 3D, classifying objects as vehicles in order to predict their near-future behaviour and as signals in order to regulate their own behaviour, but that goes well beyond mere learning processes. And of course a pixel-by-pixel understanding of the world has its use cases in spotting changes in colour and texture so subtle that humans abstract them away, like crop discolouration on satellite images or cracks in rough surfaces.)

We're much better at abstraction than other animals too: show us a 20,000-year-old cave painting and we'll easily grasp that it was produced by humans and that the lines represent shapes of animals broadly similar to today's livestock. Same goes for 2,000-year-old marble bas-reliefs. You might well be able to train an algorithm to recognise "paintings of animals" and "carvings of animals", but you'll struggle with a training set consisting purely of photographs of real-world livestock.

2 comments

The human visual system is a lot more hacky than you might intuitively expect. It's really, really easy to mess with.

https://en.wikipedia.org/wiki/Optical_illusion

"A human sees a 3D scene with objects, creatures, texture and lighting (and evaluates the scene based on these concepts and how they relate to each other, even if they have never seen green fields, sheep, dry stone walls or fog before)."

Our eyes are not that different from cameras - they also have a set of pixels that take on values. The pixels are not that regular, and maybe the values are not that discrete, but it is not as if the retina sees objects or textures - there are just some neural layers that do the pixels->objects computation.

That's actually not at all how the eye works. We saccade a tiny spot around the scene based on our semantic intent (it's how, for example, you can see a hole to the sunny outside from inside a cave, while no camera will be able to manage the white balance). Then we have specific hardware doing feature extraction in the early visual system and feeding into the vision process.

Finally the semantic interpretation feeds deeply into the vision system. For example, though we have binocular vision you can only get stereopsis via parallax basically as far as you can reach -- after that you use semantic clues in the scene to understand that a barn is bigger than a person so that the person must be closer.
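A rough way to see why stereo parallax weakens with distance: the sketch below computes the angle between the two eyes' lines of sight to a point straight ahead. The 0.065 m eye separation and the function names are assumptions made for illustration, and the geometry is a simplification. It shows how quickly the cue shrinks, not the distance at which the visual system stops using it.

```python
import math

# Assumed separation between the two eyes, in metres (illustrative value).
BASELINE_M = 0.065


def vergence_angle_deg(distance_m, baseline_m=BASELINE_M):
    """Angle between the two lines of sight to a point straight ahead."""
    return math.degrees(2 * math.atan(baseline_m / (2 * distance_m)))


def relative_disparity_deg(near_m, far_m, baseline_m=BASELINE_M):
    """Difference in vergence angle between a nearer and a farther point."""
    return vergence_angle_deg(near_m, baseline_m) - vergence_angle_deg(far_m, baseline_m)


if __name__ == "__main__":
    # The angle falls off roughly as 1/distance once distance >> baseline.
    for d in (0.5, 1, 5, 20, 100):
        print(f"{d:>6} m -> {vergence_angle_deg(d):.4f} degrees")
```

Comparing `relative_disparity_deg(0.5, 1.0)` with `relative_disparity_deg(50, 100)` shows the same doubling of distance producing a far smaller angular difference at range, which is where size and context cues would have to take over.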

> Our eyes are no that different from cameras

Our eyes are plenty different. Above all, they are driven by the neurons behind them to scan the scene as the brain tries to figure out the details, whereas a neural network passively takes whatever feed the camera captures.

See for example an owl’s head movements as it triangulates its prey’s distance.

There’s a lot more going on than just the vision part: it is a cascade of neural structures rather than one big uniform net, with regions dedicated to detecting edges and understanding depth that are separate from, and feed into, the classification region.

And somewhere we have structures to pick up differences from one scene to another, and dedicated neurons that react to change and movement in a scene independently of the brain’s classification.

Oh, and it is also apparent that some superstructure does innate detection and supersedes learning, i.e. tests suggest mammals are scared of serpents even if they were never exposed to one, while the same doesn’t happen with spiders, hinting that serpent detection and fear is hardwired and not learned. Or at least learned by evolution and not by the plasticity of the brain’s neurons.

The part about the movements - agreed (maybe we need to add this to the machines to improve them) - but the rest is just about additional layers - machines are not restricted to just one layer either.

Yeah, obviously human visual inputs are a finite set of data points from rods and cones which might be considered roughly akin to pixels. But by "seeing" I'm clearly referring to what takes place in the visual cortex, which is incredibly efficient at converting those inputs to geometry and objects/creatures/expressions with qualitative associations in a lossy manner, making heavy use of hardwired priors which are evolved rather than learned through evaluation against past sensory input (whilst at the same time apparently being entirely incapable of processing or storing the original sensory input values in a sufficiently discrete manner to replicate the pixel-by-pixel evaluation a computer vision system can achieve).

Nope, we have two eyes and the ability to change focus. This essentially means we are dealing with video with added depth perception. Recent motion really pops out because we are comparing what we see with what we just saw.

Self-driving cars with lidar are a much better representation of human vision than a single image. We also do well with photos, but that’s a significant step down.
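The "comparing what we see with what we just saw" idea has a crude machine analogue in frame differencing. The sketch below is a toy version on plain lists of greyscale values; the function names and the threshold of 30 are made up for illustration, and it is not a claim about how the visual system or any particular vision library does it.

```python
def motion_mask(previous, current, threshold=30):
    """Mark pixels whose brightness changed by more than `threshold`.

    Both frames are lists of rows of greyscale values (0-255) and are
    assumed to have the same shape.
    """
    return [
        [abs(cur - prev) > threshold for prev, cur in zip(prev_row, cur_row)]
        for prev_row, cur_row in zip(previous, current)
    ]


def changed_pixels(mask):
    """List (row, column) coordinates of the pixels flagged in the mask."""
    return [
        (r, c)
        for r, row in enumerate(mask)
        for c, flagged in enumerate(row)
        if flagged
    ]


# A bright blob moves one pixel to the right between two frames.
frame_a = [
    [10, 10, 10, 10],
    [10, 200, 10, 10],
    [10, 10, 10, 10],
]
frame_b = [
    [10, 10, 10, 10],
    [10, 10, 200, 10],
    [10, 10, 10, 10],
]

if __name__ == "__main__":
    print(changed_pixels(motion_mask(frame_a, frame_b)))  # [(1, 1), (1, 2)]
```

Only the pixel the blob left and the pixel it arrived at are flagged; the static background drops out, which is roughly why a single photo gives a system less to work with than a stream of frames.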