A human sees a 3D scene with objects, creatures, texture and lighting (and it evaluates the scene based on these concepts and how they relate to each other even if it's never seen green fields, sheep, dry stone walls or fog before).
The computer generally sees a set of pixel values, and takes plenty of training to distinguish between "sheep" and "field the same shade of green as usually found in images containing sheep" because it doesn't have an innate concept of animate objects and habitats, how they relate to each other and which is more important. Whilst the computer's busy seeing a white pile of stones as a false positive for the presence of sheep, the human's looking at the way the stones are piled as possible evidence of human activity and noting that the presence of droppings in the foreground might mean sheep were here recently.
(Of course, it's not entirely impossible for computer vision systems to deal with higher levels of abstraction: autonomous vehicles model the world in 3D and classify objects as vehicles in order to predict their near-future behaviour, and as signals in order to regulate their own behaviour, but that goes well beyond mere learning processes. And of course a pixel-by-pixel understanding of the world has its use cases in spotting changes in colour and texture which are so subtle that humans abstract them away, like crop discolouration on satellite images or cracks in rough surfaces.)
We're much better at abstraction than other animals too: show us a 20,000-year-old cave painting and we'll easily grasp that it was produced by humans and that the lines represent shapes of animals broadly similar to today's livestock. Same goes for 2,000-year-old marble bas-reliefs. You might well be able to train an algorithm to recognise "paintings of animals" and "carvings of animals", but you'll struggle with a training set consisting purely of photographs of real-world livestock.
"A human sees a 3D scene with objects, creatures, texture and lighting (and it evaluates the scene based on these concepts and how they relate to each other even if it's never seen green fields, sheep, dry stone walls or fog before)."
Our eyes are not that different from cameras - they also have a set of pixels that take on values. They are not as regular, and maybe the values are not as discrete, but it is not that the retina sees objects or textures - there are just some neural layers that do the pixels->objects computation.
That's actually not at all how the eye works. We saccade a tiny spot around the scene based on our semantic intent (it's how, for example, you can see a hole to the sunny outside from inside a cave, while no camera will be able to manage the white balance). Then we have specific hardware doing feature extraction in the early visual system and feeding into the vision process.
Finally the semantic interpretation feeds deeply into the vision system. For example, though we have binocular vision you can only get stereopsis via parallax basically as far as you can reach -- after that you use semantic clues in the scene to understand that a barn is bigger than a person so that the person must be closer.
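To put rough numbers on the parallax point, here is a minimal sketch using the small-angle triangulation estimate (depth error grows roughly as distance squared divided by baseline). The 6.5 cm eye separation and the 1 arcminute angular resolution are illustrative assumptions, not measured values, and the function names are made up for this example. It doesn't say where the cutoff actually is, only how quickly stereo depth gets coarse with distance.

```python
import math


def vergence_angle(baseline_m, distance_m):
    """Angle in radians between the two lines of sight to a point straight ahead."""
    return 2.0 * math.atan(baseline_m / (2.0 * distance_m))


def depth_uncertainty(baseline_m, distance_m, angular_resolution_rad):
    """Small-angle estimate of depth error: dZ ~ Z^2 * d_theta / B."""
    return distance_m ** 2 * angular_resolution_rad / baseline_m


BASELINE_M = 0.065                        # assumed eye separation
ANGULAR_RES_RAD = math.radians(1 / 60)    # assumed resolution: 1 arcminute

for z in (0.5, 2.0, 10.0, 50.0):
    angle_deg = math.degrees(vergence_angle(BASELINE_M, z))
    err = depth_uncertainty(BASELINE_M, z, ANGULAR_RES_RAD)
    print(f"distance {z:5.1f} m  vergence {angle_deg:7.3f} deg  depth error ~{err:8.3f} m")
```

Because the error scales with the square of the distance, going from 1 m to 10 m makes the estimate about 100 times coarser under these assumptions, which is where cues like "a barn is bigger than a person" would have to take over.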
Our eyes are plenty different. Above all, they are driven by the neurons behind them to scan the scene as the brain tries to figure out the details, whereas neural networks passively take whatever feed the camera captures.
See for example an owl’s head movements as it triangulates its prey’s distance.
There’s a lot more going on than just the vision part: a cascade of neural structures rather than one big uniform net, with regions dedicated to detecting edges and understanding depth, separate from and feeding into the classification region.
And we have structures somewhere that pick up differences from one scene to another, and dedicated neurons that react to change and movement in a scene independently of the brain’s classification.
Oh, and it is also apparent that some superstructure does innate detection and supersedes learning, i.e. tests suggest mammals are scared by serpents even if they were never exposed to one, while the same doesn’t happen with spiders, hinting that serpent detection and fear are hardwired and not learned. Or at least learned by evolution and not by brain neurons’ plasticity.
The part about the movements - agreed (maybe we need to add this to the machines to improve them) - but the rest is just about additional layers - machines are not restricted to just one layer either.
Yeah, obviously human visual inputs are a finite set of data points from rods and cones which might be considered roughly akin to pixels. But by "seeing" I'm clearly referring to what takes place in the visual cortex which is incredibly efficient at converting those inputs to geometry and objects/creatures/expressions with qualitative associations in a lossy manner, making heavy use of hardwired priors which are evolved rather than learned through evaluation against past sensory input (whilst at the same time apparently being entirely incapable of processing or storing the original sensory input values in a sufficiently discrete manner to replicate the pixel by pixel evaluation a computer vision system can achieve).
Nope, we have two eyes and the ability to change focus. This essentially means we are dealing with video with added depth perception. Recent motion really pops out because we are comparing what we see with what we just saw.
Self-driving cars with lidar are a much better representation of human vision than a single image. We also do well with photos, but that’s a significant step down.
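The "comparing what we see with what we just saw" part is roughly what frame differencing does in computer vision. A minimal sketch, assuming grayscale frames stored as nested lists of brightness values; the function name and the threshold are made up for illustration, and this is not a claim about how the visual system implements it:

```python
def changed_pixels(prev_frame, curr_frame, threshold=10):
    """Return (row, col) positions whose brightness changed by more than threshold."""
    changed = []
    for r, (prev_row, curr_row) in enumerate(zip(prev_frame, curr_frame)):
        for c, (before, after) in enumerate(zip(prev_row, curr_row)):
            if abs(after - before) > threshold:
                changed.append((r, c))
    return changed


# Two tiny 3x3 frames: one pixel brightens sharply, one barely flickers.
previous = [[20, 20, 20],
            [20, 20, 20],
            [20, 20, 20]]
current = [[20, 20, 20],
           [20, 90, 20],
           [20, 20, 23]]

print(changed_pixels(previous, current))
```

A single still image has no previous frame to compare against, so this whole signal is simply unavailable to a classifier that only ever sees photos.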
The AI sees data/markers/patterns that look like something it's seen before, as opposed to actually comprehending that it sees a tube of meat that people call a hot dog.
The best metaphor I can think of is the cognitive difference between navigating a transit station that has signs in your native language, and one that you spent a couple of hours learning on Duolingo - with the latter, you aren't really understanding anything, just associating a:b::x:y.
If every action is the same -- that is, if you produce some actions which would have been produced if you "conceptualized" it rather than merely "memorized" it -- isn't that identical?
The only thing we can do in life is make decisions. Regardless of how they're derived, if those decisions are identical to yours, isn't that entity "you" in some sense?
>if those decisions are identical to yours, isn't that entity "you" in some sense?
If by "decisions" you mean every single nerve impulse in response to every possible set of stimuli, then that's pretty exacting. Every wobble while standing, every mouth movement answering any possible question, etc.
Also, how do you determine if the responses are "identical?" It's not like we can rewind reality and play it back, substituting yourself for an AI. And due to quantum nondeterminism, even if you played it back with no substitution your actions will diverge over time! If you're not considered identical to yourself, how is that a useful definition/test of "identicality"?
At the required fidelity, this thought-experiment is problematic both in theory and in practice. It obscures more than it illuminates imo.
Transit is probably not extreme enough because a:b::x:y is fine.
Figures of speech are probably a better example (at least if translated literally; Duolingo teaches the equivalent phrases, so it is easy to forget that it doesn’t teach the meaning).
“Der Tropfen, der das Fass zum Überlaufen brachte”. What is the origin story that makes the English equivalent about camels, anyway?
I don't know what "conceptually" means on such a level that I can program it. But it probably has something to do with a network of differing representations we have for a hotdog. A low-level visual cortex representation (probably not too different from artificial NNs). A representation as parts arranged in a particular spatial order. A word. Related representations, like the process of making a hotdog. And so on.