Though, One trivial way to do it, with NNs in any case, is just to project forward from a range of observer models and guess the observer parameters from them.
This is still the wrong sense of generalisation. What cant be guessed is why a person took consecutive pictures at given angles/etc.
Such information is necessary to resolve deep ambiguities in cases where your observer model will fail.
Eg., yesterday i looked out my window and thought i saw two people;
it was actually one with a shadow+bag.
I moved my eyes/head/body in such away so as to fit a variety of models and i was able to 'read the scene' in the end.
And there's no reason we couldn't have a deep learning system where the input data (images) included time-stamps and movement vectors, and it could be good both at easy image classification, and at choosing particular "head movements" like those you performed, to help resolve ambiguous cases.
Further food for thought: these ambiguous cases seem (do you agree?) to be very rare.
Ambiguity is the norm, it isn't rare. Almost all visual input, ie., light, is ambiguous. We (animals) use the history of our prior geometrical-light experiences (ie., walking around) to use environmental cues to resolve ambiguity.
That billions (, trillions) of images are needed to aproximate what we can do for a single instance, i think is a good guide to the magnitude of the problem.
Google the "amnes room" -- that "illusion" is how we are always seeing.
Though, One trivial way to do it, with NNs in any case, is just to project forward from a range of observer models and guess the observer parameters from them.
This is still the wrong sense of generalisation. What cant be guessed is why a person took consecutive pictures at given angles/etc.
Such information is necessary to resolve deep ambiguities in cases where your observer model will fail.
Eg., yesterday i looked out my window and thought i saw two people; it was actually one with a shadow+bag.
I moved my eyes/head/body in such away so as to fit a variety of models and i was able to 'read the scene' in the end.
That is comprehension.