Hacker News new | ask | show | jobs
by visarga 3067 days ago
It's not just a matter of detecting objects and locating them. The deeper computer vision problem is to identify object attributes, relations between objects and actions in video. It's much harder to do that because many relations appear in very diverse situations, with objects of different categories, so it's hard to have 1000's of examples for each class of relation.

For example, humans can identify a monkey riding a Segway on the airport runway, but there probably is no such thing in the training set, even if it is quite large. The neural net might not know if that constitutes a "riding" action because it has never seen such a combination. Maybe the monkey is jumping over the thing and the picture shows it in proximity to it, not riding it - a human would know that a slight gap means there is no riding taking place.

Then, the even harder problem is to predict the consequences of actions on objects and just to physically simulate the scene. Such knowledge is useful in robot action planning. Beyond computer vision, there is also a need to create a "mental simulator" that has theory of mind and can simulate other agents (what humans intend), and we need simulators, both physical and mental to create the next level of AI.