Thanks! Our main goal was to build a vacuum that understands the semantics of the house, so that it can "clean the kitchen" or "clean the bedroom". That meant doing machine learning, and since we were doing machine learning anyway, we figured why not try something end-to-end (E2E) instead of first doing SLAM, optical flow, etc.
If you capture a video and a SLAM map of the whole space, you could use a VQA model like Cosmos Reason offline to extract key points and descriptions. You could maybe even plan the route offline for an open-ended task like “clean kitchen”. Then you load the route, and all you need at runtime is localization and obstacle avoidance.
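Rough sketch of what I mean by planning the route offline, in case it helps. Everything here is made up for illustration: the tiny occupancy grid stands in for the SLAM map, `KEY_POINTS` stands in for whatever labelled locations the VQA model would give you, and `plan_route` is just a plain breadth-first search, not anything from a real navigation stack.

```python
from collections import deque

def plan_route(grid, start, goal):
    """Shortest path on a 4-connected occupancy grid (0 = free, 1 = blocked).

    Returns a list of (row, col) cells from start to goal, or None if
    the goal cannot be reached.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk the parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None

# Hypothetical output of the offline step: a map plus labelled key points.
GRID = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],  # a wall with a single doorway at column 2
    [0, 0, 0, 0],
]
KEY_POINTS = {"dock": (0, 0), "kitchen": (2, 0)}

# "clean kitchen" resolves to a goal cell; the route is computed once, offline.
route = plan_route(GRID, KEY_POINTS["dock"], KEY_POINTS["kitchen"])
print(route)
```

The robot would then only have to follow `route` while localizing against the same map and handling local obstacle avoidance on top.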
Aaah interesting, does stuff like this generalise to furniture moving around, different lighting conditions and so on? It also sounds like if the route gets blocked, it just won't move.