|
Disclaimer: I'm not one of the authors, but I work in this area. You hit the nail on the head with these questions. This work is super cool, but you've named a lot of the limitations of contemporary robot learning systems.

1. It's using an object classifier. It's described here (https://github.com/ok-robot/ok-robot/tree/main/ok-robot-navi...), but if I understand it correctly, they are using a ViT model (basically an image classification model) to label images and project those labels onto a voxel grid. Then they use language embeddings from CLIP to match the language query against the voxel grid. The limitation is that if they want this to run on the robot itself, they can't use the super huge versions of these models. They could run a huge model in the cloud, but that would introduce a lot of latency.

2. It almost certainly cannot identify invalid requests. There may be requests that are not covered by their language embeddings, in which case the robot would maybe do nothing. But it doesn't appear that this system has any knowledge of physics, other than the hardware limitations of the physical controller.

3. Hidden? Almost certainly wouldn't work. The voxel labeling relies on a module that labels the voxels, and without visual info it can't label them. Also, as far as I can tell, it doesn't have very complex higher-order reasoning, e.g. that a fork is in a drawer, which is in a kitchen, which is often in the back of a house. Partially obscured? That would be subject to the limitations of the visual classifier, so it might work. ViT is very good, but it probably depends on how obscured the object is. |
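To make points 1 to 3 a bit more concrete, here's a toy sketch of the kind of lookup I'm describing. To be clear, this is not the OK-Robot code: `cosine`, `locate`, `voxel_map`, the 3-number "embeddings", and the `min_score` cutoff are all made up for illustration. A real system would store image-derived features (e.g. from CLIP) per voxel and compare them against a text embedding of the request.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def locate(query_vec, voxel_map, min_score=0.8):
    """Return (voxel, score) for the best-matching voxel.

    Returns None when no stored voxel is similar enough to the query.
    min_score is a hypothetical cutoff, not a value from any real system.
    """
    best_voxel, best_score = None, -1.0
    for voxel, feature in voxel_map.items():
        score = cosine(query_vec, feature)
        if score > best_score:
            best_voxel, best_score = voxel, score
    if best_voxel is None or best_score < min_score:
        return None
    return best_voxel, best_score


# Toy map: only voxels the camera actually saw get a feature vector.
voxel_map = {
    (2, 0, 1): [0.9, 0.1, 0.0],  # pretend this region looked like a mug
    (5, 3, 0): [0.0, 0.2, 0.9],  # pretend this region looked like a shoe
}

mug_query = [1.0, 0.0, 0.0]      # stand-in for the text embedding of "mug"
unknown_query = [0.0, 1.0, 0.0]  # stand-in for a request nothing matches

print(locate(mug_query, voxel_map))      # best match is voxel (2, 0, 1)
print(locate(unknown_query, voxel_map))  # None: nothing clears the cutoff
```

Note what the sketch can and can't do. A request that matches nothing just falls below the cutoff, which is the "robot does nothing" case in point 2; there is no reasoning about whether the request is physically sensible. And an object that was never seen (the fork in the closed drawer from point 3) simply has no entry in the map, so no query can ever retrieve it.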