|
|
|
|
|
by dlivingston
845 days ago
|
|
That's very cool. I have almost no experience with robotics, so excuse the silly questions: - How does it know what objects are? Does it use some sort of realtime object classifier neural net? What limitations are there here? - Does the robot know when it can't perform a request? I.e. if you ask it to move a large box or very heavy kettlebell? - How well does it do if the object is hidden or obscured? Does it go looking for it? What if it must move another object to get access to the requested one? |
|
You basically hit the nail on the head with these questions. This work is super cool, but you named a lot of the limitations with contemporary robot learning systems.
1. It's using an object classifier. It's described here (https://github.com/ok-robot/ok-robot/tree/main/ok-robot-navi...), but if I understanding it correctly basically they are using a ViT model (basically an image classification model) to do some labeling of images and projecting them onto a voxel grid. Then they are using language embeddings from CLIP to pair the language with the voxel grid. The limitations of this are that if they want this to run on the robot, they can't use the super huge versions of these models. While they could use a huge model on the cloud, that would introduce a lot of latency.
2. It almost certainly cannot identify invalid requests. There may be requests that are not covered by their language embeddings, in which case the robot would maybe do nothing. But it doesn't appear that this system has any knowledge of physics, other than the hardware limitations of the physical controller.
3. Hidden? Almost certainly wouldn't work. The voxel labeling relies on a module that labels the voxels and without visual info, it can't label them. Also, as far as I can tell, it doesn't appear to have very complex higher-order reasoning about, say, that a fork is in a drawer, which is in a kitchen, which is often in the back of a house. Partially obscured? That would be subject to the limitations of the visual classifier, so it might work. ViT is very good, but it probably depends on how obscured the object is.