by fishbotics 844 days ago
Disclaimer: I'm not one of the authors, but I work in this area.

You basically hit the nail on the head with these questions. This work is super cool, but you named a lot of the limitations with contemporary robot learning systems.

1. It's using an object classifier. It's described here (https://github.com/ok-robot/ok-robot/tree/main/ok-robot-navi...), but if I understand it correctly, they are using a ViT model (basically an image classification model) to label images and project those labels onto a voxel grid. Then they use language embeddings from CLIP to pair the language with the voxel grid. The limitation is that if they want this to run on the robot, they can't use the super huge versions of these models. While they could use a huge model on the cloud, that would introduce a lot of latency.

2. It almost certainly cannot identify invalid requests. There may be requests that are not covered by their language embeddings, in which case the robot would maybe do nothing. But it doesn't appear that this system has any knowledge of physics, other than the hardware limitations of the physical controller.

3. Hidden? Almost certainly wouldn't work. The voxel labeling relies on a module that labels the voxels from images, and without visual info it can't label them. Also, as far as I can tell, it doesn't appear to do much higher-order reasoning, such as inferring that a fork is in a drawer, which is in a kitchen, which is often in the back of a house. Partially obscured? That would be subject to the limitations of the visual classifier, so it might work. ViT is very good, but it probably depends on how obscured the object is.
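
To make points 1 and 2 a bit more concrete, here is a rough sketch of how a language query against a labeled voxel grid could work. This is my own toy version, not the OK-Robot code: the names (`query_voxels`, `min_score`), the 3-dimensional vectors, and the 0.5 cutoff are all made up for illustration, and a real system would get its embeddings from CLIP rather than hand-written lists. The cutoff only rejects requests that match nothing in the map; it knows nothing about physics.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def query_voxels(voxel_embeddings, text_embedding, min_score=0.5):
    """Return (voxel, score) for the best-matching voxel,
    or None if no voxel reaches min_score (an assumed, arbitrary cutoff)."""
    if not voxel_embeddings:
        return None
    best = max(
        ((voxel, cosine(emb, text_embedding)) for voxel, emb in voxel_embeddings.items()),
        key=lambda pair: pair[1],
    )
    return best if best[1] >= min_score else None

# Hypothetical 3-d embeddings keyed by voxel index (stand-ins for CLIP features).
voxel_embeddings = {
    (0, 0, 0): [0.9, 0.1, 0.0],  # region that looked like a mug
    (4, 2, 1): [0.1, 0.9, 0.1],  # region that looked like a fork
    (7, 5, 0): [0.0, 0.2, 0.9],  # region that looked like a shoe
}

hit = query_voxels(voxel_embeddings, [0.2, 0.95, 0.05])   # stand-in for "fork"
miss = query_voxels(voxel_embeddings, [-0.5, -0.5, -0.5])  # matches nothing in the map
```

With these toy numbers, `hit` points at voxel `(4, 2, 1)` and `miss` is `None`, which is the "robot would maybe do nothing" case. A hidden object never gets an entry in `voxel_embeddings` at all, so no query can find it.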

2 comments

The cool thing is that there are solutions to all of these problems, if the more basic problems can be solved more reliably to prove the underlying technology works.
> While they could use a huge model on the cloud, that would introduce a lot of latency.

With all the recent work to make generative AI faster (see Groq for LLMs and fal.ai for Stable Diffusion), I wonder if the latency will become low enough to make this a non-issue, or at least good enough.

If AI/ML home systems become common for consumers before the onboard technology is capable, I could see home caching appliances for LLMs.

Like something that sits next to your router (or more likely, routers that come stock with it).

Does a robot that moves things in a home need this? The challenging decisions are (off the top of my head):

1. What am I picking up? This can be AI in the cloud, as it does not need to be real time.

2. How do I pick it up? This can also be AI in the cloud, as it does not need to be real time; the robot can take its time picking the object up.

3. After pickup, where do I put the object? Localization while moving probably needs to be done locally, but identifying where to put the object down can be done via the cloud. Again, no rush.

4. How do I put the object down? Again, the robot can take its time.

You can see in the video that the robot pauses before performing the actions after finding the object in its POV, so real time isn't a hard requirement for a lot of these.
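
The split I'm describing could be sketched roughly like this. Everything here is hypothetical: the `Step` fields, the `place` function, and all of the timing numbers are invented to illustrate the idea, not measured from the robot in the video.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    realtime: bool     # does this have to react inside a control loop?
    max_wait_s: float  # how long the robot can sit and wait for an answer

def place(step, cloud_round_trip_s):
    """Run a step locally if it is real time or the cloud is too slow for it."""
    if step.realtime or cloud_round_trip_s > step.max_wait_s:
        return "local"
    return "cloud"

# Made-up budgets for the decisions listed above.
steps = [
    Step("identify object", realtime=False, max_wait_s=5.0),
    Step("plan grasp", realtime=False, max_wait_s=5.0),
    Step("localize while moving", realtime=True, max_wait_s=0.05),
    Step("choose drop location", realtime=False, max_wait_s=5.0),
    Step("plan placement", realtime=False, max_wait_s=5.0),
]

# Assume a 1 second cloud round trip.
plan = {s.name: place(s, cloud_round_trip_s=1.0) for s in steps}
```

With a 1 second round trip, only "localize while moving" ends up local. If the round trip grew past a step's `max_wait_s`, that step would fall back to local too, which is where the latency question really bites.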