
I'm very far from an expert, but:

  What part of this system understands 3 dimensional space of that kitchen?

The visual model "understands" it most readily, I'd say -- like a traditional Waymo CNN "understands" the 3D space of the road. I don't think they've explicitly given the models a pre-generated pointcloud of the space, if that's what you're asking. But maybe I'm misunderstanding?

  How does the robot closest to the refrigerator know to pass the cookies to the robot on the left?

It appears that the robot is being fed plain English instructions, just like any VLM would be. Instead of the very common `text+av => text` paradigm (classifiers, perception models, etc.) or the less common `text+av => av` paradigm (segmenters, art generators, etc.), this is `text+av => movements`.

Feeding the robots the appropriate instructions at the appropriate time is a higher-level task than is covered by this demo, but I think it's pretty clearly doable with existing AI techniques (or even just a loop).
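To make the "loop" idea concrete, here's a minimal sketch of what I mean. Everything in it is hypothetical (`Observation`, `Policy`, `run_plan`, and the `is_done` check are names I made up for illustration, not anything Figure has published): a `text+av => movements` policy is just a function from an instruction plus camera input to motor targets, and the higher-level part walks through a list of instructions.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class Observation:
    """Hypothetical stand-in for one robot's camera frames (the 'av' input)."""
    frames: Sequence[bytes]


# Hypothetical action type: a vector of joint targets.
Action = List[float]

# `text+av => movements`: instruction + observation in, motor targets out.
Policy = Callable[[str, Observation], Action]


def run_plan(
    policy: Policy,
    plan: Sequence[str],
    observe: Callable[[], Observation],
    is_done: Callable[[str, Observation], bool],
    max_steps: int = 100,
) -> List[Tuple[str, Action]]:
    """Feed each instruction to the policy until an (assumed) completion check passes."""
    log: List[Tuple[str, Action]] = []
    for instruction in plan:
        for _ in range(max_steps):
            obs = observe()
            action = policy(instruction, obs)
            log.append((instruction, action))
            if is_done(instruction, obs):
                break  # move on to the next instruction in the plan
    return log
```

In a real system the `plan` and `is_done` pieces would presumably come from another model rather than a hard-coded list, but the control flow wouldn't need to be much fancier than this.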

  How is this kind of speech to text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3d space possible locally?

If your question is "where's the GPUs", their "AI" marketing page[1] pretty clearly implies that compute is offloaded, and that only images and instructions are meaningfully "on board" each robot. I could see this violating the understanding of "totally local" that you mentioned up top, but IMHO those claims are just clarifying that the individual figures aren't controlled as one robot -- even if they ultimately employ the same hardware. Each period (7 Hz?), two sets of instructions are generated.

[1] https://www.figure.ai/ai
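For what it's worth, here's how I picture the "two sets of instructions per period" part, as a rough sketch only. The 7 Hz rate is my guess from above, and `control_loop`, the `robots` mapping, and the injectable `sleep`/`clock` parameters are all made up for illustration; I have no idea what their actual scheduler looks like.

```python
import time
from typing import Callable, Dict, List


def control_loop(
    policy: Callable[[str, float], List[float]],
    robots: Dict[str, Callable[[], float]],
    instruction: str,
    hz: float = 7.0,  # assumed rate, not a confirmed figure
    ticks: int = 3,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> List[Dict[str, List[float]]]:
    """Run a fixed-rate loop that produces one action per robot on every tick."""
    period = 1.0 / hz
    history: List[Dict[str, List[float]]] = []
    for _ in range(ticks):
        start = clock()
        # One shared policy, but each robot supplies its own observation
        # and receives its own action, so the robots are not driven as one body.
        actions = {name: policy(instruction, observe()) for name, observe in robots.items()}
        history.append(actions)
        # Sleep off whatever is left of this period, if anything.
        remaining = period - (clock() - start)
        if remaining > 0:
            sleep(remaining)
    return history
```

The point is just that "same hardware, same weights" and "two independently controlled robots" aren't in tension: the policy is called once per robot per tick with that robot's own inputs.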

  What possible combo of model types are they stringing together? Or is this something novel?

Again, I don't work in robotics at all, but have spent quite a while cataloguing all the available foundation models, and I wouldn't describe anything here as "totally novel" at the model level. Certainly impressive, but not, like, a theoretical breakthrough. Would love for an expert to correct me if I'm wrong, tho!

EDIT: Oh and finally:

  Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding?

Surely they are downplaying the difficulty of getting this set up perfectly, and they don't show us how many bad runs it took to get these flawless clips.

They are seeking to raise their valuation from ~$3B to ~$40B this month, sooooooo take that as you will ;)

https://www.reuters.com/technology/artificial-intelligence/r...