| The demo is quite interesting, but I am mostly intrigued by the claim that it runs entirely locally on each robot. It seems to use some agentic decision making, but the article doesn't touch on that. What combination of model types are they stringing together? Or is this something novel? The article mentions that the system in each robot uses two AI models. S2 is built on a 7B-parameter open-source, open-weight VLM pretrained on internet-scale data
and the other, S1, an 80M-parameter cross-attention encoder-decoder transformer, handles low-level [motor?] control.
It feels like, although the article is quite openly technical, they are leaving out the secret sauce. So they use an open-source VLM to identify the objects on the counter, and another model to generate the mechanical motions of the robot. What part of this system understands the three-dimensional space of that kitchen? How does the robot closest to the refrigerator know to pass the cookies to the robot on the left? How is this kind of speech-to-text, visual identification, decision making, motor control, multi-robot coordination and navigation of 3D space possible locally? Figure robots, each equipped with dual low-power-consumption embedded GPUs
Is anyone skeptical? How much of this is possible vs a staged tech demo to raise funding? |
1. S2 is a 7B VLM. It takes in the camera streams (however many there are) and runs them through prompt-guided text generation, but the latent encoding is taken directly from just before the lm_head (or from a few layers before it);
2. S1 is trained from scratch: they collected a few hundred hours of teleoperation data, then retrospectively came up with the prompts for step 1;
Whether S2 is fine-tuned together with S1 is an open question. At the least there is an MLP adapter that is fine-tuned, but it could be that the whole 7B VLM is fine-tuned too.
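To make the data flow above concrete, here is a minimal NumPy sketch of how I read it: S2's hidden states are taken before the vocabulary projection, pushed through an MLP adapter, and used as the context that a small cross-attention policy (S1) attends to. Everything in it is an assumption for illustration: the function names, the toy dimensions, the random weights, and the single attention step. It is not Figure's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, all hypothetical; the real layer widths are not given in the article.
D_VLM, D_HIDDEN, D_POLICY = 64, 32, 16
D_STATE, N_ACTIONS = 12, 7

def s2_latent(image_tokens, prompt_tokens):
    """Stand-in for the VLM: return hidden states, never calling an lm_head."""
    hidden = np.concatenate([image_tokens, prompt_tokens], axis=0)
    return np.tanh(hidden)  # (n_tokens, D_VLM), no projection to vocabulary logits

def mlp_adapter(latent, w1, w2):
    """Two-layer MLP that maps VLM latents into the policy's width."""
    return np.maximum(latent @ w1, 0.0) @ w2  # (n_tokens, D_POLICY)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def s1_policy(robot_state, context, wq, wk, wv, wo):
    """One cross-attention step: the robot state queries the adapted latents."""
    q = robot_state @ wq                          # (1, D_POLICY)
    k = context @ wk                              # (n_tokens, D_POLICY)
    v = context @ wv                              # (n_tokens, D_POLICY)
    attn = softmax(q @ k.T / np.sqrt(D_POLICY))   # (1, n_tokens)
    return (attn @ v) @ wo                        # (1, N_ACTIONS)

# Random placeholder inputs and weights, just to exercise the shapes.
image_tokens = rng.normal(size=(6, D_VLM))
prompt_tokens = rng.normal(size=(4, D_VLM))
robot_state = rng.normal(size=(1, D_STATE))
w1 = rng.normal(size=(D_VLM, D_HIDDEN))
w2 = rng.normal(size=(D_HIDDEN, D_POLICY))
wq = rng.normal(size=(D_STATE, D_POLICY))
wk = rng.normal(size=(D_POLICY, D_POLICY))
wv = rng.normal(size=(D_POLICY, D_POLICY))
wo = rng.normal(size=(D_POLICY, N_ACTIONS))

context = mlp_adapter(s2_latent(image_tokens, prompt_tokens), w1, w2)
action = s1_policy(robot_state, context, wq, wk, wv, wo)
print(action.shape)
```

In this reading, the open question is only which weights receive gradients during training: just `w1`/`w2` (the adapter) plus S1, or the VLM as well.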
It looks plausible, but I am still skeptical of the generalization claim, given that it is all fine-tuned on household tasks. Then again, these days it is really difficult to understand how these models generalize.
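On whether this can run locally at all, a back-of-the-envelope estimate of weight memory helps. The parameter counts (7B and 80M) are from the article; the precisions listed are my assumption, since the article does not say how the weights are stored, and the figures ignore activations and any KV cache.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def weight_gb(n_params, dtype):
    """Weight storage in gigabytes (1 GB = 1e9 bytes), weights only."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

MODELS = {"S2 (7B VLM)": 7e9, "S1 (80M policy)": 80e6}

for name, n_params in MODELS.items():
    row = ", ".join(f"{d}: {weight_gb(n_params, d):.2f} GB" for d in BYTES_PER_PARAM)
    print(f"{name} -> {row}")
```

At fp16 the VLM's weights alone come to about 14 GB, while S1 is around 0.16 GB, so the 7B model is what determines whether the embedded GPUs can hold it; whether they can depends on their memory, which the article does not state.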