That's a rather poor choice for an example, considering Gemini Robotics-ER is built on a tuned version of Gemini, which is itself an LLM. And while the action model is impressive, the actual "reasoning" here is still being handled by an LLM.

From the paper [0]:

> Gemini Robotics 1.5 model family. Both Gemini Robotics 1.5 and Gemini Robotics-ER 1.5 inherit Gemini’s multimodal world knowledge.

> Agentic System Architecture. The full agentic system consists of an orchestrator and an action model that are implemented by the VLM and the VLA, respectively:

> • Orchestrator: The orchestrator processes user input and environmental feedback and controls the overall task flow. It breaks complex tasks into simpler steps that can be executed by the VLA, and it performs success detection to decide when to switch to the next step. To accomplish a user-specified task, it can leverage digital tools to access external information or perform additional reasoning steps. We use GR-ER 1.5 as the orchestrator.

> • Action model: The action model translates instructions issued by the orchestrator into low-level robot actions. It is made available to the orchestrator as a specialized tool and receives instructions via open-vocabulary natural language. The action model is implemented by the GR 1.5 model.

AI researchers have been trying to discover workable architectures for decades, and LLMs are the best we've got so far. There is no reason to believe that this exponential growth in test scores would, or even could, transfer to other architectures. In fact, the core advantage that LLMs have here is that they can be trained on vast, vast amounts of text scraped from the internet and taken from pirated books. Other model architectures that don't involve next-token prediction cannot be trained on that same bottomless data source, and learning quickly from real-world experience is still a problem we haven't solved.

[0] https://storage.googleapis.com/deepmind-media/gemini-robotic...