
Generative AI can also produce robot actions—the core idea behind DeepMind’s Gemini Robotics project. The team has unveiled two new models that work together to let robots “think” before they act. Simulated reasoning has improved language models, and that advance may soon reach robotics.
DeepMind argues that generative AI is critical for robots because it enables broad, flexible functionality. Unlike today’s robots, which must be painstakingly trained for narrow tasks, generative systems could handle entirely new environments without reprogramming. DeepMind’s Carolina Parada noted that most robots are custom-built and take months to set up for a single task. Gemini Robotics instead uses a two-model approach: one for reasoning and one for execution.
Gemini Robotics 1.5 vs. Gemini Robotics-ER 1.5
The two models are called Gemini Robotics 1.5 and Gemini Robotics-ER 1.5. Gemini Robotics 1.5 is a vision-language-action (VLA) model that interprets visual and text input to produce robotic actions. Gemini Robotics-ER 1.5, where “ER” stands for embodied reasoning, is a vision-language model (VLM) that processes the same types of input but outputs step-by-step plans for completing complex tasks.
Gemini Robotics-ER 1.5 is the first robotics AI with simulated reasoning, scoring highly on tests for decision-making in physical environments. However, it doesn’t perform actions itself—that role is handled by Gemini Robotics 1.5.
For example, if you asked a robot to separate laundry into whites and colors, Gemini Robotics-ER 1.5 would analyze the request along with images of the clothing pile. It can also use external tools like Google Search to collect additional information. Based on this, the ER model produces natural language instructions—step-by-step directions the robot should follow to carry out the task.
Turning Instructions into Actions
Gemini Robotics 1.5, the action model, takes the step-by-step instructions from the ER model and translates them into robot movements, using visual input for guidance. It also runs its own reasoning process to decide how to carry out each step. As DeepMind's Kanishka Rao explained, humans rely on intuitive thought when completing tasks, and robots lack that intuition, so a key breakthrough in the Gemini Robotics 1.5 VLA is its ability to "think before it acts."
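The handoff between the two models can be pictured as a plan-then-act loop: the ER model turns a request and camera images into natural-language steps, and the action model carries out each step with fresh visual input. The sketch below is purely illustrative; DeepMind has not published a programming interface for Gemini Robotics 1.5, so the EmbodiedReasoner and ActionModel classes and their methods here are hypothetical stand-ins for the roles described above.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the two models described in the article.
# DeepMind has not released an API for the VLA model, so these interfaces
# are illustrative only.

@dataclass
class PlanStep:
    description: str  # a natural-language instruction, e.g. "pick up the red shirt"

class EmbodiedReasoner:
    """Plays the role of Gemini Robotics-ER 1.5: request + images -> plan."""
    def plan(self, request: str, camera_images: list[bytes]) -> list[PlanStep]:
        # In the real system, this is where the VLM reasons about the scene
        # (and may call tools such as Google Search) before emitting steps.
        raise NotImplementedError

class ActionModel:
    """Plays the role of Gemini Robotics 1.5: one plan step + vision -> motion."""
    def execute(self, step: PlanStep, camera_images: list[bytes]) -> None:
        # The VLA model does its own short reasoning about *how* to carry out
        # the step, then drives the robot's actuators.
        raise NotImplementedError

def run_task(request: str, reasoner: EmbodiedReasoner,
             actor: ActionModel, get_images) -> None:
    """Plan once, then execute each step with up-to-date camera input."""
    steps = reasoner.plan(request, get_images())
    for step in steps:
        actor.execute(step, get_images())

# Example call (with concrete implementations supplied):
# run_task("Separate the laundry into whites and colors", er, vla, capture_frames)
```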
DeepMind built both new robotics AIs on the Gemini foundation models and fine-tuned them with data for physical interaction. This design enables robots to handle more complex, multi-stage tasks, effectively giving them agent-like capabilities.
To test this system, DeepMind has deployed it on machines like the two-armed Aloha 2 and the humanoid Apollo. Unlike earlier approaches that required custom models for each robot, Gemini Robotics 1.5 can generalize across different embodiments—for example, transferring skills from Aloha 2’s grippers to Apollo’s more dexterous hands without special adjustments.
That said, practical household robots are still a distant goal. For now, only trusted testers can access Gemini Robotics 1.5, the model that controls physical machines. The ER model, however, is already available in Google AI Studio, giving developers the ability to generate robotic instructions for real-world experiments.
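As a rough illustration of what that developer access looks like, the sketch below uses Google's google-genai Python SDK to ask an ER-style model for a step-by-step plan from a scene photo. The model ID string is an assumption and may not match what Google AI Studio actually exposes; check the model list there before trying it.

```python
# pip install google-genai
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")  # key created in Google AI Studio

# A photo of the scene the robot is looking at (e.g. a laundry pile).
with open("laundry_pile.jpg", "rb") as f:
    scene = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")

response = client.models.generate_content(
    # Assumed model ID; confirm the exact ER model name in AI Studio.
    model="gemini-robotics-er-1.5-preview",
    contents=[
        scene,
        "Give numbered, step-by-step instructions a robot could follow "
        "to separate this laundry into whites and colors.",
    ],
)
print(response.text)  # natural-language plan, one step per line
```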
Read the original article on: Ars Technica
Read more: Bird-Like Robot with Novel Wings Achieves Self-Takeoff and Slow Flight
