A New Method Lets Robots Map a Scene and Identify Objects to Complete Tasks
Picture tidying up a cluttered kitchen, beginning with a counter scattered with sauce packets. If your aim is to clean the counter, you might gather all the packets at once. But if you want to separate the mustard packets first, you’d sort them by type. And if you were specifically looking for Grey Poupon mustard, you’d need to search even more carefully to find that exact brand.
MIT engineers have developed a method that enables robots to make intuitive, task-specific decisions. Their new system, called Clio, allows a robot to identify the important parts of a scene based on its assigned tasks. Clio processes a list of tasks in natural language, determining the necessary level of detail to interpret its surroundings and “remember” only the relevant aspects.
In tests, Clio was used in environments like a cluttered cubicle and a five-story building, where the robot segmented scenes based on tasks such as “move rack of magazines” and “get first aid kit.” The system was also tested on a quadruped robot in real-time as it explored an office building, recognizing only objects related to its task, such as retrieving a dog toy while ignoring office supplies.
A Versatile Tool for Task-Specific Robotics
Named after the Greek muse of history for its ability to remember key elements, Clio is designed for use in various environments, including search and rescue, domestic tasks, and factory work. According to Luca Carlone, associate professor in MIT’s Department of Aeronautics and Astronautics, Clio helps robots understand their surroundings and focus on what’s necessary to complete their mission.
The team presents their findings in a study published today in the journal IEEE Robotics and Automation Letters. Carlone’s co-authors include SPARK Lab members Dominic Maggio, Yun Chang, Nathan Hughes, and Lukas Schmid, as well as MIT Lincoln Laboratory researchers Matthew Trang, Dan Griffith, Carlyn Dougherty, and Eric Cristofalo.
Transitioning from Closed-Set to Open-Set Object Recognition
Advances in computer vision and natural language processing have enabled robots to identify objects, but this was previously limited to controlled “closed-set” environments with predefined objects. Recently, researchers have adopted an “open-set” approach, using deep learning to train neural networks on billions of images and text. These networks can now recognize new objects in unfamiliar scenes. However, a challenge remains in determining how to segment a scene in a task-relevant way. As Maggio notes, the level of detail should vary depending on the robot’s task to create a useful map.
With Clio, the MIT team designed robots to interpret their surroundings with detail that adjusts automatically to the task. For instance, if the task is to move a stack of books, the robot should recognize the entire stack, while it should identify just a green book when that’s the focus.
Integrating Computer Vision and Language Models for Enhanced Object Recognition
The approach combines advanced computer vision with large language models, using neural networks trained on millions of images and associated text. The team also employs mapping tools that segment an image into many small regions, which the neural network then analyzes for relevance to the task.
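The core idea of matching image segments to a task described in natural language can be sketched as follows. This is a minimal illustration, not the paper's actual pipeline: it assumes a vision-language encoder (such as CLIP) has already embedded the task text and each segment into a shared vector space, and it stands in toy vectors for those embeddings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def rank_segments_by_task(task_emb, segment_embs):
    """Sort segment indices by similarity to the task embedding, highest first."""
    scores = [cosine_similarity(task_emb, s) for s in segment_embs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores

# Toy stand-in embeddings; a real system would obtain these from a
# vision-language encoder applied to the task text and image segments.
task = np.array([1.0, 0.0, 0.0])    # e.g. "get first aid kit"
segments = [
    np.array([0.9, 0.1, 0.0]),      # first aid kit: highly relevant
    np.array([0.0, 1.0, 0.0]),      # office supplies: irrelevant
    np.array([0.5, 0.5, 0.0]),      # partially relevant clutter
]
order, scores = rank_segments_by_task(task, segments)
```

The top-ranked segments would then be kept in the robot's map while the rest are discarded, which is the filtering behavior described above.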
By applying the "information bottleneck" concept from information theory, the system compresses the image data so that only the segments relevant to the current task are retained, allowing the robot to focus on the necessary items.
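For reference, the classic information-bottleneck objective (standard information-theory notation, not drawn from the Clio paper itself) seeks a compressed representation $Z$ of an input $X$ that stays informative about a target $Y$:

```latex
\min_{p(z \mid x)} \; I(X;Z) \;-\; \beta \, I(Z;Y)
```

Here $I(\cdot;\cdot)$ denotes mutual information, and in this setting one can think of $X$ as the full set of image segments, $Y$ as the task list, and $Z$ as the retained map; the parameter $\beta$ trades off compression against task relevance.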
Clio was tested in real-world environments, such as Maggio’s cluttered apartment, where it quickly identified relevant segments for tasks like “move pile of clothes.” The system was also used in real-time on Boston Dynamics’ Spot robot, which mapped and identified objects in an office.
This method generated maps highlighting only the target objects, enabling the robot to complete tasks efficiently. Running Clio in real-time was a major advancement, as prior methods required hours for processing.
Looking ahead, the team plans to enhance Clio to handle more complex tasks, like “find survivors” or “restore power,” moving closer to a human-like understanding of tasks.
Read the original article on: TechXplore