Google’s DeepMind has announced Robotics Transformer 2 (RT-2), a first-of-its-kind vision-language-action (VLA) model that can enable robots to perform novel tasks without specific training.
Just like how language models learn general ideas and concepts from web-scale data, RT-2 uses text and images from the web to understand different real-world concepts and translate that knowledge into generalized instructions for robotic actions.
As it matures, this technology could lead to context-aware, adaptable robots that perform different tasks in different situations and environments, with far less training than currently required.
What makes DeepMind’s RT-2 unique?
Back in 2022, DeepMind debuted RT-1, a multi-task model that trained on 130,000 demonstrations and enabled Everyday Robots to perform 700-plus tasks with a 97% success rate. Now, using the robotic demonstration data from RT-1 with web datasets, the company has trained the successor of the model: RT-2.
The biggest highlight of RT-2 is that, unlike RT-1 and other models, it does not require hundreds of thousands of data points to get a robot to work. Organizations have long found specific robot training (covering every single object, environment and situation) critical to handling complex, abstract tasks in highly variable environments.
However, in this case, RT-2 learns from a small amount of robotic data to perform the complex reasoning seen in foundation models and transfer the knowledge acquired to direct robotic actions – even for tasks it’s never seen or been trained to do before.
“RT-2 shows improved generalization capabilities and semantic and visual understanding beyond the robotic data it was exposed to,” Google explains. “This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.”
Taking action without training
According to Vincent Vanhoucke, head of robotics at Google DeepMind, training a robot to throw away trash previously meant explicitly training the robot to identify trash, as well as pick it up and throw it away.
But with RT-2, which is trained on web data, there’s no need for that. The model already has a general idea of what trash is and can identify it without explicit training. It even has an idea of how to throw away the trash, even though it’s never been trained to take that action.
When dealing with seen tasks in internal tests, RT-2 performed just as well as RT-1. However, for novel, unseen scenarios, its performance almost doubled, to 62% from RT-1’s 32%.
As they advance, vision-language-action models like RT-2 could lead to context-aware robots that can reason, solve problems and interpret information to perform a diverse range of actions in the real world, depending on the situation at hand.
For instance, instead of robots performing the same repeated actions in a warehouse, enterprises could see machines that handle each object differently, considering factors such as the object’s type, weight and fragility.
According to Markets and Markets, the segment of AI-driven robotics is expected to grow from $6.9 billion in 2021 to $35.3 billion in 2026, an expected CAGR of 38.6%.
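As a quick sanity check on that growth figure, the standard compound annual growth rate formula can be applied to the two endpoints. This is a minimal sketch: the function name `cagr` is illustrative, and it assumes five compounding years (2021 to 2026).

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate as a fraction (0.10 means 10% per year)."""
    # Assumption: growth compounds once per year over the whole period.
    return (end_value / start_value) ** (1 / years) - 1


# Endpoints quoted in the article, in billions of US dollars.
rate = cagr(6.9, 35.3, 2026 - 2021)
print(f"{rate:.1%}")  # prints 38.6%
```

Under those assumptions, the result matches the 38.6% quoted in the forecast.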