RT-2: Giving Robots Web Knowledge Through Vision-Language-Action Models

2025-01-01

Researchers at Google DeepMind have developed RT-2, a model that leverages internet-scale vision-language data to power robotic control. By representing robot actions as text tokens and co-fine-tuning state-of-the-art vision-language models on both robotic trajectory data and internet-scale vision-language tasks, RT-2 generalizes to objects and instructions absent from its robot training data. It interprets complex commands, performs multi-stage semantic reasoning, and can even repurpose objects as improvised tools, such as using a rock as a hammer. This research showcases the potential of combining large language model capabilities with robotic control, marking a significant step forward in robotics.
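
To make the action-as-text idea concrete, here is a minimal Python sketch of how a continuous robot action could be discretized into a token string. The 256-bin uniform discretization follows the scheme described in the RT-2 paper; the specific action layout, dimension bounds, helper names, and example values are illustrative assumptions, not the authors' code.

```python
import numpy as np

NUM_BINS = 256  # RT-2 discretizes each action dimension into 256 bins


def tokenize_action(action, low, high, num_bins=NUM_BINS):
    """Map a continuous action vector to per-dimension bin indices."""
    action = np.clip(action, low, high)
    # Uniform binning: scale each dimension onto [0, num_bins - 1]
    bins = np.round((action - low) / (high - low) * (num_bins - 1))
    return bins.astype(int)


def action_to_text(bins):
    """Serialize bin indices as a space-separated string, so the
    vision-language model can emit actions the way it emits words."""
    return " ".join(str(b) for b in bins)


# Hypothetical 8-D action: [terminate, dx, dy, dz, droll, dpitch, dyaw, gripper]
low = np.array([0.0, -0.1, -0.1, -0.1, -np.pi / 4, -np.pi / 4, -np.pi / 4, 0.0])
high = np.array([1.0, 0.1, 0.1, 0.1, np.pi / 4, np.pi / 4, np.pi / 4, 1.0])
action = np.array([0.0, 0.02, -0.05, 0.01, 0.1, 0.0, -0.2, 1.0])

print(action_to_text(tokenize_action(action, low, high)))
# -> "0 153 64 140 144 128 95 255"
```

At inference time, the token string generated by the model would be mapped back through the same binning to continuous values and executed by the robot controller.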
