“Scaling Robot Learning with Vision-Language-Action Models”
ABSTRACT
The last several years have witnessed tremendous progress in the capabilities of AI systems, driven largely by foundation models that scale expressive architectures with diverse data sources. While the impact of this technology on vision and language understanding is abundantly clear, its use in robotics remains in its infancy. Scaling robot learning still presents numerous open challenges, from selecting the right data to scale, to developing algorithms that can effectively fit this data for closed-loop operation in the physical world. At Physical Intelligence, we aim to tackle these questions. This talk will present our recent work on building vision-language-action models, covering topics such as architecture design, data scaling, and open research directions.
Presenter
Suraj Nair is a founding researcher at Physical Intelligence (Pi), where he focuses on scaling data-driven robotic learning. His research centers on pre-training generalist models for robotics, constructing large-scale robot datasets, and leveraging internet-scale video and language data to advance robotic learning. Prior to Pi, he was a Research Scientist at the Toyota Research Institute and completed his PhD in Computer Science at the Stanford AI Lab, advised by Professors Chelsea Finn and Silvio Savarese. He holds a Bachelor’s degree in Computer Science from Caltech. His work has been recognized with various paper awards and nominations at venues including RSS, CoRL, and ICRA.