One way to have a robot perform a task is to demonstrate it. Another is to give the robot a verbal instruction, in any language.
When asked to bring a can, the robot understands the request, identifies the correct can, navigates to the counter, searches for a matching item, picks it up, and brings it back to the person. This sequence combines language understanding, scene perception, navigation, and manipulation into a single coherent behavior. *No teleoperation is used.*
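The sequence above can be sketched as a small pipeline. This is a minimal illustration and not the robot's actual software: `StubRobot`, `parse_instruction`, and `fetch` are hypothetical names, the keyword parser handles only simple English requests containing "bring", and the "scene" is a plain dictionary standing in for real perception.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    action: str
    target: str


@dataclass
class StubRobot:
    """Hypothetical stand-in for a robot; it only records the steps it is asked to do."""
    scene: dict                      # place name -> list of object labels visible there
    log: list = field(default_factory=list)
    location: str = "person"
    holding: Optional[str] = None

    def navigate(self, place: str) -> None:
        self.location = place
        self.log.append(f"navigate:{place}")

    def find(self, label: str) -> bool:
        # Scene perception, reduced to a lookup in the stub scene.
        found = label in self.scene.get(self.location, [])
        self.log.append(f"find:{label}:{found}")
        return found

    def pick(self, label: str) -> None:
        self.scene[self.location].remove(label)
        self.holding = label
        self.log.append(f"pick:{label}")

    def hand_over(self) -> Optional[str]:
        item, self.holding = self.holding, None
        self.log.append(f"hand_over:{item}")
        return item


def parse_instruction(text: str) -> Task:
    """Toy language understanding: assumes the last word names the object."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    if "bring" not in words:
        raise ValueError(f"unsupported instruction: {text!r}")
    return Task(action="bring", target=words[-1])


def fetch(robot: StubRobot, instruction: str,
          place: str = "counter", person: str = "person") -> Optional[str]:
    """Understand, navigate, search, pick up, return. None if nothing matches."""
    task = parse_instruction(instruction)
    robot.navigate(place)
    if not robot.find(task.target):
        robot.navigate(person)       # come back empty-handed
        return None
    robot.pick(task.target)
    robot.navigate(person)
    return robot.hand_over()


if __name__ == "__main__":
    robot = StubRobot(scene={"counter": ["cup", "can"]})
    print(fetch(robot, "Please bring me a can."))
    print(robot.log)
```

Each stage is a separate method so that a real language model, perception stack, or motion planner could replace the corresponding stub without changing the overall control flow in `fetch`.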
Because a spoken request is all the robot needs, this approach lowers the barrier to human-robot collaboration and lets non-experts deploy humanoid robots effectively in real-world environments.