Robots that collaborate with humans must be able to identify objects used for shared tasks, for example tools such as a knife for assistance with cooking, or parts such as a screw on a factory floor. Humans communicate about objects using language and gesture, fusing information from multiple modalities over time. Existing work has addressed this problem in single modalities, such as natural language or gesture, or fused modalities in non-real-time systems, but a gap remains in creating systems that simultaneously fuse information from language and gesture over time. To address this problem, we define a multimodal Bayes' filter for interpreting referring expressions to objects. Our approach outputs a distribution over the referent object at 14Hz, updating dynamically as it receives new observations of the person's spoken words and gestures. This real-time update enables a robot to respond dynamically with backchannel feedback while a person is still communicating, pointing toward a mathematical framework for human-robot communication as a joint activity [Clark, 1996]. Moreover, our approach takes into account rich timing information in the language as words are spoken by processing incremental output from the speech recognition system, information that is traditionally ignored when a command is processed as an entire sentence. It quickly adapts when the person refers to a new object. We collected a new dataset of people referring to objects in a tabletop setting and demonstrate that our approach is able to infer the correct object with 90% accuracy. Additionally, we demonstrate that our approach enables a Baxter robot to provide backchannel responses in real time.
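To make the core idea concrete, the following is a minimal sketch, not the paper's implementation, of a Bayes filter over candidate referent objects that is updated as spoken words and pointing gestures arrive. The likelihood functions (`language_likelihood`, `gesture_likelihood`) and all object data below are illustrative assumptions, standing in for the paper's learned observation models.

```python
# Minimal illustrative sketch of a multimodal Bayes filter over referents.
# The likelihood models here are toy placeholders, not the paper's models.
import numpy as np

class ReferentFilter:
    def __init__(self, num_objects):
        # Start with a uniform belief over the candidate objects on the table.
        self.belief = np.full(num_objects, 1.0 / num_objects)

    def update(self, likelihoods):
        # Standard Bayes update: multiply the prior belief by the observation
        # likelihood for each object hypothesis, then renormalize.
        posterior = self.belief * likelihoods
        total = posterior.sum()
        if total > 0:
            self.belief = posterior / total
        return self.belief

def language_likelihood(word, object_descriptions):
    # Toy likelihood: objects whose description contains the observed word
    # are weighted higher; a real system would use a learned grounding model.
    return np.array([2.0 if word in desc else 1.0 for desc in object_descriptions])

def gesture_likelihood(pointed_location, object_positions):
    # Toy likelihood: objects closer to the pointed-at location get more weight.
    dists = np.linalg.norm(object_positions - pointed_location, axis=1)
    return np.exp(-dists)

# Usage: fuse one incremental speech observation and one gesture observation.
descriptions = ["red knife", "blue screw", "green cup"]
positions = np.array([[0.2, 0.1], [0.5, 0.4], [0.8, 0.3]])
f = ReferentFilter(len(descriptions))
f.update(language_likelihood("knife", descriptions))
f.update(gesture_likelihood(np.array([0.25, 0.12]), positions))
print(f.belief)  # highest probability should fall on the "red knife"
```

In such a filter, each new word or gesture frame triggers another multiplicative update, so the belief can be read out at any time, which is what allows a distribution to be published continuously (e.g., at 14Hz) rather than only after a complete sentence.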