Using a 3D scene segmentation [1] to yield object hypotheses that are subsequently labeled by a simple NN classifier, the robot system can talk about objects and their properties (color, size, elongation, position). Ambiguous references to objects are resolved in an interactive dialogue that asks for the most informative object property in the given situation; ultimately, pointing gestures can be used to resolve a reference. The robot system is able to pick and place objects at a target location (which may itself change), to hand over an object to the user, and to talk about the current scene state.
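The abstract does not specify how the "most informative" property is chosen; a minimal sketch, assuming discrete property labels per candidate object and an entropy-based criterion (an assumption, not the authors' stated method), could look as follows. The candidate objects and their labels are purely illustrative.

    import math
    from collections import Counter

    # Hypothetical candidate objects matching an ambiguous reference
    # (property values are assumptions for illustration only).
    candidates = [
        {"color": "red",   "size": "small", "elongation": "long",  "position": "left"},
        {"color": "red",   "size": "large", "elongation": "short", "position": "center"},
        {"color": "green", "size": "small", "elongation": "long",  "position": "right"},
    ]

    def entropy(values):
        """Shannon entropy (in bits) of the label distribution over the candidates."""
        counts = Counter(values)
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    def most_informative_property(objects):
        """Pick the property whose answer splits the candidate set most evenly,
        i.e. whose value distribution over the candidates has maximal entropy."""
        properties = objects[0].keys()
        return max(properties, key=lambda p: entropy([o[p] for o in objects]))

    print(most_informative_property(candidates))
    # -> 'position' (every candidate has a distinct position, so asking for it
    #    uniquely identifies the intended object in one dialogue turn)

Under this criterion, the system would ask about the property whose answer is expected to eliminate the most candidates; if several properties tie, any of them resolves the ambiguity equally well, and pointing gestures remain as a fallback.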
[1] A. Ückermann, R. Haschke, and H. Ritter, "Real-time 3D segmentation for human-robot interaction," in Proc. IROS, 2013, pp. 2136--2143.