PaLM-E is a decoder-only LLM that autoregressively generates text completions given a prefix or prompt. It is an embodied language model built by combining a vision model such as ViT with a language model such as PaLM: multimodal information, such as images, is injected directly into the embedding space of the pre-trained LLM.
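To make the "injection into the embedding space" idea concrete, here is a minimal sketch, not the actual PaLM-E code: image features from a ViT-style encoder are passed through a learned linear projection into the LLM's word-embedding space and interleaved with the text token embeddings, so the decoder-only LM consumes one multimodal prefix. The names ToyViT, ToyDecoderLM, d_vit, and d_model are hypothetical placeholders.

```python
# Sketch only: toy stand-ins for ViT and PaLM; the key idea is the projector
# that maps visual features into the LM's token-embedding space.
import torch
import torch.nn as nn

d_vit, d_model, vocab_size = 256, 512, 32000

class ToyViT(nn.Module):
    """Stand-in for a ViT: maps an image to a sequence of patch embeddings."""
    def __init__(self, num_patches=16, d=d_vit):
        super().__init__()
        self.num_patches, self.d = num_patches, d
    def forward(self, images):                        # (B, C, H, W)
        b = images.shape[0]
        return torch.randn(b, self.num_patches, self.d)  # (B, P, d_vit)

class ToyDecoderLM(nn.Module):
    """Stand-in for a decoder-only LM that accepts input embeddings directly."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)
    def forward(self, inputs_embeds):                 # (B, T, d_model)
        t = inputs_embeds.shape[1]
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.blocks(inputs_embeds, mask=causal)
        return self.lm_head(h)                        # next-token logits

vit = ToyViT()
lm = ToyDecoderLM()
projector = nn.Linear(d_vit, d_model)  # learned "injection" into LM embedding space

images = torch.zeros(1, 3, 224, 224)
text_ids = torch.randint(0, vocab_size, (1, 10))      # tokenized prompt text

img_tokens = projector(vit(images))    # (1, P, d_model) continuous image "tokens"
txt_tokens = lm.embed(text_ids)        # (1, 10, d_model) text token embeddings
# Interleave: first half of the text, then the image tokens, then the rest.
prefix = torch.cat([txt_tokens[:, :5], img_tokens, txt_tokens[:, 5:]], dim=1)
logits = lm(prefix)                    # decoded autoregressively as usual
print(logits.shape)                    # (1, 5 + P + 5, vocab_size)
```

In the full model, the projector (and optionally the encoder) is trained so that the image embeddings land in a space the frozen or co-trained LLM can reason over alongside words.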
PaLM-E is a single general-purpose multimodal language model for embodied reasoning tasks, visual-language tasks, and language tasks. It transfers knowledge from visual-language domains into embodied reasoning, from robot planning in environments with complex dynamics and physical constraints to answering questions about the observable world. PaLM-E-562B can perform zero-shot multimodal chain-of-thought reasoning, can tell visually-conditioned jokes given an image, and demonstrates an array of robot-relevant multimodal capabilities including perception, visually-grounded dialogue, and planning. PaLM-E also generalizes zero-shot to multi-image prompts despite only being trained on single-image prompts, can perform math given an image with textually interleaved handwritten numbers, and can perform zero-shot question answering on temporally annotated egocentric vision.
Here is the agenda for this video:
00:00:00 What is PaLM-E?
00:03:34 What is the overall architecture of PaLM-E?
00:06:54 What is the input format for PaLM-E?
00:11:30 How are PaLM-E models trained?
00:16:03 How does PaLM-E perform on Task and Motion Planning (TAMP)?
00:22:27 How does PaLM-E perform in a table-top pushing environment?
00:28:05 How does PaLM-E perform in the mobile manipulation domain?
00:32:45 How does PaLM-E perform on General Visual-Language Tasks and General Language Tasks?