Look Before You Leap: Unveiling the Power of
GPT-4V in Robotic Vision-Language Planning

Yingdong Hu1, 2, 3*, Fanqi Lin1, 2, 3*, Tong Zhang1, 2, 3, Li Yi1, 2, 3, Yang Gao1, 2, 3
1Tsinghua University, 2Shanghai Artificial Intelligence Laboratory, 3Shanghai Qi Zhi Institute
*Equal Contribution


ViLa is a simple and effective method for long-horizon robotic task planning. By directly integrating vision into the reasoning and planning process, ViLa can solve a variety of complex, long-horizon tasks in both real-world and simulated settings.


ViLa solves a wide array of real-world, everyday manipulation tasks in a zero-shot manner, efficiently handling diverse open-set instructions and objects.

Abstract

In this study, we are interested in imbuing robots with the capability of physically grounded task planning. Recent advancements have shown that large language models (LLMs) possess extensive knowledge useful in robotic tasks, especially in reasoning and planning. However, LLMs are constrained by their lack of world grounding and their dependence on external affordance models to perceive environmental information, models that cannot reason jointly with the LLM. We argue that a task planner should be an inherently grounded, unified multimodal system. To this end, we introduce Robotic Vision-Language Planning (ViLa), a novel approach for long-horizon robotic planning that leverages vision-language models (VLMs) to generate a sequence of actionable steps. ViLa directly integrates perceptual data into its reasoning and planning process, enabling a profound understanding of commonsense knowledge in the visual world, including spatial layouts and object attributes. It also supports flexible multimodal goal specification and naturally incorporates visual feedback. Our extensive evaluation, conducted in both real-robot and simulated environments, demonstrates ViLa's superiority over existing LLM-based planners, highlighting its effectiveness in a wide array of open-world manipulation tasks.





Video

ViLa

Given a language instruction and the current visual observation, we leverage a VLM (GPT-4V) to comprehend the scene through chain-of-thought reasoning and generate a sequence of actionable steps. The first step of this plan is then executed by a primitive policy. The executed step is appended to the finished plan, and the process repeats with a fresh observation, yielding a closed-loop planning method for dynamic environments.
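
The following is a minimal Python sketch of this closed-loop procedure, assuming the VLM is queried through the OpenAI chat-completions API. The prompt wording, the model name, and the helpers get_observation and execute_primitive are illustrative stand-ins, not the actual ViLa implementation.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM_PROMPT = (
    "You are a robot task planner. Given the current camera image, the task "
    "instruction, and the steps already finished, reason step by step about the "
    "scene, then output the next actionable step on a final line beginning with "
    "'NEXT:'. Output 'NEXT: done' when the instruction is fully satisfied."
)


def get_observation() -> str:
    """Hypothetical stand-in: capture a camera frame and return it base64-encoded."""
    with open("current_frame.jpg", "rb") as f:
        return base64.b64encode(f.read()).decode()


def execute_primitive(step: str) -> None:
    """Hypothetical stand-in for the primitive policy that executes one step."""
    print(f"[robot] executing: {step}")


def plan_next_step(instruction: str, image_b64: str, finished_plan: list[str]) -> str:
    """Query the VLM for the next actionable step via chain-of-thought prompting."""
    user_content = [
        {"type": "text",
         "text": f"Instruction: {instruction}\nFinished steps: {finished_plan}"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
    ]
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model name
        messages=[{"role": "system", "content": SYSTEM_PROMPT},
                  {"role": "user", "content": user_content}],
        max_tokens=512,
    )
    reply = response.choices[0].message.content
    # Keep only the final actionable step; the reasoning above it is discarded.
    return reply.splitlines()[-1].removeprefix("NEXT:").strip()


def closed_loop_plan(instruction: str, max_steps: int = 20) -> list[str]:
    """Re-plan from a fresh observation after every executed step."""
    finished_plan: list[str] = []
    for _ in range(max_steps):
        step = plan_next_step(instruction, get_observation(), finished_plan)
        if step.lower() == "done":
            break
        execute_primitive(step)
        finished_plan.append(step)
    return finished_plan

Because a new image is captured before every query, the planner always reasons over the current state of the scene rather than a stale plan.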



Comprehension of Common Sense in the Visual World

ViLa excels in complex tasks that demand an understanding of spatial layouts or object attributes. This kind of commonsense knowledge pervades nearly every task of interest in robotics, but previous LLM-based planners consistently fall short in this regard.

Spatial Layouts
Object Attributes

Versatile Goal Specifications

ViLa supports flexible multimodal goal specification. It can use not only language instructions but also diverse forms of goal images, or even a blend of the two, to define objectives.
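
As a rough illustration of how such a mixed goal could be packed into a single VLM query, the sketch below assembles the current observation, an optional goal image, and an optional language goal into one message. The helper encode_image, the prompt wording, and the model name are assumptions made for illustration, not the paper's exact interface.

import base64
from typing import Optional

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def encode_image(path: str) -> str:
    """Illustrative helper: read an image file and base64-encode it."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode()


def plan_from_multimodal_goal(observation_path: str,
                              goal_image_path: Optional[str] = None,
                              goal_text: Optional[str] = None) -> str:
    """Build a single query that mixes a goal image and/or a language goal."""
    content = [
        {"type": "text", "text": "Current observation:"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{encode_image(observation_path)}"}},
    ]
    if goal_image_path is not None:
        content += [
            {"type": "text", "text": "Goal image (reach this configuration):"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{encode_image(goal_image_path)}"}},
        ]
    if goal_text is not None:
        content.append({"type": "text", "text": f"Language goal: {goal_text}"})
    content.append({"type": "text",
                    "text": "List the sequence of actionable steps needed to reach the goal."})

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model name
        messages=[{"role": "user", "content": content}],
        max_tokens=512,
    )
    return response.choices[0].message.content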

Example tasks specified by goal images (goal type: real image).

Visual Feedback

ViLa naturally incorporates visual feedback, enabling robust closed-loop planning in dynamic environments.
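
In the loop sketched above, this feedback is implicit: every query sees the latest image, so a failed or disturbed step is simply re-planned. An explicit success check can be layered on top in the same spirit; the sketch below (reusing client and the helpers from the earlier sketch) is an assumption made for illustration, not part of the paper's method.

def step_succeeded(step: str, image_b64: str) -> bool:
    """Ask the VLM whether the just-executed step is visibly complete."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # illustrative model name
        messages=[{"role": "user", "content": [
            {"type": "text",
             "text": f"Has the step '{step}' been completed in this image? Answer yes or no."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ]}],
        max_tokens=8,
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")


# Inside closed_loop_plan, the check would gate whether a step joins the finished plan:
#     execute_primitive(step)
#     if step_succeeded(step, get_observation()):
#         finished_plan.append(step)
#     # otherwise the next query sees the unchanged scene and the step is re-planned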

Stack Blocks

Pack Chip Bags

Find Cat Food

Human-Robot Interaction

Simulation Experiments

We show that ViLa can rearrange objects on the table into desired configurations specified by high-level language instructions.

Blocks & Bowls
Letters

Stack all the blocks

Put the letters on the table in alphabetical order

BibTeX

@article{hu2023look,
  title={Look Before You Leap: Unveiling the Power of GPT-4V in Robotic Vision-Language Planning},
  author={Yingdong Hu and Fanqi Lin and Tong Zhang and Li Yi and Yang Gao},
  journal={arXiv preprint arXiv:2311.17842},
  year={2023}
}