Real-Time Vision-Language Model Integration with Unitree Go2 Edu


Project Overview


This project enables the Unitree Go2 Edu quadruped robot to understand and describe its environment in real time using a Vision-Language Model (VLM). The system captures images from the robot’s front camera, processes them with Llava-7b, and generates a textual description of the scene. The results are displayed on a Flask-based web interface.

Key Features:

• Real-time scene understanding: front-camera images are described in natural language by Llava-7b running locally through Ollama.
• Gesture-based control: thumbs-up and thumbs-down gestures, detected with MediaPipe Hands, make the robot stand up or sit down.
• Web interface: a Flask server streams the live video feed and displays the latest scene description.

How It Works


  1. Capturing Real-Time Video:
    • The VideoClient API from the Unitree SDK retrieves images from the robot’s front camera.
    • Frames are decoded with OpenCV and resized for efficient handling.
  2. Scene Description Generation:
    • The captured image is sent to Llava-7b using Ollama.
    • The model generates a natural language description of the scene.
  3. Hand Gesture Recognition for Robot Control (see the sketch after this list):
    • MediaPipe Hands detects thumb gestures in the camera feed.
    • If a thumbs-up gesture is detected, the robot stands up.
    • If a thumbs-down gesture is detected, the robot sits down.
  4. Web Interface for Streaming and Description Display:
    • A Flask server runs a webpage that streams the live video feed.
    • The scene description is dynamically updated on the webpage.
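
Below is a minimal sketch of the gesture check from step 3, assuming MediaPipe Hands and a BGR frame from OpenCV. The landmark comparison is an illustrative heuristic rather than the exact rule used in the repository, and the commented-out stand/sit calls are placeholders for the Unitree sport client.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def classify_thumb_gesture(frame_bgr, hands):
    """Return 'up', 'down', or None for a single BGR frame (illustrative heuristic)."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = hands.process(rgb)
    if not result.multi_hand_landmarks:
        return None

    landmarks = result.multi_hand_landmarks[0].landmark
    thumb_tip = landmarks[mp_hands.HandLandmark.THUMB_TIP]
    thumb_ip = landmarks[mp_hands.HandLandmark.THUMB_IP]
    wrist = landmarks[mp_hands.HandLandmark.WRIST]

    # Image y grows downward, so a raised thumb sits above (smaller y than) the wrist.
    if thumb_tip.y < thumb_ip.y < wrist.y:
        return "up"
    if thumb_tip.y > thumb_ip.y > wrist.y:
        return "down"
    return None

# Usage: keep one Hands instance alive for the whole video stream.
# with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.6) as hands:
#     gesture = classify_thumb_gesture(frame, hands)
#     # "up"   -> command the robot to stand (e.g. via the Unitree sport client)
#     # "down" -> command the robot to sit
```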


System Architecture


The project consists of three main components:

1️⃣ Video Processing Module
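
The video processing module grabs JPEG samples from the Go2’s front camera and prepares them for the VLM and the gesture detector. The sketch below assumes the unitree_sdk2py Python bindings; the network interface name and target width are placeholders.

```python
import cv2
import numpy as np
from unitree_sdk2py.core.channel import ChannelFactoryInitialize
from unitree_sdk2py.go2.video.video_client import VideoClient

def make_video_client(network_interface="eth0"):
    """Initialise the DDS channel once and return a ready-to-use VideoClient."""
    ChannelFactoryInitialize(0, network_interface)
    client = VideoClient()
    client.SetTimeout(3.0)
    client.Init()
    return client

def grab_front_frame(client, width=640):
    """Fetch one JPEG sample from the front camera and return a resized BGR image."""
    code, data = client.GetImageSample()  # (error_code, JPEG byte buffer)
    if code != 0:
        raise RuntimeError(f"GetImageSample failed with code {code}")

    frame = cv2.imdecode(np.frombuffer(bytes(data), dtype=np.uint8), cv2.IMREAD_COLOR)
    scale = width / frame.shape[1]
    return cv2.resize(frame, (width, int(frame.shape[0] * scale)))
```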


2️⃣ Vision-Language Model (VLM) Integration
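
This component forwards each captured frame to Llava-7b through Ollama and returns the generated description. A minimal sketch, assuming the ollama Python client and a locally pulled llava:7b model; the prompt text is illustrative.

```python
import cv2
import ollama

def describe_scene(frame_bgr, model="llava:7b"):
    """Send one BGR frame to a local LLaVA model via Ollama and return its description."""
    ok, jpeg = cv2.imencode(".jpg", frame_bgr)
    if not ok:
        raise RuntimeError("JPEG encoding failed")

    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": "Describe what the robot sees in this image in one short paragraph.",
            "images": [jpeg.tobytes()],
        }],
    )
    return response["message"]["content"]
```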


3️⃣ Web-Based Interface
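
The Flask server streams the live feed as MJPEG and exposes the latest description for the page to poll. A minimal sketch, assuming the capture/VLM loop updates the two module-level variables; the route names and polling scheme are placeholders.

```python
import time
import cv2
from flask import Flask, Response, jsonify

app = Flask(__name__)

# Updated by the capture / VLM loop (see the sketches above).
latest_frame = None          # most recent BGR frame from the robot camera
latest_description = "Waiting for the first description..."

def mjpeg_stream():
    """Yield the most recent frame as an MJPEG multipart stream."""
    while True:
        if latest_frame is not None:
            ok, jpeg = cv2.imencode(".jpg", latest_frame)
            if ok:
                yield (b"--frame\r\n"
                       b"Content-Type: image/jpeg\r\n\r\n" + jpeg.tobytes() + b"\r\n")
        time.sleep(0.05)

@app.route("/video_feed")
def video_feed():
    return Response(mjpeg_stream(),
                    mimetype="multipart/x-mixed-replace; boundary=frame")

@app.route("/description")
def description():
    # The web page polls this endpoint to refresh the scene text under the video.
    return jsonify({"description": latest_description})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```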



Demonstration


Below is a screenshot of the project in action, showing the robot’s real-time video feed alongside the generated description.


VLM Integration with Unitree Go2 Edu

GitHub Repository


The complete implementation is available at:
🔗 GitHub Repository