Real-Time Vision-Language Model Integration with Unitree Go2 Edu


Project Overview


This project enables the Unitree Go2 Edu quadruped robot to understand and describe its environment in real time using a Vision-Language Model (VLM). The system captures images from the robot’s front camera, processes them with Llava-7b, and generates a textual description of the scene. The results are displayed on a Flask-based web interface.

Key Features:

• Real-time scene understanding: front-camera images are described in natural language by Llava-7b running locally through Ollama.
• Gesture-based control: thumbs-up and thumbs-down gestures, detected with MediaPipe Hands, make the robot stand up or sit down.
• Web interface: a Flask server streams the live video feed and displays the latest scene description.

How It Works


  1. Capturing Real-Time Video:
    • The VideoClient API from the Unitree SDK retrieves images from the robot’s front camera.
    • Frames are decoded with OpenCV and resized for efficient handling.
  2. Scene Description Generation:
    • The captured image is sent to Llava-7b using Ollama.
    • The model generates a natural language description of the scene.
  3. Hand Gesture Recognition for Robot Control (see the sketch after this list):
    • MediaPipe Hands detects thumb gestures in the camera feed.
    • If a thumbs-up gesture is detected, the robot stands up.
    • If a thumbs-down gesture is detected, the robot sits down.
  4. Web Interface for Streaming and Description Display:
    • A Flask server runs a webpage that streams the live video feed.
    • The scene description is dynamically updated on the webpage.
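
Below is a minimal sketch of the gesture check from step 3, assuming MediaPipe Hands and a BGR frame from OpenCV. The landmark comparison is an illustrative heuristic rather than the exact rule used in the repository, and the commented-out stand/sit calls are placeholders for the Unitree sport client.

```python
import cv2
import mediapipe as mp

mp_hands = mp.solutions.hands

def classify_thumb_gesture(frame_bgr, hands):
    """Return 'up', 'down', or None for a single BGR frame (illustrative heuristic)."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    result = hands.process(rgb)
    if not result.multi_hand_landmarks:
        return None

    landmarks = result.multi_hand_landmarks[0].landmark
    thumb_tip = landmarks[mp_hands.HandLandmark.THUMB_TIP]
    thumb_ip = landmarks[mp_hands.HandLandmark.THUMB_IP]
    wrist = landmarks[mp_hands.HandLandmark.WRIST]

    # Image y grows downward, so a raised thumb sits above (smaller y than) the wrist.
    if thumb_tip.y < thumb_ip.y < wrist.y:
        return "up"
    if thumb_tip.y > thumb_ip.y > wrist.y:
        return "down"
    return None

# Usage: keep one Hands instance alive for the whole video stream.
# with mp_hands.Hands(max_num_hands=1, min_detection_confidence=0.6) as hands:
#     gesture = classify_thumb_gesture(frame, hands)
#     # "up"   -> command the robot to stand (e.g. via the Unitree sport client)
#     # "down" -> command the robot to sit
```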


System Architecture


The project consists of three main components:

1️⃣ Video Processing Module
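
The video processing module grabs JPEG samples from the Go2’s front camera and prepares them for the VLM and the gesture detector. The sketch below assumes the unitree_sdk2py Python bindings; the network interface name and target width are placeholders.

```python
import cv2
import numpy as np
from unitree_sdk2py.core.channel import ChannelFactoryInitialize
from unitree_sdk2py.go2.video.video_client import VideoClient

def make_video_client(network_interface="eth0"):
    """Initialise the DDS channel once and return a ready-to-use VideoClient."""
    ChannelFactoryInitialize(0, network_interface)
    client = VideoClient()
    client.SetTimeout(3.0)
    client.Init()
    return client

def grab_front_frame(client, width=640):
    """Fetch one JPEG sample from the front camera and return a resized BGR image."""
    code, data = client.GetImageSample()  # (error_code, JPEG byte buffer)
    if code != 0:
        raise RuntimeError(f"GetImageSample failed with code {code}")

    frame = cv2.imdecode(np.frombuffer(bytes(data), dtype=np.uint8), cv2.IMREAD_COLOR)
    scale = width / frame.shape[1]
    return cv2.resize(frame, (width, int(frame.shape[0] * scale)))
```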


2️⃣ Vision-Language Model (VLM) Integration
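
This component forwards each captured frame to Llava-7b through Ollama and returns the generated description. A minimal sketch, assuming the ollama Python client and a locally pulled llava:7b model; the prompt text is illustrative.

```python
import cv2
import ollama

def describe_scene(frame_bgr, model="llava:7b"):
    """Send one BGR frame to a local LLaVA model via Ollama and return its description."""
    ok, jpeg = cv2.imencode(".jpg", frame_bgr)
    if not ok:
        raise RuntimeError("JPEG encoding failed")

    response = ollama.chat(
        model=model,
        messages=[{
            "role": "user",
            "content": "Describe what the robot sees in this image in one short paragraph.",
            "images": [jpeg.tobytes()],
        }],
    )
    return response["message"]["content"]
```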


3️⃣ Web-Based Interface
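
The Flask server streams the live feed as MJPEG and exposes the latest description for the page to poll. A minimal sketch, assuming the capture/VLM loop updates the two module-level variables; the route names and polling scheme are placeholders.

```python
import time
import cv2
from flask import Flask, Response, jsonify

app = Flask(__name__)

# Updated by the capture / VLM loop (see the sketches above).
latest_frame = None          # most recent BGR frame from the robot camera
latest_description = "Waiting for the first description..."

def mjpeg_stream():
    """Yield the most recent frame as an MJPEG multipart stream."""
    while True:
        if latest_frame is not None:
            ok, jpeg = cv2.imencode(".jpg", latest_frame)
            if ok:
                yield (b"--frame\r\n"
                       b"Content-Type: image/jpeg\r\n\r\n" + jpeg.tobytes() + b"\r\n")
        time.sleep(0.05)

@app.route("/video_feed")
def video_feed():
    return Response(mjpeg_stream(),
                    mimetype="multipart/x-mixed-replace; boundary=frame")

@app.route("/description")
def description():
    # The web page polls this endpoint to refresh the scene text under the video.
    return jsonify({"description": latest_description})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```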



Demonstration


Below is a screenshot of the project in action, showing the robot’s real-time video feed alongside the generated description.


VLM Integration with Unitree Go2 Edu

GitHub Repository


The complete implementation is available at:
🔗 GitHub Repository