---
license: mit
language:
  - en
base_model:
  - Qwen/Qwen2.5-VL-3B-Instruct
---

# EO-1 Vision-Language-Action Model (Initialization)

A pre-initialized vision-language-action model based on Qwen2.5-VL-3B-Instruct, built for the LeRobot integration PR: https://github.com/huggingface/lerobot/pull/1971

πŸš€ Quick Start

from transformers import AutoProcessor, AutoModelForCausalLM

# Load the model and processor
model = AutoModelForCausalLM.from_pretrained("IPEC-COMMUNITY/eo1-qwen2_5_vl-initial", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("IPEC-COMMUNITY/eo1-qwen2_5_vl-initial", trust_remote_code=True)

# Ready for training - no additional setup required!

## 🎯 Key Features

- **Pre-configured Special Tokens**: all EO-1 robotic tokens are pre-added to the vocabulary
- **Multimodal Processing**: the integrated processor handles images, videos, text, robot states, and actions
- **Training-Ready**: directly loadable for fine-tuning without modifications (see the sanity check below)
- **Based on Qwen2.5-VL-3B**: inherits strong vision-language understanding capabilities
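
As a quick training-readiness check (a minimal sketch, reusing the `model` and `processor` objects from the Quick Start), you can verify that the token embedding matrix already covers the extended vocabulary:

```python
# Minimal sanity check: the embedding table should already include the EO-1
# special tokens, so no resize_token_embeddings() call is needed before training.
vocab_size = len(processor.tokenizer)
emb_rows = model.get_input_embeddings().num_embeddings
assert emb_rows >= vocab_size, f"embedding rows {emb_rows} < vocab size {vocab_size}"
print(f"vocab: {vocab_size} tokens, embedding rows: {emb_rows}")
```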

πŸ”§ Special Tokens

The model includes pre-configured special tokens for robotic manipulation:

| Token | Purpose |
|-------|---------|
| `<\|action_start\|>` | Marks the beginning of action sequences |
| `<\|action_pad\|>` | Padding token for actions |
| `<\|action_pass\|>` | Pass-through token for actions |
| `<\|action_end\|>` | Marks the end of action sequences |
| `<\|state_start\|>` | Marks the beginning of state sequences |
| `<\|state_pad\|>` | Padding token for states |
| `<\|state_end\|>` | Marks the end of state sequences |
| `<\|vla\|>` | Vision-Language-Action task token |
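
A minimal sketch of looking these tokens up at runtime (reusing the `processor` from the Quick Start; `convert_tokens_to_ids` is the standard tokenizer API):

```python
# Print the vocabulary ids of the pre-added EO-1 special tokens.
special_tokens = [
    "<|action_start|>", "<|action_pad|>", "<|action_pass|>", "<|action_end|>",
    "<|state_start|>", "<|state_pad|>", "<|state_end|>", "<|vla|>",
]
for tok in special_tokens:
    print(tok, "->", processor.tokenizer.convert_tokens_to_ids(tok))
```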

πŸ“Š Data Processing

The integrated processor handles multiple modalities:

  • Images: Automatically resized to adaptive pixels
  • Videos: Automatically resized to adaptive pixels
  • Text: Standard tokenization with special token support
  • Robot States: Vectorized and tokenized
  • Actions: Vectorized and tokenized with denoising support
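
This sketch reuses the `processor` from the Quick Start and assumes the Qwen2.5-VL chat-template convention; the robot state/action interface is specific to the EO-1 processor in the LeRobot PR and is omitted here:

```python
from PIL import Image

image = Image.open("frame.png")  # hypothetical camera frame
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Pick up the red block."},
    ],
}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt")
print(inputs.keys())  # e.g. input_ids, attention_mask, pixel_values, image_grid_thw
```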

πŸ—οΈ Model Architecture

  • Base Model: Qwen2.5-VL-3B-Instruct
  • Vision Encoder: Pre-trained vision transformer
  • Language Model: 3B parameter transformer
  • Action Projector: Custom layers for robotic action prediction
  • Flow Matching: Integrated denoising mechanism for action generation
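
The exact EO-1 heads live in the LeRobot PR; the snippet below is only a generic flow-matching sketch illustrating the idea: an action chunk is sampled by integrating a learned velocity field from Gaussian noise toward the data distribution (Euler steps, hypothetical `velocity_fn`):

```python
import torch

def sample_actions(velocity_fn, action_dim, horizon, num_steps=10):
    """Integrate v(x_t, t) from t=0 (noise) to t=1 (denoised action chunk)."""
    x = torch.randn(1, horizon, action_dim)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)
        x = x + dt * velocity_fn(x, t)  # one Euler step along the learned flow
    return x
```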

πŸ’‘ Usage Project

## 🤝 Contributing

For issues, questions, or contributions, please visit our GitHub repository.


**Note**: This is an initialization model. For best results, fine-tune on your specific robotic task data.