---
license: mit
language:
  - en
base_model:
  - Qwen/Qwen2.5-VL-3B-Instruct
---

# EO-1 Vision-Language-Action Model (Initialization)

A pre-initialized vision-language-action model based on Qwen2.5-VL-3B-Instruct, built for the LeRobot integration PR: https://github.com/huggingface/lerobot/pull/1971

πŸš€ Quick Start

from transformers import AutoProcessor, AutoModelForCausalLM

# Load the model and processor
model = AutoModelForCausalLM.from_pretrained("IPEC-COMMUNITY/eo1-qwen2_5_vl-initial", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("IPEC-COMMUNITY/eo1-qwen2_5_vl-initial", trust_remote_code=True)

# Ready for training - no additional setup required!

## 🎯 Key Features

- **Pre-configured Special Tokens**: all EO-1 robotic tokens are pre-added to the vocabulary
- **Multimodal Processing**: the integrated processor handles images, videos, text, robot states, and actions
- **Training-Ready**: directly loadable for fine-tuning without modifications (see the sanity check below)
- **Based on Qwen2.5-VL-3B**: inherits strong vision-language understanding capabilities
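
As a quick training-readiness check (a minimal sketch, reusing the `model` and `processor` objects from the Quick Start), you can verify that the token embedding matrix already covers the extended vocabulary:

```python
# Minimal sanity check: the embedding table should already include the EO-1
# special tokens, so no resize_token_embeddings() call is needed before training.
vocab_size = len(processor.tokenizer)
emb_rows = model.get_input_embeddings().num_embeddings
assert emb_rows >= vocab_size, f"embedding rows {emb_rows} < vocab size {vocab_size}"
print(f"vocab: {vocab_size} tokens, embedding rows: {emb_rows}")
```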

πŸ”§ Special Tokens

The model includes pre-configured special tokens for robotic manipulation:

| Token | Purpose |
|-------|---------|
| `<\|action_start\|>` | Marks the beginning of action sequences |
| `<\|action_pad\|>` | Padding token for actions |
| `<\|action_pass\|>` | Pass-through token for actions |
| `<\|action_end\|>` | Marks the end of action sequences |
| `<\|state_start\|>` | Marks the beginning of state sequences |
| `<\|state_pad\|>` | Padding token for states |
| `<\|state_end\|>` | Marks the end of state sequences |
| `<\|vla\|>` | Vision-Language-Action task token |
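
A minimal sketch of looking these tokens up at runtime (reusing the `processor` from the Quick Start; `convert_tokens_to_ids` is the standard tokenizer API):

```python
# Print the vocabulary ids of the pre-added EO-1 special tokens.
special_tokens = [
    "<|action_start|>", "<|action_pad|>", "<|action_pass|>", "<|action_end|>",
    "<|state_start|>", "<|state_pad|>", "<|state_end|>", "<|vla|>",
]
for tok in special_tokens:
    print(tok, "->", processor.tokenizer.convert_tokens_to_ids(tok))
```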

πŸ“Š Data Processing

The integrated processor handles multiple modalities:

  • Images: Automatically resized to adaptive pixels
  • Videos: Automatically resized to adaptive pixels
  • Text: Standard tokenization with special token support
  • Robot States: Vectorized and tokenized
  • Actions: Vectorized and tokenized with denoising support
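
This sketch reuses the `processor` from the Quick Start and assumes the Qwen2.5-VL chat-template convention; the robot state/action interface is specific to the EO-1 processor in the LeRobot PR and is omitted here:

```python
from PIL import Image

image = Image.open("frame.png")  # hypothetical camera frame
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Pick up the red block."},
    ],
}]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt")
print(inputs.keys())  # e.g. input_ids, attention_mask, pixel_values, image_grid_thw
```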

πŸ—οΈ Model Architecture

  • Base Model: Qwen2.5-VL-3B-Instruct
  • Vision Encoder: Pre-trained vision transformer
  • Language Model: 3B parameter transformer
  • Action Projector: Custom layers for robotic action prediction
  • Flow Matching: Integrated denoising mechanism for action generation
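
The exact EO-1 heads live in the LeRobot PR; the snippet below is only a generic flow-matching sketch illustrating the idea: an action chunk is sampled by integrating a learned velocity field from Gaussian noise toward the data distribution (Euler steps, hypothetical `velocity_fn`):

```python
import torch

def sample_actions(velocity_fn, action_dim, horizon, num_steps=10):
    """Integrate v(x_t, t) from t=0 (noise) to t=1 (denoised action chunk)."""
    x = torch.randn(1, horizon, action_dim)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((1,), i * dt)
        x = x + dt * velocity_fn(x, t)  # one Euler step along the learned flow
    return x
```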

πŸ’‘ Usage Project

## 🤝 Contributing

For issues, questions, or contributions, please visit our GitHub repository.


**Note**: This is an initialization model. For best results, fine-tune on your specific robotic task data.