EO-1 Vision-Language-Action Model (Initialization)

A pre-initialized vision-language-action model based on Qwen2.5-VL-3B-Instruct, designed for use with a recent LeRobot pull request: https://github.com/huggingface/lerobot/pull/1971

🚀 Quick Start

from transformers import AutoProcessor, AutoModelForCausalLM

# Load the model and processor
model = AutoModelForCausalLM.from_pretrained("IPEC-COMMUNITY/eo1-qwen2_5_vl-initial", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("IPEC-COMMUNITY/eo1-qwen2_5_vl-initial", trust_remote_code=True)

# Ready for training - no additional setup required!

🎯 Key Features

  • Pre-configured Special Tokens: All EO-1 robotic tokens are pre-added to the vocabulary
  • Multimodal Processing: Integrated processor handles images, videos, text, robot states, and actions
  • Training-Ready: Directly loadable for fine-tuning without modifications
  • Based on Qwen2.5-VL-3B: Inherits strong vision-language understanding capabilities

🔧 Special Tokens

The model includes pre-configured special tokens for robotic manipulation:

Token               Purpose
<|action_start|>    Marks the beginning of action sequences
<|action_pad|>      Padding token for actions
<|action_pass|>     Pass-through token for actions
<|action_end|>      Marks the end of action sequences
<|state_start|>     Marks the beginning of state sequences
<|state_pad|>       Padding token for states
<|state_end|>       Marks the end of state sequences
<|vla|>             Vision-Language-Action task token
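
A quick sanity check that these tokens are actually present in a loaded vocabulary can be sketched as follows. This is a minimal illustration, not part of the model's code: `find_missing_tokens` and `EO1_SPECIAL_TOKENS` are hypothetical names, and the snippet runs on a toy vocabulary so that it works without downloading anything. With the real model, the assumption is that the processor exposes its tokenizer as `processor.tokenizer`, so the vocabulary would come from `processor.tokenizer.get_vocab()`.

```python
from typing import Iterable, List, Mapping

# Token strings copied from the table above.
EO1_SPECIAL_TOKENS = (
    "<|action_start|>",
    "<|action_pad|>",
    "<|action_pass|>",
    "<|action_end|>",
    "<|state_start|>",
    "<|state_pad|>",
    "<|state_end|>",
    "<|vla|>",
)


def find_missing_tokens(
    vocab: Mapping[str, int],
    required: Iterable[str] = EO1_SPECIAL_TOKENS,
) -> List[str]:
    """Return the required tokens that are absent from a token-to-id mapping."""
    return [token for token in required if token not in vocab]


# Toy vocabulary that deliberately omits one token (illustration only).
# With the real model (assumption): vocab = processor.tokenizer.get_vocab()
toy_vocab = {
    token: index
    for index, token in enumerate(EO1_SPECIAL_TOKENS)
    if token != "<|vla|>"
}
missing = find_missing_tokens(toy_vocab)
print(missing)
```

An empty result on the real vocabulary would indicate that no tokens need to be added before training.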

📊 Data Processing

The integrated processor handles multiple modalities:

  • Images: Automatically resized to an adaptive resolution
  • Videos: Automatically resized to an adaptive resolution
  • Text: Standard tokenization with special token support
  • Robot States: Vectorized and tokenized
  • Actions: Vectorized and tokenized with denoising support
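
To give a feel for what adaptive resizing can look like, here is a minimal sketch, assuming the processor follows the dynamic-resolution idea of its Qwen2.5-VL base: each side is snapped to a multiple of a patch factor while the total pixel count is kept inside a budget. The function name `adaptive_resize` and the default values for `factor`, `min_pixels`, and `max_pixels` are assumptions chosen for illustration, not values read from this model's configuration.

```python
import math
from typing import Tuple


def adaptive_resize(
    height: int,
    width: int,
    factor: int = 28,                 # assumed patch-grid multiple
    min_pixels: int = 56 * 56,        # assumed lower pixel budget
    max_pixels: int = 1280 * 28 * 28, # assumed upper pixel budget
) -> Tuple[int, int]:
    """Pick a target size whose sides are multiples of `factor` and whose
    area stays within [min_pixels, max_pixels], roughly keeping aspect ratio."""
    # Snap each side to the nearest multiple of `factor` (at least one patch).
    new_h = max(factor, round(height / factor) * factor)
    new_w = max(factor, round(width / factor) * factor)

    if new_h * new_w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        new_h = math.floor(height / scale / factor) * factor
        new_w = math.floor(width / scale / factor) * factor
    elif new_h * new_w < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        new_h = math.ceil(height * scale / factor) * factor
        new_w = math.ceil(width * scale / factor) * factor
    return new_h, new_w


small = adaptive_resize(480, 640)    # already inside the budget
large = adaptive_resize(3000, 4000)  # scaled down to fit the budget
print(small, large)
```

In practice the integrated processor performs this step internally, so the sketch is only meant to show why input images of different sizes end up with different token counts.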

🏗️ Model Architecture

  • Base Model: Qwen2.5-VL-3B-Instruct
  • Vision Encoder: Pre-trained vision transformer
  • Language Model: 3B parameter transformer
  • Action Projector: Custom layers for robotic action prediction
  • Flow Matching: Integrated denoising mechanism for action generation
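
For readers unfamiliar with flow matching, the denoising idea can be sketched generically: start from Gaussian noise and integrate a velocity field over time until an action vector emerges. The snippet below is a toy illustration under stated assumptions, not EO-1's implementation. It assumes a straight-line path with noise at t = 0 and the action at t = 1, uses plain Euler integration, and replaces the learned network with a hand-written velocity field (`toy_velocity`) that points at a known target. All names here are hypothetical.

```python
import random
from typing import Callable, List

Vector = List[float]


def euler_sample(
    velocity: Callable[[Vector, float], Vector],
    noise: Vector,
    num_steps: int = 10,
) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(noise)
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step / num_steps
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(target: Vector) -> Callable[[Vector, float], Vector]:
    """Stand-in for a learned network: the velocity that carries x to `target`
    along a straight line by t = 1. A real model would predict this from
    images, text, and robot state instead of knowing the target."""
    def velocity(x: Vector, t: float) -> Vector:
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


target_action = [0.5, -0.2, 0.1]      # made-up 3-D action for illustration
rng = random.Random(0)                # fixed seed for repeatability
noise = [rng.gauss(0.0, 1.0) for _ in target_action]
sampled_action = euler_sample(toy_velocity(target_action), noise)
print(sampled_action)
```

Because the toy field is exact, the sample lands on the target up to floating-point error; with a learned field, the number of steps trades off speed against accuracy.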

💡 Usage Project

🀝 Contributing

For issues, questions, or contributions, please visit our GitHub repository.


Note: This is an initialization model. For best results, fine-tune on your specific robotic task data.
