---
license: mit
language:
- en
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---

# EO-1 Vision-Language-Action Model (Initialization)

A pre-initialized vision-language-action model based on Qwen2.5-VL-3B-Instruct, designed to accompany the recent LeRobot PR: https://github.com/huggingface/lerobot/pull/1971

## 🚀 Quick Start

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# Load the model and processor
model = AutoModelForCausalLM.from_pretrained(
    "IPEC-COMMUNITY/eo1-qwen2_5_vl-initial", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "IPEC-COMMUNITY/eo1-qwen2_5_vl-initial", trust_remote_code=True
)

# Ready for training - no additional setup required!
```

## 🎯 Key Features

- **Pre-configured Special Tokens**: All EO-1 robotic tokens are pre-added to the vocabulary
- **Multimodal Processing**: The integrated processor handles images, videos, text, robot states, and actions
- **Training-Ready**: Directly loadable for fine-tuning, with no modifications required
- **Based on Qwen2.5-VL-3B**: Inherits strong vision-language understanding capabilities

## 🔧 Special Tokens

The model includes pre-configured special tokens for robotic manipulation:

| Token | Purpose |
|-------|---------|
| `<\|action_start\|>` | Marks the beginning of action sequences |
| `<\|action_pad\|>` | Padding token for actions |
| `<\|action_pass\|>` | Pass-through token for actions |
| `<\|action_end\|>` | Marks the end of action sequences |
| `<\|state_start\|>` | Marks the beginning of state sequences |
| `<\|state_pad\|>` | Padding token for states |
| `<\|state_end\|>` | Marks the end of state sequences |
| `<\|vla\|>` | Vision-Language-Action task token |
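Since these tokens are already in the vocabulary, each one resolves to a single token ID. The snippet below is a minimal sketch for inspecting them and laying out an illustrative VLA sequence; the span lengths are made up here, and the exact sequence template used in training is defined by the EO-1/LeRobot code, not by this example.

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "IPEC-COMMUNITY/eo1-qwen2_5_vl-initial", trust_remote_code=True
)
tokenizer = processor.tokenizer

# Each pre-added special token maps to a single vocabulary ID
for tok in ["<|action_start|>", "<|action_pad|>", "<|action_pass|>",
            "<|action_end|>", "<|state_start|>", "<|state_pad|>",
            "<|state_end|>", "<|vla|>"]:
    print(tok, "->", tokenizer.convert_tokens_to_ids(tok))

# Illustrative layout only: a task prompt, a padded state span,
# and a padded action span (span lengths here are arbitrary)
sequence = (
    "<|vla|>Pick up the red cube."
    + "<|state_start|>" + "<|state_pad|>" * 8 + "<|state_end|>"
    + "<|action_start|>" + "<|action_pad|>" * 16 + "<|action_end|>"
)
print(tokenizer(sequence)["input_ids"])
```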
## 📊 Data Processing

The integrated processor handles multiple modalities (see the sketch after this list):

- **Images**: Adaptively resized to fit the processor's pixel budget
- **Videos**: Adaptively resized, frame by frame, to fit the same pixel budget
- **Text**: Standard tokenization with special-token support
- **Robot States**: Vectorized and tokenized
- **Actions**: Vectorized and tokenized, with denoising support
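As a rough sketch of the image/text path, assuming the processor follows the standard `transformers` text/images calling convention (the `<|vision_start|>`, `<|image_pad|>`, and `<|vision_end|>` placeholders come from the Qwen2.5-VL tokenizer; state and action vectors are handled by the training pipeline, not shown here):

```python
import numpy as np
from PIL import Image
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "IPEC-COMMUNITY/eo1-qwen2_5_vl-initial", trust_remote_code=True
)

# A dummy 640x480 camera frame; the processor resizes it adaptively
frame = Image.fromarray(np.zeros((480, 640, 3), dtype=np.uint8))

# Qwen2.5-VL-style prompt with an image placeholder plus the EO-1 task token
text = "<|vision_start|><|image_pad|><|vision_end|><|vla|>Pick up the red cube."

inputs = processor(text=[text], images=[frame], return_tensors="pt")
print({k: tuple(v.shape) for k, v in inputs.items()})
```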
## 🏗️ Model Architecture

- **Base Model**: Qwen2.5-VL-3B-Instruct
- **Vision Encoder**: Pre-trained vision transformer
- **Language Model**: 3B-parameter transformer
- **Action Projector**: Custom layers for robotic action prediction
- **Flow Matching**: Integrated denoising mechanism for action generation
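Conceptually, flow matching generates an action chunk by integrating a learned velocity field from Gaussian noise (t = 0) to data (t = 1). The sketch below illustrates only the sampling loop, not the EO-1 implementation; `velocity_fn`, the horizon, and the action dimension are placeholders:

```python
import torch

def sample_actions(velocity_fn, context, horizon=16, action_dim=7, steps=10):
    """Euler integration of a velocity field, from noise to an action chunk."""
    actions = torch.randn(horizon, action_dim)  # pure-noise initialization
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        actions = actions + dt * velocity_fn(actions, t, context)
    return actions  # denoised action chunk

# Stand-in velocity field; a trained model would predict this from
# vision-language features and the robot state
dummy_velocity = lambda a, t, ctx: -a
print(sample_actions(dummy_velocity, context=None).shape)  # torch.Size([16, 7])
```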
## 💡 Usage

Project links:

- [🤗 LeRobot: EO-1 policy](https://github.com/huggingface/lerobot/tree/main/src/lerobot/policies/eo1)
- [🚀 EO-1](https://github.com/EO-Robotics/EO-1)

## 🤝 Contributing

For issues, questions, or contributions, please visit our [GitHub repository](https://github.com/EO-Robotics/EO-1).

---

**Note**: This is an initialization model. For best results, fine-tune it on your specific robotic task data.