FlameF0X committed · Commit 9bd8339 (verified) · Parent: c7ffc29

Update README.md

Files changed (1): README.md (+183 −1)

README.md (updated content):

pipeline_tag: image-text-to-text
library_name: transformers
tags:
- merge
---

# N2-Eye: Multimodal Conversational AI

N2-Eye is a multimodal language model that combines LiquidAI's LFM2-1.2B language model with OpenAI's CLIP vision encoder to enable image understanding and conversational capabilities.

## Model Details

- **Base Language Model**: LiquidAI/LFM2-1.2B (1.26B parameters)
- **Vision Encoder**: OpenAI CLIP-ViT-Base-Patch32
- **Model Type**: Image-Text-to-Text (Multimodal Conversational)
- **Training Dataset**: CRAG-MM Multi-Turn Public Dataset
- **License**: MIT
- **Framework**: PyTorch + Transformers

## Architecture

N2-Eye uses a modular architecture that combines:

1. **Language Model**: LFM2-1.2B for text generation and conversation
2. **Vision Encoder**: CLIP for image understanding (frozen during training)
3. **Projection Layer**: A trainable MLP that maps CLIP features to the language model's embedding space

The model processes images by:
- Encoding images with CLIP to extract visual features
- Projecting these features through a learnable projection layer
- Integrating the projected features into the language model at the special `<image>` token positions (see the sketch below)

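The following is a minimal sketch of what this projection-and-injection step can look like. The names (`VisionProjector`, `inject_image_features`) and the dimensions are illustrative assumptions, not the repository's actual classes; only the overall flow (CLIP features → MLP → `<image>` token positions) comes from the description above.

```python
import torch
import torch.nn as nn


class VisionProjector(nn.Module):
    """Illustrative two-layer MLP mapping CLIP image features into the LM embedding space."""

    def __init__(self, clip_dim: int = 512, lm_dim: int = 2048):
        # Dimensions are placeholders; use the actual CLIP output size and LFM2 hidden size.
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(clip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, lm_dim)
        return self.mlp(clip_features)


def inject_image_features(
    inputs_embeds: torch.Tensor,   # (batch, seq_len, lm_dim) token embeddings
    input_ids: torch.Tensor,       # (batch, seq_len) token ids
    image_embeds: torch.Tensor,    # (batch, lm_dim) projected CLIP features
    image_token_id: int,
) -> torch.Tensor:
    """Overwrite the embedding at each <image> token position with the projected image feature."""
    inputs_embeds = inputs_embeds.clone()  # avoid modifying the original tensor in place
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == image_token_id).nonzero(as_tuple=True)[0]
        for pos in positions:
            inputs_embeds[b, pos] = image_embeds[b]
    return inputs_embeds
```
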
## Training Details

### Dataset
- **Source**: CRAG-MM Multi-Turn Public Dataset (v0.1.1)
- **Format**: Multi-turn conversations with images
- **Preprocessing**: Conversations formatted with ChatML-style tokens

### Training Configuration
- **Batch Size**: 2 per device, with gradient accumulation of 4 steps
- **Learning Rate**: 2e-5
- **Training Length**: 1 epoch on the validation split
- **Precision**: bfloat16
- **Max Sequence Length**: 2048 tokens
- **Optimization**: Gradient checkpointing enabled

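As a rough illustration, these hyperparameters map onto Hugging Face `TrainingArguments` roughly as follows. The output directory and logging/saving settings are placeholders; the exact configuration used for N2-Eye lives in the training script, so treat this as a sketch rather than the actual setup.

```python
from transformers import TrainingArguments

# Hypothetical mapping of the listed hyperparameters onto TrainingArguments;
# the real training script may differ.
training_args = TrainingArguments(
    output_dir="./n2-eye-checkpoints",  # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,      # effective batch size of 8
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,                          # bfloat16 precision
    gradient_checkpointing=True,
    logging_steps=10,
    save_strategy="epoch",
)
```
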
### Special Tokens
- `<image>`: Placeholder for image embeddings in conversation
- System prompt: "You are a helpful assistant trained by Liquid AI. You can see and understand images."

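If you rebuild the tokenizer yourself, `<image>` has to be registered as an additional special token so it survives tokenization as a single id. A minimal sketch, assuming you start from the base LFM2 tokenizer and a standard causal LM (the actual N2-Eye setup may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed setup: start from the base LFM2 tokenizer and register <image>.
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})

language_model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-1.2B")
if num_added > 0:
    # Grow the embedding matrix so the new token id has an embedding row.
    language_model.resize_token_embeddings(len(tokenizer))

image_token_id = tokenizer.convert_tokens_to_ids("<image>")
```
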
## Usage

### Basic Inference

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, CLIPProcessor
from PIL import Image

# Load components
tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2-Eye")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load the multimodal model (requires custom loading due to the architecture);
# see the training code for the complete loading implementation.
# `model` below refers to that loaded multimodal model.

# Prepare conversation
conversation = """<|im_start|>system
You are a helpful assistant trained by Liquid AI. You can see and understand images.<|im_end|>
<image>
<|im_start|>user
What do you see in this image?<|im_end|>
<|im_start|>assistant
"""

# Process inputs
text_inputs = tokenizer(conversation, return_tensors="pt")
image = Image.open("your_image.jpg")
image_inputs = clip_processor(images=image, return_tensors="pt")

# Generate response
with torch.no_grad():
    outputs = model.generate(
        input_ids=text_inputs.input_ids,
        attention_mask=text_inputs.attention_mask,
        images=image_inputs.pixel_values,
        max_new_tokens=150,
        do_sample=True,
        temperature=0.7,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

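Since `generate` returns the prompt tokens followed by the new tokens, you may prefer to decode only the continuation. Continuing from the example above (this follow-up is plain `transformers`/PyTorch usage, independent of N2-Eye internals):

```python
# Decode only the tokens generated after the prompt.
prompt_length = text_inputs.input_ids.shape[1]
generated_only = outputs[0][prompt_length:]
answer = tokenizer.decode(generated_only, skip_special_tokens=True)
print(answer.strip())
```
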
### Chat Template

N2-Eye uses ChatML format for conversations:

```
<|im_start|>system
You are a helpful assistant trained by Liquid AI. You can see and understand images.<|im_end|>
<image>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```

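For convenience, the template can be assembled with a small helper. The function below is illustrative (not shipped with the model); it simply reproduces the format above for a single-image, single-turn prompt, leaving the assistant turn open for generation:

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant trained by Liquid AI. "
    "You can see and understand images."
)


def build_prompt(user_message: str) -> str:
    """Build a single-turn ChatML prompt with one <image> placeholder."""
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        "<image>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


print(build_prompt("What do you see in this image?"))
```
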
## Capabilities

N2-Eye can:
- Understand and describe images in detail
- Answer questions about visual content
- Engage in multi-turn conversations that reference images
- Combine visual and textual information for comprehensive responses

## Limitations

- **Image Token Handling**: Requires specific placement of `<image>` tokens in the conversation format
- **Single Image**: Currently optimized for a single image per conversation
- **Training Scale**: Trained on a limited dataset (the validation split only)
- **Frozen Vision**: The CLIP encoder is frozen, limiting adaptation to new visual domains

## Technical Implementation

### Model Architecture Classes

The implementation includes several key components:

1. **MultimodalLFM2Model**: Main model class combining language and vision
2. **CRAGMMDataset**: Dataset handler for the CRAG-MM format
3. **MultimodalTrainer**: Custom trainer for multimodal inputs

### Key Features

- **Gradient Checkpointing**: Memory-efficient training
- **Custom Collation**: Handles multimodal batch processing (see the sketch below)
- **Flexible Image Integration**: Dynamic matching of image features to token positions
- **Safe Serialization**: Custom saving to handle shared tensors

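As an illustration of what multimodal collation involves, the sketch below pads the text fields and stacks the image tensors into a batch. The function name and field layout are assumptions made for this example; the repository's actual collator may differ.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def multimodal_collate_fn(batch, pad_token_id: int):
    """Illustrative collator: pad token ids, derive attention masks, stack pixel values."""
    input_ids = pad_sequence(
        [item["input_ids"] for item in batch],
        batch_first=True,
        padding_value=pad_token_id,
    )
    # Simple mask: treat pad_token_id positions as padding.
    attention_mask = (input_ids != pad_token_id).long()
    labels = pad_sequence(
        [item["labels"] for item in batch],
        batch_first=True,
        padding_value=-100,  # ignored by the cross-entropy loss
    )
    pixel_values = torch.stack([item["pixel_values"] for item in batch])
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
        "pixel_values": pixel_values,
    }
```
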
## Requirements

```
torch
transformers
datasets
Pillow
clip-by-openai
```

## Training Your Own Version

To retrain or fine-tune N2-Eye:

1. Install the dependencies
2. Prepare your dataset in CRAG-MM format
3. Modify the configuration in the training script
4. Run the training pipeline

See the included training script for complete implementation details.

## Citation

If you use N2-Eye in your research, please cite:

```bibtex
@misc{n2eye2025,
  title={N2-Eye: Multimodal Conversational AI},
  author={GoofyLM Lab},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/GoofyLM/N2-Eye}}
}
```

## Acknowledgments

- **LiquidAI** for the LFM2-1.2B base model
- **OpenAI** for the CLIP vision encoder
- **CRAG-MM** dataset contributors for the training data
- **Hugging Face** for the transformers library and model hosting

## License

This model is released under the MIT License. See the LICENSE file for details.