Mineral Nano 1 Vision
Mineral Nano 1 Vision is a compact, efficient vision-language model with multimodal (text + image) capabilities, designed for fast inference and low-resource environments.
Model Details
- Model Name: mineral-nano-1
- Model Type: Vision-Language Model (VLM)
- Parameters: ~110M
- Context Length: 2048 tokens
- Architecture: Transformer decoder language model paired with a ViT-style vision encoder (12 layers each)
- Precision: BFloat16
- Image Resolution: 224x224
- Modalities: Text + Images
Architecture
Language Model
- Hidden Size: 768
- Intermediate Size: 3072
- Attention Heads: 12
- Layers: 12
- Vocabulary Size: 32,000 tokens
- Positional Encoding: RoPE (Rotary Position Embeddings)
Vision Encoder
- Image Size: 224x224
- Patch Size: 16x16
- Hidden Size: 768
- Layers: 12
- Image Tokens: 196 per image
- Architecture: ViT-style encoder
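Taken together, the numbers above correspond to a configuration roughly like the sketch below. The key names are illustrative assumptions for readability, not the model's actual config.json schema.
# Illustrative configuration mirroring the specs above.
# Key names are assumptions, not the shipped config.json keys.
text_config = {
    "hidden_size": 768,
    "intermediate_size": 3072,
    "num_attention_heads": 12,
    "num_hidden_layers": 12,
    "vocab_size": 32000,
    "max_position_embeddings": 2048,   # context length
    "position_embedding_type": "rope",
}
vision_config = {
    "image_size": 224,
    "patch_size": 16,
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_image_tokens": 196,
}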
Usage
Installation
pip install transformers pillow torch
Basic Image Understanding
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests
model_name = "Luke-Bergen/mineral-nano-1"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name)
# Load an image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Prepare inputs
prompt = "<image>What is in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt")
# Generate response
outputs = model.generate(**inputs, max_new_tokens=100)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
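If a GPU is available, the BFloat16 checkpoint can optionally be loaded directly in that dtype, which roughly halves memory use versus float32. This is a variant of the snippet above, not a requirement; device_map="auto" additionally needs the accelerate package installed.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "Luke-Bergen/mineral-nano-1"
processor = AutoProcessor.from_pretrained(model_name)

# Optional: load the BFloat16 weights in bfloat16 and let accelerate place
# them on the available GPU(s). Requires `pip install accelerate`.
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "<image>What is in this image?"
# `image` is the PIL image loaded in the snippet above.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(outputs[0], skip_special_tokens=True))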
Multiple Images
from PIL import Image
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
]
prompt = "<image>Describe the first image. <image>Now describe the second image."
inputs = processor(text=prompt, images=images, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0], skip_special_tokens=True))
Chat with Images
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What objects are in this image?"},
        ],
    }
]
# Apply chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
print(processor.decode(outputs[0], skip_special_tokens=True))
Local Images
from PIL import Image
# Load local image
image = Image.open("path/to/your/image.jpg")
prompt = "<image>Describe what you see in detail."
inputs = processor(text=prompt, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
Training Details
- Framework: PyTorch with Transformers
- Training Data: Text + Image pairs
- Training Duration: [Specify training time]
- Hardware: [Specify GPUs used]
- Vision Encoder: Pretrained ViT encoder, fine-tuned jointly with the language model
Capabilities
- Image description and captioning
- Visual question answering
- Object detection and recognition
- Scene understanding
- Multi-image reasoning
- OCR and text extraction from images
Limitations
- Limited to 224x224 resolution images
- Context window of 2048 tokens, including image tokens (each image consumes 196 tokens, so two images leave roughly 1,650 tokens for text)
- May struggle with fine-grained details
- Best for general image understanding tasks
- Compact size means reduced capabilities compared to larger VLMs
- Limited multilingual vision capabilities
Intended Use
This model is designed for:
- Educational purposes and learning VLM architectures
- Prototyping multimodal applications
- Low-resource deployment scenarios
- Fast inference with vision capabilities
- Mobile and edge device applications
- Personal projects requiring image understanding
Image Preprocessing
Images are automatically:
- Resized to 224x224
- Normalized with CLIP-style statistics
- Converted to RGB
- Split into 16x16 patches (196 total patches)
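The processor handles all of this internally. As a rough sketch of what that pipeline amounts to: torchvision is used here purely for illustration and is not a dependency of the model, and the normalization constants are the standard CLIP values, so treat them as an assumption; the bundled processor config is authoritative.
from PIL import Image
import torchvision.transforms as T

# Standard CLIP normalization statistics (assumed).
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

preprocess = T.Compose([
    T.Resize((224, 224)),            # resize to the model's input resolution
    T.ToTensor(),                    # HWC uint8 -> CHW float in [0, 1]
    T.Normalize(CLIP_MEAN, CLIP_STD),
])

image = Image.open("path/to/your/image.jpg").convert("RGB")  # force 3 channels
pixel_values = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)

# A 16x16 patch grid over 224x224 pixels gives (224 // 16) ** 2 = 196 patches,
# i.e. 196 image tokens per image.
print(pixel_values.shape)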
Performance Tips
- Use square images when possible for best results
- Ensure images are clear and well-lit
- Keep prompts concise and specific
- Use batch processing for multiple images
- Enable use_cache=True for faster generation
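A minimal sketch of batched inference, reusing the processor and model from the Usage section. padding=True and the left-padding note are assumptions about the bundled tokenizer rather than documented behaviour.
from PIL import Image

images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
prompts = [
    "<image>Describe this image.",
    "<image>What objects are visible?",
]

# Pad the shorter prompt so both examples fit in one batch. Decoder-only models
# usually need left padding for batched generation
# (processor.tokenizer.padding_side = "left").
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")

# use_cache=True reuses past key/value states across decoding steps.
outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True)
for sequence in outputs:
    print(processor.decode(sequence, skip_special_tokens=True))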
License
[Specify your license - e.g., MIT, Apache 2.0, etc.]
Citation
@misc{mineral-nano-1-vision,
  author    = {Luke Bergen},
  title     = {Mineral Nano 1 Vision: A Compact Vision-Language Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/Luke-Bergen/mineral-nano-1}
}
Contact
For questions or issues, please open an issue on the model repository.
Acknowledgments
This model builds upon research in vision transformers and multimodal learning.