Mineral Nano 1 Vision

Mineral Nano 1 Vision is a compact, efficient vision-language model with multimodal (text + image) capabilities, designed for fast inference in low-resource environments.

Model Details

  • Model Name: mineral-nano-1
  • Model Type: Vision-Language Model (VLM)
  • Parameters: ~110M
  • Context Length: 2048 tokens
  • Architecture: Transformer-based decoder with vision encoder (12 layers)
  • Precision: BFloat16
  • Image Resolution: 224x224
  • Modalities: Text + Images

Architecture

Language Model

  • Hidden Size: 768
  • Intermediate Size: 3072
  • Attention Heads: 12
  • Layers: 12
  • Vocabulary Size: 32,000 tokens
  • Positional Encoding: RoPE (Rotary Position Embeddings)

Vision Encoder

  • Image Size: 224x224
  • Patch Size: 16x16
  • Hidden Size: 768
  • Layers: 12
  • Image Tokens: 196 per image
  • Architecture: ViT-style encoder
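
As a quick sanity check on the figures above: the 196 image tokens follow directly from the image and patch sizes. The snippet below is plain Python arithmetic and needs no model download.

# Each 224x224 input is split into non-overlapping 16x16 patches
image_size = 224
patch_size = 16
patches_per_side = image_size // patch_size  # 14
image_tokens = patches_per_side ** 2         # 14 * 14 = 196
print(image_tokens)  # 196, matching the Image Tokens figure above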

Usage

Installation

pip install transformers pillow torch

Basic Image Understanding

from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests

model_name = "Luke-Bergen/mineral-nano-1"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name)

# Load an image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare inputs
prompt = "<image>What is in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate response
outputs = model.generate(**inputs, max_new_tokens=100)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
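
The model card lists BFloat16 as the native precision; if you have a CUDA GPU, loading in bfloat16 is optional but keeps memory use down. This is a sketch using standard Transformers arguments, not a requirement:

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_name = "Luke-Bergen/mineral-nano-1"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # matches the precision listed in Model Details
).to("cuda")

image = Image.open("path/to/your/image.jpg")
inputs = processor(text="<image>What is in this image?", images=image, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # keep tensors on the model's device
outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(outputs[0], skip_special_tokens=True))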

Multiple Images

from PIL import Image

images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg")
]

prompt = "<image>Describe the first image. <image>Now describe the second image."
inputs = processor(text=prompt, images=images, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0], skip_special_tokens=True))

Chat with Images

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What objects are in this image?"}
        ]
    }
]

# Apply chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
print(processor.decode(outputs[0], skip_special_tokens=True))

Local Images

from PIL import Image

# Load local image
image = Image.open("path/to/your/image.jpg")

prompt = "<image>Describe what you see in detail."
inputs = processor(text=prompt, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))

Training Details

  • Framework: PyTorch with Transformers
  • Training Data: Text + Image pairs
  • Training Duration: [Specify training time]
  • Hardware: [Specify GPUs used]
  • Vision Encoder: Pretrained ViT encoder, fine-tuned jointly with the language model

Capabilities

✅ Image description and captioning
✅ Visual question answering
✅ Object detection and recognition
✅ Scene understanding
✅ Multi-image reasoning
✅ OCR and text extraction from images

Limitations

  • Limited to 224x224 resolution images
  • Context window of 2048 tokens, including image tokens (a quick length check is sketched after this list)
  • May struggle with fine-grained details
  • Best for general image understanding tasks
  • Compact size means reduced capabilities compared to larger VLMs
  • Limited multilingual vision capabilities
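
Because the 2048-token budget includes the image tokens, it can help to check prompt length before generating. A minimal sketch, reusing the processor, prompt, and image from the Usage section and assuming the full sequence (text plus image placeholders) is reported in input_ids:

MAX_CONTEXT = 2048
MAX_NEW_TOKENS = 100

inputs = processor(text=prompt, images=image, return_tensors="pt")
prompt_tokens = inputs["input_ids"].shape[-1]

# Leave headroom for the tokens you plan to generate
if prompt_tokens + MAX_NEW_TOKENS > MAX_CONTEXT:
    raise ValueError(f"Prompt uses {prompt_tokens} tokens; too long for the {MAX_CONTEXT}-token context")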

Intended Use

This model is designed for:

  • Educational purposes and learning VLM architectures
  • Prototyping multimodal applications
  • Low-resource deployment scenarios
  • Fast inference with vision capabilities
  • Mobile and edge device applications
  • Personal projects requiring image understanding

Image Preprocessing

Images are automatically:

  • Resized to 224x224
  • Normalized with CLIP-style statistics
  • Converted to RGB
  • Split into 16x16 patches (196 total patches)
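
A quick way to confirm this end to end, assuming the processor returns the image tensor under the usual pixel_values key:

from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("Luke-Bergen/mineral-nano-1")
image = Image.open("path/to/your/image.jpg")  # any size or mode; preprocessing handles resize + RGB

inputs = processor(text="<image>Describe this.", images=image, return_tensors="pt")
# After resizing to 224x224 and RGB conversion, the tensor should be (batch, 3, 224, 224)
print(inputs["pixel_values"].shape)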

Performance Tips

  • Use square images when possible for best results
  • Ensure images are clear and well-lit
  • Keep prompts concise and specific
  • Use batch processing for multiple images
  • Enable use_cache=True for faster generation
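
The last two tips combined, reusing the processor and model from the Usage section; this sketch assumes the processor pads batched text prompts the way a standard tokenizer does:

from PIL import Image

images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
prompts = ["<image>Describe this image.", "<image>What objects are visible?"]

# One padded batch instead of two separate calls
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")

# use_cache=True reuses attention key/value states between decoding steps
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
for seq in outputs:
    print(processor.decode(seq, skip_special_tokens=True))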

License

[Specify your license - e.g., MIT, Apache 2.0, etc.]

Citation

@misc{mineral-nano-1-vision,
  author = {Luke Bergen},
  title = {Mineral Nano 1 Vision: A Compact Vision-Language Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Luke-Bergen/mineral-nano-1}
}

Contact

For questions or issues, please open an issue on the model repository.

Acknowledgments

This model builds upon research in vision transformers and multimodal learning.
