Mineral Nano 1 Vision

Mineral Nano 1 Vision is a compact, efficient vision-language model with multimodal (text + image) capabilities, designed for fast inference in low-resource environments.

Model Details

  • Model Name: mineral-nano-1
  • Model Type: Vision-Language Model (VLM)
  • Parameters: ~110M
  • Context Length: 2048 tokens
  • Architecture: Transformer-based decoder with vision encoder (12 layers)
  • Precision: BFloat16
  • Image Resolution: 224x224
  • Modalities: Text + Images

Architecture

Language Model

  • Hidden Size: 768
  • Intermediate Size: 3072
  • Attention Heads: 12
  • Layers: 12
  • Vocabulary Size: 32,000 tokens
  • Positional Encoding: RoPE (Rotary Position Embeddings)

Vision Encoder

  • Image Size: 224x224
  • Patch Size: 16x16
  • Hidden Size: 768
  • Layers: 12
  • Image Tokens: 196 per image
  • Architecture: ViT-style encoder
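
As a quick sanity check on the figures above: the 196 image tokens follow directly from the image and patch sizes. The snippet below is plain Python arithmetic and needs no model download.

# Each 224x224 input is split into non-overlapping 16x16 patches
image_size = 224
patch_size = 16
patches_per_side = image_size // patch_size  # 14
image_tokens = patches_per_side ** 2         # 14 * 14 = 196
print(image_tokens)  # 196, matching the Image Tokens figure above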

Usage

Installation

pip install transformers pillow torch

Basic Image Understanding

from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests

model_name = "Luke-Bergen/mineral-nano-1"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name)

# Load an image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare inputs
prompt = "<image>What is in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate response
outputs = model.generate(**inputs, max_new_tokens=100)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
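
The model card lists BFloat16 as the native precision; if you have a CUDA GPU, loading in bfloat16 is optional but keeps memory use down. This is a sketch using standard Transformers arguments, not a requirement:

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image

model_name = "Luke-Bergen/mineral-nano-1"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # matches the precision listed in Model Details
).to("cuda")

image = Image.open("path/to/your/image.jpg")
inputs = processor(text="<image>What is in this image?", images=image, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # keep tensors on the model's device
outputs = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(outputs[0], skip_special_tokens=True))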

Multiple Images

from PIL import Image

images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg")
]

prompt = "<image>Describe the first image. <image>Now describe the second image."
inputs = processor(text=prompt, images=images, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0], skip_special_tokens=True))

Chat with Images

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What objects are in this image?"}
        ]
    }
]

# Apply chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
print(processor.decode(outputs[0], skip_special_tokens=True))

Local Images

from PIL import Image

# Load local image
image = Image.open("path/to/your/image.jpg")

prompt = "<image>Describe what you see in detail."
inputs = processor(text=prompt, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))

Training Details

  • Framework: PyTorch with Transformers
  • Training Data: Text + Image pairs
  • Training Duration: [Specify training time]
  • Hardware: [Specify GPUs used]
  • Vision Encoder: Pretrained ViT encoder, fine-tuned jointly with the language model

Capabilities

✅ Image description and captioning
✅ Visual question answering
✅ Object detection and recognition
✅ Scene understanding
✅ Multi-image reasoning
✅ OCR and text extraction from images

Limitations

  • Limited to 224x224 resolution images
  • Context window of 2048 tokens, including image tokens (a quick length check is sketched after this list)
  • May struggle with fine-grained details
  • Best for general image understanding tasks
  • Compact size means reduced capabilities compared to larger VLMs
  • Limited multilingual vision capabilities
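
Because the 2048-token budget includes the image tokens, it can help to check prompt length before generating. A minimal sketch, reusing the processor, prompt, and image from the Usage section and assuming the full sequence (text plus image placeholders) is reported in input_ids:

MAX_CONTEXT = 2048
MAX_NEW_TOKENS = 100

inputs = processor(text=prompt, images=image, return_tensors="pt")
prompt_tokens = inputs["input_ids"].shape[-1]

# Leave headroom for the tokens you plan to generate
if prompt_tokens + MAX_NEW_TOKENS > MAX_CONTEXT:
    raise ValueError(f"Prompt uses {prompt_tokens} tokens; too long for the {MAX_CONTEXT}-token context")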

Intended Use

This model is designed for:

  • Educational purposes and learning VLM architectures
  • Prototyping multimodal applications
  • Low-resource deployment scenarios
  • Fast inference with vision capabilities
  • Mobile and edge device applications
  • Personal projects requiring image understanding

Image Preprocessing

Images are automatically:

  • Resized to 224x224
  • Normalized with CLIP-style statistics
  • Converted to RGB
  • Split into 16x16 patches (196 total patches)
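
A quick way to confirm this end to end, assuming the processor returns the image tensor under the usual pixel_values key:

from transformers import AutoProcessor
from PIL import Image

processor = AutoProcessor.from_pretrained("Luke-Bergen/mineral-nano-1")
image = Image.open("path/to/your/image.jpg")  # any size or mode; preprocessing handles resize + RGB

inputs = processor(text="<image>Describe this.", images=image, return_tensors="pt")
# After resizing to 224x224 and RGB conversion, the tensor should be (batch, 3, 224, 224)
print(inputs["pixel_values"].shape)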

Performance Tips

  • Use square images when possible for best results
  • Ensure images are clear and well-lit
  • Keep prompts concise and specific
  • Use batch processing for multiple images
  • Enable use_cache=True for faster generation
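
The last two tips combined, reusing the processor and model from the Usage section; this sketch assumes the processor pads batched text prompts the way a standard tokenizer does:

from PIL import Image

images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
prompts = ["<image>Describe this image.", "<image>What objects are visible?"]

# One padded batch instead of two separate calls
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")

# use_cache=True reuses attention key/value states between decoding steps
outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
for seq in outputs:
    print(processor.decode(seq, skip_special_tokens=True))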

License

[Specify your license - e.g., MIT, Apache 2.0, etc.]

Citation

@misc{mineral-nano-1-vision,
  author = {Luke Bergen},
  title = {Mineral Nano 1 Vision: A Compact Vision-Language Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Luke-Bergen/mineral-nano-1}
}

Contact

For questions or issues, please open an issue on the model repository.

Acknowledgments

This model builds upon research in vision transformers and multimodal learning.
