---
license: apache-2.0
datasets:
- openai/gdpval
metrics:
- accuracy
base_model:
- deepseek-ai/DeepSeek-OCR
- tencent/HunyuanImage-3.0
- zai-org/GLM-4.6
new_version: PaddlePaddle/PaddleOCR-VL
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- agent
- legal
---

# Mineral Nano 1 Vision

Mineral Nano 1 Vision is a compact, efficient vision-language model designed for fast inference in low-resource environments, with support for both text and image inputs.

## Model Details

- **Model Name:** mineral-nano-1
- **Model Type:** Vision-Language Model (VLM)
- **Parameters:** ~110M
- **Context Length:** 2048 tokens
- **Architecture:** Transformer-based decoder (12 layers) with a ViT-style vision encoder (12 layers)
- **Precision:** BFloat16
- **Image Resolution:** 224x224
- **Modalities:** Text + Images

## Architecture

### Language Model

- **Hidden Size:** 768
- **Intermediate Size:** 3072
- **Attention Heads:** 12
- **Layers:** 12
- **Vocabulary Size:** 32,000 tokens
- **Positional Encoding:** RoPE (Rotary Position Embeddings)

### Vision Encoder

- **Image Size:** 224x224
- **Patch Size:** 16x16
- **Hidden Size:** 768
- **Layers:** 12
- **Image Tokens:** 196 per image
- **Architecture:** ViT-style encoder

## Usage

### Installation

```bash
pip install transformers pillow torch
```

### Basic Image Understanding

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests

model_name = "Luke-Bergen/mineral-nano-1"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name)

# Load an image from a URL
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare inputs
prompt = "What is in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate a response
outputs = model.generate(**inputs, max_new_tokens=100)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Multiple Images

```python
from PIL import Image

images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg")
]

prompt = "Describe the first image. Now describe the second image."
inputs = processor(text=prompt, images=images, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

### Chat with Images

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What objects are in this image?"}
        ]
    }
]

# Apply the chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")

# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
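### Faster Generation (GPU, BFloat16)

The checkpoint is stored in BFloat16 (see Model Details), so it can be loaded in that precision when a GPU is available and run with `use_cache=True` as recommended in the Performance Tips below. This is a minimal sketch using standard `transformers` loading arguments; the device/dtype handling is an assumption about typical usage rather than something specific to this checkpoint, and the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "Luke-Bergen/mineral-nano-1"

# Assumed setup: BFloat16 on GPU if available, float32 on CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name, torch_dtype=dtype).to(device)
model.eval()

image = Image.open("path/to/your/image.jpg")  # placeholder path
inputs = processor(text="Describe this image.", images=image, return_tensors="pt")
inputs = inputs.to(device, dtype)  # casts only the floating-point (pixel) tensors

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True)
print(processor.decode(outputs[0], skip_special_tokens=True))
```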
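### Batched Inference

The Performance Tips section below recommends batch processing for multiple images. Unlike the Multiple Images example above, which feeds both images into a single prompt, this sketch runs independent prompt/image pairs in one batch, reusing the `processor` and `model` from the earlier examples. The file names and prompts are placeholders, and the `padding` behavior and prompt-to-image pairing depend on the processor implementation.

```python
from PIL import Image

# Placeholder file names; any RGB images will do.
images = [Image.open("photo1.jpg"), Image.open("photo2.jpg")]
prompts = ["Describe this image.", "What objects are visible?"]

# Padding aligns the prompts to a common length so they fit in one batch.
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=100)
for sequence in outputs:
    print(processor.decode(sequence, skip_special_tokens=True))
```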
### Local Images

```python
from PIL import Image

# Load a local image
image = Image.open("path/to/your/image.jpg")
prompt = "Describe what you see in detail."

inputs = processor(text=prompt, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

- **Framework:** PyTorch with Transformers
- **Training Data:** Text + image pairs
- **Training Duration:** [Specify training time]
- **Hardware:** [Specify GPUs used]
- **Vision Encoder:** Pretrained ViT encoder fine-tuned jointly with the language model

## Capabilities

✅ Image description and captioning
✅ Visual question answering
✅ Object detection and recognition
✅ Scene understanding
✅ Multi-image reasoning
✅ OCR and text extraction from images

## Limitations

- Limited to 224x224 resolution images
- Context window of 2048 tokens, including image tokens
- May struggle with fine-grained details
- Best suited for general image understanding tasks
- Compact size means reduced capabilities compared to larger VLMs
- Limited multilingual vision capabilities

## Intended Use

This model is designed for:

- Educational purposes and learning VLM architectures
- Prototyping multimodal applications
- Low-resource deployment scenarios
- Fast inference with vision capabilities
- Mobile and edge device applications
- Personal projects requiring image understanding

## Image Preprocessing

Images are automatically:

- Resized to 224x224
- Normalized with CLIP-style statistics
- Converted to RGB
- Split into 16x16 patches (196 patches total)

## Performance Tips

- Use square images when possible for best results
- Ensure images are clear and well-lit
- Keep prompts concise and specific
- Use batch processing for multiple images
- Enable `use_cache=True` for faster generation

## License

Apache 2.0 (see the `license` field in the metadata above).

## Citation

```bibtex
@misc{mineral-nano-1-vision,
  author = {Luke Bergen},
  title = {Mineral Nano 1 Vision: A Compact Vision-Language Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Luke-Bergen/mineral-nano-1}
}
```

## Contact

For questions or issues, please open an issue on the model repository.

## Acknowledgments

This model builds upon research in vision transformers and multimodal learning.