---
license: apache-2.0
datasets:
- openai/gdpval
metrics:
- accuracy
base_model:
- deepseek-ai/DeepSeek-OCR
- tencent/HunyuanImage-3.0
- zai-org/GLM-4.6
new_version: PaddlePaddle/PaddleOCR-VL
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- agent
- legal
---

# Mineral Nano 1 Vision

Mineral Nano 1 Vision is a compact, efficient vision-language model designed for fast inference in low-resource environments, with support for both text and image inputs.

## Model Details

- **Model Name:** mineral-nano-1
- **Model Type:** Vision-Language Model (VLM)
- **Parameters:** ~110M
- **Context Length:** 2048 tokens
- **Architecture:** Transformer-based decoder (12 layers) with a ViT-style vision encoder (12 layers)
- **Precision:** BFloat16
- **Image Resolution:** 224x224
- **Modalities:** Text + Images

## Architecture

### Language Model

- **Hidden Size:** 768
- **Intermediate Size:** 3072
- **Attention Heads:** 12
- **Layers:** 12
- **Vocabulary Size:** 32,000 tokens
- **Positional Encoding:** RoPE (Rotary Position Embeddings)

### Vision Encoder

- **Image Size:** 224x224
- **Patch Size:** 16x16
- **Hidden Size:** 768
- **Layers:** 12
- **Image Tokens:** 196 per image
- **Architecture:** ViT-style encoder

## Usage

### Installation

```bash
pip install transformers pillow torch
```

### Basic Image Understanding

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests

model_name = "Luke-Bergen/mineral-nano-1"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name)

# Load an image from a URL
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare inputs
prompt = "What is in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate a response
outputs = model.generate(**inputs, max_new_tokens=100)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```

### Multiple Images

```python
from PIL import Image

images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg")
]

prompt = "Describe the first image. Now describe the second image."
inputs = processor(text=prompt, images=images, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

### Chat with Images

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What objects are in this image?"}
        ]
    }
]

# Apply the chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")

# temperature only takes effect when sampling is enabled
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
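### Faster Generation (GPU, BFloat16)

The checkpoint is stored in BFloat16 (see Model Details), so it can be loaded in that precision when a GPU is available and run with `use_cache=True` as recommended in the Performance Tips below. This is a minimal sketch using standard `transformers` loading arguments; the device/dtype handling is an assumption about typical usage rather than something specific to this checkpoint, and the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_name = "Luke-Bergen/mineral-nano-1"

# Assumed setup: BFloat16 on GPU if available, float32 on CPU otherwise.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name, torch_dtype=dtype).to(device)
model.eval()

image = Image.open("path/to/your/image.jpg")  # placeholder path
inputs = processor(text="Describe this image.", images=image, return_tensors="pt")
inputs = inputs.to(device, dtype)  # casts only the floating-point (pixel) tensors

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, use_cache=True)
print(processor.decode(outputs[0], skip_special_tokens=True))
```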
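### Batched Inference

The Performance Tips section below recommends batch processing for multiple images. Unlike the Multiple Images example above, which feeds both images into a single prompt, this sketch runs independent prompt/image pairs in one batch, reusing the `processor` and `model` from the earlier examples. The file names and prompts are placeholders, and the `padding` behavior and prompt-to-image pairing depend on the processor implementation.

```python
from PIL import Image

# Placeholder file names; any RGB images will do.
images = [Image.open("photo1.jpg"), Image.open("photo2.jpg")]
prompts = ["Describe this image.", "What objects are visible?"]

# Padding aligns the prompts to a common length so they fit in one batch.
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=100)
for sequence in outputs:
    print(processor.decode(sequence, skip_special_tokens=True))
```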
### Local Images

```python
from PIL import Image

# Load a local image
image = Image.open("path/to/your/image.jpg")
prompt = "Describe what you see in detail."

inputs = processor(text=prompt, images=image, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```

## Training Details

- **Framework:** PyTorch with Transformers
- **Training Data:** Text + image pairs
- **Training Duration:** [Specify training time]
- **Hardware:** [Specify GPUs used]
- **Vision Encoder:** Pretrained ViT encoder fine-tuned jointly with the language model

## Capabilities

✅ Image description and captioning
✅ Visual question answering
✅ Object detection and recognition
✅ Scene understanding
✅ Multi-image reasoning
✅ OCR and text extraction from images

## Limitations

- Limited to 224x224 resolution images
- Context window of 2048 tokens, including image tokens
- May struggle with fine-grained details
- Best suited for general image understanding tasks
- Compact size means reduced capabilities compared to larger VLMs
- Limited multilingual vision capabilities

## Intended Use

This model is designed for:

- Educational purposes and learning VLM architectures
- Prototyping multimodal applications
- Low-resource deployment scenarios
- Fast inference with vision capabilities
- Mobile and edge device applications
- Personal projects requiring image understanding

## Image Preprocessing

Images are automatically:

- Resized to 224x224
- Normalized with CLIP-style statistics
- Converted to RGB
- Split into 16x16 patches (196 patches total)

## Performance Tips

- Use square images when possible for best results
- Ensure images are clear and well-lit
- Keep prompts concise and specific
- Use batch processing for multiple images
- Enable `use_cache=True` for faster generation

## License

Apache 2.0 (see the `license` field in the metadata above).

## Citation

```bibtex
@misc{mineral-nano-1-vision,
  author = {Luke Bergen},
  title = {Mineral Nano 1 Vision: A Compact Vision-Language Model},
  year = {2025},
  publisher = {HuggingFace},
  url = {https://huggingface.co/Luke-Bergen/mineral-nano-1}
}
```

## Contact

For questions or issues, please open an issue on the model repository.

## Acknowledgments

This model builds upon research in vision transformers and multimodal learning.