# Mineral Nano 1 Vision
Mineral Nano 1 Vision is a compact, efficient vision-language model designed for fast multimodal inference in low-resource environments.
## Model Details

- **Model Name:** mineral-nano-1
- **Model Type:** Vision-Language Model (VLM)
- **Parameters:** ~110M
- **Context Length:** 2048 tokens
- **Architecture:** Transformer decoder (12 layers) paired with a ViT-style vision encoder (12 layers)
- **Precision:** BFloat16 (see the loading sketch after this list)
- **Image Resolution:** 224x224
- **Modalities:** Text + Images
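
A minimal loading sketch matching these specs, assuming the placeholder repository id used in the Usage section below; `torch_dtype=torch.bfloat16` mirrors the listed precision.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Placeholder Hub path; replace with the actual repository id.
model_name = "your-username/mineral-nano-1"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # matches the BFloat16 precision listed above
)
model.eval()
```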
## Architecture

### Language Model

- **Hidden Size:** 768
- **Intermediate Size:** 3072
- **Attention Heads:** 12
- **Layers:** 12
- **Vocabulary Size:** 32,000 tokens
- **Positional Encoding:** RoPE (Rotary Position Embeddings)

### Vision Encoder

- **Image Size:** 224x224
- **Patch Size:** 16x16
- **Hidden Size:** 768
- **Layers:** 12
- **Image Tokens:** 196 per image (see the sketch after this list)
- **Architecture:** ViT-style encoder
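
The numbers above fix the image-token count: a 224x224 input cut into 16x16 patches yields 14x14 = 196 patches, each of which becomes one token in the 2048-token context. A small sketch of that arithmetic (the dictionaries are illustrative summaries, not the model's actual config schema):

```python
# Illustrative architecture summary; field names are for exposition only.
vision_cfg = {"image_size": 224, "patch_size": 16, "hidden_size": 768, "layers": 12}
text_cfg = {
    "hidden_size": 768, "intermediate_size": 3072, "heads": 12, "layers": 12,
    "vocab_size": 32_000, "context_length": 2048,
}

patches_per_side = vision_cfg["image_size"] // vision_cfg["patch_size"]  # 224 // 16 = 14
image_tokens = patches_per_side ** 2                                     # 14 * 14 = 196
text_budget = text_cfg["context_length"] - image_tokens                  # 2048 - 196 = 1852

print(image_tokens, text_budget)  # 196 image tokens, ~1852 tokens left for text
```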
## Usage

### Installation

```bash
pip install transformers pillow torch
```
### Basic Image Understanding

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests

model_name = "your-username/mineral-nano-1"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name)

# Load an image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare inputs
prompt = "<image>What is in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate response
outputs = model.generate(**inputs, max_new_tokens=100)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Multiple Images

```python
from PIL import Image

images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
]

prompt = "<image>Describe the first image. <image>Now describe the second image."
inputs = processor(text=prompt, images=images, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
### Chat with Images

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What objects are in this image?"},
        ],
    }
]

# Apply chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")

# Sampling parameters such as temperature only take effect with do_sample=True
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
### Local Images

```python
from PIL import Image

# Load a local image
image = Image.open("path/to/your/image.jpg")

prompt = "<image>Describe what you see in detail."
inputs = processor(text=prompt, images=image, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
## Training Details

- **Framework:** PyTorch with Transformers
- **Training Data:** Text + image pairs
- **Training Duration:** [Specify training time]
- **Hardware:** [Specify GPUs used]
- **Vision Encoder:** Pretrained ViT encoder fine-tuned jointly with the language model
## Capabilities

- Image description and captioning
- Visual question answering
- Object detection and recognition
- Scene understanding
- Multi-image reasoning
- OCR and text extraction from images
## Limitations

- Limited to 224x224 input resolution
- Context window of 2048 tokens, including the 196 image tokens per image (roughly 1,850 tokens remain for text with a single image)
- May struggle with fine-grained details
- Best suited to general image understanding tasks
- Compact size means reduced capability compared to larger VLMs
- Limited multilingual vision capabilities
## Intended Use

This model is designed for:

- Educational purposes and learning VLM architectures
- Prototyping multimodal applications
- Low-resource deployment scenarios
- Fast inference with vision capabilities
- Mobile and edge device applications
- Personal projects requiring image understanding
## Image Preprocessing

Images are automatically:

- Converted to RGB
- Resized to 224x224
- Normalized with CLIP-style statistics
- Split into 16x16 patches (14x14 = 196 patches per image; see the sketch below)
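
For reference, a rough equivalent of this pipeline written with torchvision (not in the install list above). The normalization constants are the standard CLIP means and standard deviations and are an assumption about this model's exact statistics; the processor's own config is authoritative.

```python
from PIL import Image
from torchvision import transforms

# Assumed CLIP-style normalization statistics.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

manual_preprocess = transforms.Compose([
    transforms.Lambda(lambda im: im.convert("RGB")),   # ensure 3-channel RGB
    transforms.Resize((224, 224)),                     # resize to the model's input size
    transforms.ToTensor(),                             # PIL image -> CHW float tensor in [0, 1]
    transforms.Normalize(mean=CLIP_MEAN, std=CLIP_STD),
])

pixel_values = manual_preprocess(Image.open("path/to/your/image.jpg")).unsqueeze(0)
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])
```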
## Performance Tips

- Use square images when possible for best results
- Ensure images are clear and well-lit
- Keep prompts concise and specific
- Use batch processing for multiple images (see the sketch after this list)
- Enable `use_cache=True` for faster generation
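
A sketch of the batching and caching tips, reusing the `processor` and `model` from the Usage examples; padding behavior depends on the processor's tokenizer settings, so treat this as a starting point rather than a definitive recipe.

```python
from PIL import Image

# Batch several image/prompt pairs into one forward pass.
images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
prompts = [
    "<image>Describe this photo.",
    "<image>What is the main object?",
]

# padding=True pads the shorter prompt so both sequences fit in one batch.
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    use_cache=True,  # reuse past key/values for faster autoregressive decoding
)

for response in processor.batch_decode(outputs, skip_special_tokens=True):
    print(response)
```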
## License

[Specify your license - e.g., MIT, Apache 2.0, etc.]
## Citation

```bibtex
@misc{mineral-nano-1-vision,
  author    = {Your Name},
  title     = {Mineral Nano 1 Vision: A Compact Vision-Language Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/your-username/mineral-nano-1}
}
```
## Contact

For questions or issues, please open an issue on the model repository.

## Acknowledgments

This model builds upon research in vision transformers and multimodal learning.