# Mineral Nano 1 Vision
Mineral Nano 1 Vision is a compact, efficient vision-language model designed for fast multimodal inference in low-resource environments.
## Model Details

- **Model Name:** mineral-nano-1
- **Model Type:** Vision-Language Model (VLM)
- **Parameters:** ~110M
- **Context Length:** 2048 tokens
- **Architecture:** Transformer decoder (12 layers) paired with a ViT-style vision encoder (12 layers)
- **Precision:** BFloat16 (see the loading sketch after this list)
- **Image Resolution:** 224x224
- **Modalities:** Text + Images
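
A minimal loading sketch matching these specs, assuming the placeholder repository id used in the Usage section below; `torch_dtype=torch.bfloat16` mirrors the listed precision.

```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Placeholder Hub path; replace with the actual repository id.
model_name = "your-username/mineral-nano-1"

processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,  # matches the BFloat16 precision listed above
)
model.eval()
```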
## Architecture

### Language Model

- **Hidden Size:** 768
- **Intermediate Size:** 3072
- **Attention Heads:** 12
- **Layers:** 12
- **Vocabulary Size:** 32,000 tokens
- **Positional Encoding:** RoPE (Rotary Position Embeddings)

### Vision Encoder

- **Image Size:** 224x224
- **Patch Size:** 16x16
- **Hidden Size:** 768
- **Layers:** 12
- **Image Tokens:** 196 per image (see the sketch after this list)
- **Architecture:** ViT-style encoder
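
The numbers above fix the image-token count: a 224x224 input cut into 16x16 patches yields 14x14 = 196 patches, each of which becomes one token in the 2048-token context. A small sketch of that arithmetic (the dictionaries are illustrative summaries, not the model's actual config schema):

```python
# Illustrative architecture summary; field names are for exposition only.
vision_cfg = {"image_size": 224, "patch_size": 16, "hidden_size": 768, "layers": 12}
text_cfg = {
    "hidden_size": 768, "intermediate_size": 3072, "heads": 12, "layers": 12,
    "vocab_size": 32_000, "context_length": 2048,
}

patches_per_side = vision_cfg["image_size"] // vision_cfg["patch_size"]  # 224 // 16 = 14
image_tokens = patches_per_side ** 2                                     # 14 * 14 = 196
text_budget = text_cfg["context_length"] - image_tokens                  # 2048 - 196 = 1852

print(image_tokens, text_budget)  # 196 image tokens, ~1852 tokens left for text
```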
## Usage

### Installation

```bash
pip install transformers pillow torch
```
### Basic Image Understanding

```python
from transformers import AutoProcessor, AutoModelForVision2Seq
from PIL import Image
import requests

model_name = "your-username/mineral-nano-1"
processor = AutoProcessor.from_pretrained(model_name)
model = AutoModelForVision2Seq.from_pretrained(model_name)

# Load an image
url = "https://example.com/image.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Prepare inputs
prompt = "<image>What is in this image?"
inputs = processor(text=prompt, images=image, return_tensors="pt")

# Generate response
outputs = model.generate(**inputs, max_new_tokens=100)
response = processor.decode(outputs[0], skip_special_tokens=True)
print(response)
```
### Multiple Images

```python
from PIL import Image

images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
]

prompt = "<image>Describe the first image. <image>Now describe the second image."
inputs = processor(text=prompt, images=images, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
### Chat with Images

```python
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What objects are in this image?"},
        ],
    }
]

# Apply chat template
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")

# Sampling parameters such as temperature only take effect with do_sample=True
outputs = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
### Local Images

```python
from PIL import Image

# Load a local image
image = Image.open("path/to/your/image.jpg")

prompt = "<image>Describe what you see in detail."
inputs = processor(text=prompt, images=image, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0], skip_special_tokens=True))
```
## Training Details

- **Framework:** PyTorch with Transformers
- **Training Data:** Text + image pairs
- **Training Duration:** [Specify training time]
- **Hardware:** [Specify GPUs used]
- **Vision Encoder:** Pretrained ViT encoder fine-tuned jointly with the language model
## Capabilities

- Image description and captioning
- Visual question answering
- Object detection and recognition
- Scene understanding
- Multi-image reasoning
- OCR and text extraction from images
## Limitations

- Limited to 224x224 input resolution
- Context window of 2048 tokens, including the 196 image tokens per image (roughly 1,850 tokens remain for text with a single image)
- May struggle with fine-grained details
- Best suited to general image understanding tasks
- Compact size means reduced capability compared to larger VLMs
- Limited multilingual vision capabilities
## Intended Use

This model is designed for:

- Educational purposes and learning VLM architectures
- Prototyping multimodal applications
- Low-resource deployment scenarios
- Fast inference with vision capabilities
- Mobile and edge device applications
- Personal projects requiring image understanding
## Image Preprocessing

Images are automatically:

- Converted to RGB
- Resized to 224x224
- Normalized with CLIP-style statistics
- Split into 16x16 patches (14x14 = 196 patches per image; see the sketch below)
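
For reference, a rough equivalent of this pipeline written with torchvision (not in the install list above). The normalization constants are the standard CLIP means and standard deviations and are an assumption about this model's exact statistics; the processor's own config is authoritative.

```python
from PIL import Image
from torchvision import transforms

# Assumed CLIP-style normalization statistics.
CLIP_MEAN = (0.48145466, 0.4578275, 0.40821073)
CLIP_STD = (0.26862954, 0.26130258, 0.27577711)

manual_preprocess = transforms.Compose([
    transforms.Lambda(lambda im: im.convert("RGB")),   # ensure 3-channel RGB
    transforms.Resize((224, 224)),                     # resize to the model's input size
    transforms.ToTensor(),                             # PIL image -> CHW float tensor in [0, 1]
    transforms.Normalize(mean=CLIP_MEAN, std=CLIP_STD),
])

pixel_values = manual_preprocess(Image.open("path/to/your/image.jpg")).unsqueeze(0)
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])
```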
## Performance Tips

- Use square images when possible for best results
- Ensure images are clear and well-lit
- Keep prompts concise and specific
- Use batch processing for multiple images (see the sketch after this list)
- Enable `use_cache=True` for faster generation
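
A sketch of the batching and caching tips, reusing the `processor` and `model` from the Usage examples; padding behavior depends on the processor's tokenizer settings, so treat this as a starting point rather than a definitive recipe.

```python
from PIL import Image

# Batch several image/prompt pairs into one forward pass.
images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
prompts = [
    "<image>Describe this photo.",
    "<image>What is the main object?",
]

# padding=True pads the shorter prompt so both sequences fit in one batch.
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")

outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    use_cache=True,  # reuse past key/values for faster autoregressive decoding
)

for response in processor.batch_decode(outputs, skip_special_tokens=True):
    print(response)
```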
## License

[Specify your license - e.g., MIT, Apache 2.0, etc.]
## Citation

```bibtex
@misc{mineral-nano-1-vision,
  author    = {Your Name},
  title     = {Mineral Nano 1 Vision: A Compact Vision-Language Model},
  year      = {2025},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/your-username/mineral-nano-1}
}
```
## Contact

For questions or issues, please open an issue on the model repository.

## Acknowledgments

This model builds upon research in vision transformers and multimodal learning.