# MiniCPM-V-4.5-abliterated-int8
This is an 8-bit quantized version of huihui-ai/Huihui-MiniCPM-V-4_5-abliterated, produced with bitsandbytes LLM.int8() quantization.
## Model Details
- Base Model: huihui-ai/Huihui-MiniCPM-V-4_5-abliterated
- Quantization: 8-bit integer using bitsandbytes
- Model Size: ~9.35 GB (a 79.4% reduction from the original 45.28 GB)
- Compute dtype: float16
- Quantization method: LLM.int8() with mixed-precision decomposition
## Quantization Configuration

```json
{
  "load_in_8bit": true,
  "bnb_8bit_compute_dtype": "float16",
  "bnb_8bit_quant_type": "int8",
  "llm_int8_skip_modules": ["lm_head", "vision"],
  "llm_int8_threshold": 6.0,
  "quant_method": "bitsandbytes"
}
```
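If you would rather quantize the abliterated base model on the fly instead of downloading this pre-quantized copy, the settings above map onto `transformers`' `BitsAndBytesConfig` roughly as sketched below. This is an illustrative snippet, not the exact script used to produce this repository; `BitsAndBytesConfig` exposes the int8-specific options (`load_in_8bit`, `llm_int8_threshold`, `llm_int8_skip_modules`), while the fp16 compute dtype is set through `torch_dtype`.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Sketch: re-create an equivalent int8 quantization of the base model on the fly.
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0,                       # activation outliers above this are handled in fp16
    llm_int8_skip_modules=["lm_head", "vision"],  # leave these modules unquantized
)

model = AutoModelForCausalLM.from_pretrained(
    "huihui-ai/Huihui-MiniCPM-V-4_5-abliterated",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.float16,  # dtype for activations and the skipped modules
    trust_remote_code=True,
)
```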
## Key Features

- Mixed Precision: int8 weights with fp16 activations
- Outlier Management: activation features above the `llm_int8_threshold` are kept in fp16 to preserve accuracy (see the toy sketch below)
- Selective Quantization: skips critical modules (`lm_head`, `vision`) to preserve output quality
- Better accuracy than int4: larger than the 4-bit version, but with significantly better quality
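The mixed-precision decomposition behind LLM.int8() can be illustrated with a toy matmul: activation features whose magnitude exceeds the threshold stay in floating point, while the remaining features are quantized to int8 with absmax scaling. The snippet below is a didactic sketch of that idea only (CPU tensors, int64 accumulation), not bitsandbytes' actual int8 CUDA kernel.

```python
import torch

def toy_llm_int8_matmul(x: torch.Tensor, W: torch.Tensor, threshold: float = 6.0) -> torch.Tensor:
    """Toy sketch of LLM.int8() mixed-precision decomposition (not the real kernel).

    x: (batch, in_features) activations, W: (in_features, out_features) weights.
    Features whose max |activation| exceeds `threshold` use a full-precision matmul;
    the remaining features use absmax int8 quantization for both x and W.
    """
    outliers = x.abs().amax(dim=0) > threshold  # per-feature outlier mask
    y = torch.zeros(x.shape[0], W.shape[1], dtype=torch.float32, device=x.device)

    if outliers.any():  # high-magnitude features: keep full precision
        y += x[:, outliers].float() @ W[outliers, :].float()

    if (~outliers).any():  # regular features: absmax int8 quantization
        x_reg, w_reg = x[:, ~outliers], W[~outliers, :]
        sx = x_reg.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0  # per-row scale
        sw = w_reg.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.0  # per-column scale
        xq = torch.round(x_reg / sx).to(torch.int8)
        wq = torch.round(w_reg / sw).to(torch.int8)
        # integer matmul (accumulated in int64 here for simplicity), then rescaled back
        y += (xq.to(torch.int64) @ wq.to(torch.int64)).float() * (sx * sw)

    return y.to(x.dtype)
```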
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "wavespeed/MiniCPM-V-4_5-abliterated-int8",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "wavespeed/MiniCPM-V-4_5-abliterated-int8",
    trust_remote_code=True,
)

# For inference: the model automatically uses int8 weights with fp16 compute.
```
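For multimodal inference, the MiniCPM-V family exposes a custom `chat()` helper through `trust_remote_code`. A minimal sketch, assuming that interface matches the upstream MiniCPM-V usage examples (the exact signature may differ for this revision):

```python
from PIL import Image

image = Image.open("example.jpg").convert("RGB")  # any local image
msgs = [{"role": "user", "content": [image, "Describe this image."]}]

# chat() is provided by the model's remote code; argument names follow the
# upstream MiniCPM-V examples and may differ between revisions.
answer = model.chat(msgs=msgs, tokenizer=tokenizer)
print(answer)
```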
## Requirements
- transformers>=4.35.0
- bitsandbytes>=0.41.0
- torch>=2.0.0
- accelerate>=0.20.0
- CUDA-capable GPU (int8 quantization requires CUDA); a quick environment check is sketched below
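An illustrative sanity check that the environment meets the requirements above:

```python
import torch, transformers, bitsandbytes, accelerate

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("bitsandbytes:", bitsandbytes.__version__)
print("accelerate:", accelerate.__version__)

# int8 loading needs a CUDA-capable GPU
assert torch.cuda.is_available(), "int8 quantization requires a CUDA GPU"
```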
## Performance Notes
- Memory Usage: ~9.35 GB of VRAM required (see the quick check after this list)
- Speed: Slightly slower than fp16 due to dequantization overhead
- Quality: Better preservation of model quality compared to 4-bit quantization
- Best for: Users who need better quality than 4-bit but still want memory savings
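To verify the memory figures on your own hardware, something like the following can be run after loading the model as in the Usage section; `get_memory_footprint()` reports the weight footprint, while peak CUDA memory also includes activations.

```python
import torch

# `model` as loaded in the Usage section above
print(f"Weight footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")
if torch.cuda.is_available():
    print(f"Peak CUDA memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GiB")
```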
## Comparison with Other Quantizations
| Version | Size | Relative Quality | Use Case |
|---|---|---|---|
| Original (fp16) | 45.28 GB | Best | Maximum quality, high VRAM |
| int8 (this) | 9.35 GB | Very Good | Balanced quality/memory |
| int4 | 6.09 GB | Good | Maximum memory savings |
## License

Same as the original model; please refer to the base model's license.
## Acknowledgments

- Original model by huihui-ai
- Quantization performed with the bitsandbytes LLM.int8() method