MiniCPM-V-4.5-abliterated-int8

This is an 8-bit quantized version of huihui-ai/Huihui-MiniCPM-V-4_5-abliterated, produced with bitsandbytes LLM.int8() quantization.

Model Details

  • Base Model: huihui-ai/Huihui-MiniCPM-V-4_5-abliterated
  • Quantization: 8-bit integer using bitsandbytes
  • Model Size: ~9.35 GB (79.4% reduction from original 45.28 GB)
  • Compute dtype: float16
  • Quantization method: LLM.int8() with mixed-precision decomposition

Quantization Configuration

{
  "load_in_8bit": true,
  "bnb_8bit_compute_dtype": "float16",
  "bnb_8bit_quant_type": "int8",
  "llm_int8_skip_modules": ["lm_head", "vision"],
  "llm_int8_threshold": 6.0,
  "quant_method": "bitsandbytes"
}
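
For reference, a roughly equivalent configuration can be built programmatically with transformers' BitsAndBytesConfig if you want to re-quantize the fp16 base model yourself. This is a sketch, not part of this repository; when you load this repo directly, the saved quantization_config above is applied automatically.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,                            # LLM.int8() weight quantization
    llm_int8_threshold=6.0,                       # outlier threshold for the fp16 path
    llm_int8_skip_modules=["lm_head", "vision"],  # keep these modules unquantized
)

# Hypothetical re-quantization of the fp16 base model; adjust the path as needed.
model = AutoModelForCausalLM.from_pretrained(
    "huihui-ai/Huihui-MiniCPM-V-4_5-abliterated",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)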

Key Features

  • Mixed Precision: int8 weights with fp16 activations and compute
  • Outlier Management: activation outliers are automatically kept in fp16 for better accuracy (see the sketch after this list)
  • Selective Quantization: critical modules (lm_head, vision) are left unquantized to preserve quality
  • Better accuracy than int4: larger than the 4-bit build, but with noticeably better output quality
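
The sketch below illustrates the mixed-precision decomposition on a single linear layer: activation columns whose magnitude exceeds the 6.0 threshold are multiplied in full precision, and the rest go through symmetric int8 quantization. It is a simplified illustration, not the actual bitsandbytes kernel.

import torch

def int8_mixed_matmul(x, w, threshold=6.0):
    """Simplified LLM.int8()-style matmul: x is (batch, in), w is (in, out)."""
    # Feature columns with any activation magnitude above the threshold are outliers.
    outliers = (x.abs() > threshold).any(dim=0)

    # Outlier features: computed in the original precision.
    out_hp = x[:, outliers] @ w[outliers, :]

    # Remaining features: symmetric per-row / per-column int8 quantization.
    x_r, w_r = x[:, ~outliers], w[~outliers, :]
    sx = x_r.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    sw = w_r.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.0
    xq = (x_r / sx).round().clamp(-127, 127)
    wq = (w_r / sw).round().clamp(-127, 127)
    out_i8 = (xq @ wq) * sx * sw   # dequantize after the int8 matmul

    return out_hp + out_i8

# Quick check against a plain full-precision matmul:
x, w = torch.randn(4, 64), torch.randn(64, 32)
print((int8_mixed_matmul(x, w) - x @ w).abs().max())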

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "wavespeed/MiniCPM-V-4_5-abliterated-int8",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(
    "wavespeed/MiniCPM-V-4_5-abliterated-int8",
    trust_remote_code=True
)

# For inference
# The model will automatically use int8 weights with fp16 compute
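
Below is a minimal multimodal inference sketch. The chat() method, its argument names (msgs, image), and the message format come from the base model's custom remote code (MiniCPM-V style interface) and are assumptions here; consult the base model card for the exact interface.

from PIL import Image

image = Image.open("example.jpg").convert("RGB")   # any local image
msgs = [{"role": "user", "content": [image, "Describe this image."]}]

# chat() is defined by the base model's remote code; argument names may differ by release.
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)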

Requirements

  • transformers>=4.35.0
  • bitsandbytes>=0.41.0
  • torch>=2.0.0
  • accelerate>=0.20.0
  • CUDA-capable GPU (int8 quantization requires CUDA; see the environment check below)
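
A quick sanity check that the pinned requirements above are met (a convenience sketch, not part of this repository):

import torch, transformers, bitsandbytes, accelerate

for name, mod in [("transformers", transformers), ("bitsandbytes", bitsandbytes),
                  ("torch", torch), ("accelerate", accelerate)]:
    print(f"{name}: {mod.__version__}")

# int8 quantization runs on GPU only
assert torch.cuda.is_available(), "int8 quantization requires a CUDA-capable GPU"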

Performance Notes

  • Memory Usage: ~9.35 GB VRAM required (see the check after this list)
  • Speed: Slightly slower than fp16 due to dequantization overhead
  • Quality: Better preservation of model quality compared to 4-bit quantization
  • Best for: Users who need better quality than 4-bit but still want memory savings
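
Once the model is loaded as in the Usage section, transformers' get_memory_footprint() reports the weight footprint, which should be roughly in line with the ~9.35 GB figure above (the exact number may differ slightly on your hardware).

# In-memory size of the loaded (quantized) weights, in GiB.
print(f"{model.get_memory_footprint() / 1024**3:.2f} GiB")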

Comparison with Other Quantizations

Version          Size      Relative Quality  Use Case
Original (fp16)  45.28 GB  Best              Maximum quality, high VRAM
int8 (this)      9.35 GB   Very Good         Balanced quality/memory
int4             6.09 GB   Good              Maximum memory savings

License

Same as the original model - please refer to the base model's license.

Acknowledgments

Thanks to huihui-ai for the abliterated base model and to the OpenBMB team for the original MiniCPM-V-4.5.