MiniCPM-V-4.5-abliterated-int8

This is an 8-bit quantized version of huihui-ai/Huihui-MiniCPM-V-4_5-abliterated, produced with bitsandbytes LLM.int8() quantization.

Model Details

  • Base Model: huihui-ai/Huihui-MiniCPM-V-4_5-abliterated
  • Quantization: 8-bit integer using bitsandbytes
  • Model Size: ~9.35 GB (79.4% reduction from original 45.28 GB)
  • Compute dtype: float16
  • Quantization method: LLM.int8() with mixed-precision decomposition

Quantization Configuration

{
  "load_in_8bit": true,
  "bnb_8bit_compute_dtype": "float16",
  "bnb_8bit_quant_type": "int8",
  "llm_int8_skip_modules": ["lm_head", "vision"],
  "llm_int8_threshold": 6.0,
  "quant_method": "bitsandbytes"
}
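
For reference, a roughly equivalent configuration can be built programmatically with transformers' BitsAndBytesConfig if you want to re-quantize the fp16 base model yourself. This is a sketch, not part of this repository; when you load this repo directly, the saved quantization_config above is applied automatically.

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,                            # LLM.int8() weight quantization
    llm_int8_threshold=6.0,                       # outlier threshold for the fp16 path
    llm_int8_skip_modules=["lm_head", "vision"],  # keep these modules unquantized
)

# Hypothetical re-quantization of the fp16 base model; adjust the path as needed.
model = AutoModelForCausalLM.from_pretrained(
    "huihui-ai/Huihui-MiniCPM-V-4_5-abliterated",
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)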

Key Features

  • Mixed Precision: int8 weights with fp16 activations and compute
  • Outlier Management: activation outliers are automatically kept in fp16 for better accuracy (see the sketch after this list)
  • Selective Quantization: critical modules (lm_head, vision) are left unquantized to preserve quality
  • Better accuracy than int4: larger than the 4-bit build, but with noticeably better output quality
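
The sketch below illustrates the mixed-precision decomposition on a single linear layer: activation columns whose magnitude exceeds the 6.0 threshold are multiplied in full precision, and the rest go through symmetric int8 quantization. It is a simplified illustration, not the actual bitsandbytes kernel.

import torch

def int8_mixed_matmul(x, w, threshold=6.0):
    """Simplified LLM.int8()-style matmul: x is (batch, in), w is (in, out)."""
    # Feature columns with any activation magnitude above the threshold are outliers.
    outliers = (x.abs() > threshold).any(dim=0)

    # Outlier features: computed in the original precision.
    out_hp = x[:, outliers] @ w[outliers, :]

    # Remaining features: symmetric per-row / per-column int8 quantization.
    x_r, w_r = x[:, ~outliers], w[~outliers, :]
    sx = x_r.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 127.0
    sw = w_r.abs().amax(dim=0, keepdim=True).clamp(min=1e-8) / 127.0
    xq = (x_r / sx).round().clamp(-127, 127)
    wq = (w_r / sw).round().clamp(-127, 127)
    out_i8 = (xq @ wq) * sx * sw   # dequantize after the int8 matmul

    return out_hp + out_i8

# Quick check against a plain full-precision matmul:
x, w = torch.randn(4, 64), torch.randn(64, 32)
print((int8_mixed_matmul(x, w) - x @ w).abs().max())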

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "wavespeed/MiniCPM-V-4_5-abliterated-int8",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype=torch.float16
)

tokenizer = AutoTokenizer.from_pretrained(
    "wavespeed/MiniCPM-V-4_5-abliterated-int8",
    trust_remote_code=True
)

# For inference
# The model will automatically use int8 weights with fp16 compute
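
Below is a minimal multimodal inference sketch. The chat() method, its argument names (msgs, image), and the message format come from the base model's custom remote code (MiniCPM-V style interface) and are assumptions here; consult the base model card for the exact interface.

from PIL import Image

image = Image.open("example.jpg").convert("RGB")   # any local image
msgs = [{"role": "user", "content": [image, "Describe this image."]}]

# chat() is defined by the base model's remote code; argument names may differ by release.
answer = model.chat(image=None, msgs=msgs, tokenizer=tokenizer)
print(answer)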

Requirements

  • transformers>=4.35.0
  • bitsandbytes>=0.41.0
  • torch>=2.0.0
  • accelerate>=0.20.0
  • CUDA-capable GPU (int8 quantization requires CUDA; see the environment check below)
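
A quick sanity check that the pinned requirements above are met (a convenience sketch, not part of this repository):

import torch, transformers, bitsandbytes, accelerate

for name, mod in [("transformers", transformers), ("bitsandbytes", bitsandbytes),
                  ("torch", torch), ("accelerate", accelerate)]:
    print(f"{name}: {mod.__version__}")

# int8 quantization runs on GPU only
assert torch.cuda.is_available(), "int8 quantization requires a CUDA-capable GPU"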

Performance Notes

  • Memory Usage: ~9.35 GB VRAM required (see the check after this list)
  • Speed: Slightly slower than fp16 due to dequantization overhead
  • Quality: Better preservation of model quality compared to 4-bit quantization
  • Best for: Users who need better quality than 4-bit but still want memory savings
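
Once the model is loaded as in the Usage section, transformers' get_memory_footprint() reports the weight footprint, which should be roughly in line with the ~9.35 GB figure above (the exact number may differ slightly on your hardware).

# In-memory size of the loaded (quantized) weights, in GiB.
print(f"{model.get_memory_footprint() / 1024**3:.2f} GiB")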

Comparison with Other Quantizations

Version          Size      Relative Quality  Use Case
Original (fp16)  45.28 GB  Best              Maximum quality, high VRAM
int8 (this)      9.35 GB   Very Good         Balanced quality/memory
int4             6.09 GB   Good              Maximum memory savings

License

Same as the original model - please refer to the base model's license.

Acknowledgments

Thanks to huihui-ai for the abliterated base model and to the OpenBMB team for the original MiniCPM-V-4.5.