# Octen-Embedding-0.6B — INT8 ONNX (SmoothQuant α=0.8, per-channel)

INT8-quantized ONNX export of Octen/Octen-Embedding-0.6B. ~1.06 GB, roughly half the FP32 memory footprint, validated to preserve retrieval quality on a multilingual probe set.
## Quality
Per-row cosine similarity vs the upstream PyTorch model on a 6-text multilingual probe set (English + German), computed with identical token IDs:
| Variant (this repo) | cos_min | cos_mean |
|---|---|---|
| `model.int8.onnx` (SmoothQuant α=0.8) | 0.987 | 0.992 |
| `model.int8.vanilla.onnx` (archived) | 0.639 | 0.846 |
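As a sketch, the parity numbers above come from a row-wise cosine metric like the following. This is illustrative NumPy only: `cosine_parity` is a hypothetical helper with synthetic stand-ins for the real probe-set embeddings, not the actual harness code.

```python
import numpy as np

def cosine_parity(ref: np.ndarray, quant: np.ndarray):
    """Row-wise cosine similarity between two (n_texts, dim) embedding matrices."""
    ref_n = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    quant_n = quant / np.linalg.norm(quant, axis=1, keepdims=True)
    cos = np.sum(ref_n * quant_n, axis=1)
    return cos.min(), cos.mean()

rng = np.random.default_rng(0)
ref = rng.normal(size=(6, 1024))                 # 6 probe texts, toy dimension
quant = ref + 0.01 * rng.normal(size=ref.shape)  # mild quantization noise
cos_min, cos_mean = cosine_parity(ref, quant)
assert cos_min > 0.99  # well-behaved quantization keeps every row aligned
```

A collapsed quantization shows up here exactly as in the table: a single badly-reconstructed probe row drags `cos_min` far below `cos_mean`.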
The previous (vanilla quantize_dynamic) artifact collapsed on Qwen3-class decoder LLMs because of activation outliers: matrix multiplies with a small number of large-magnitude activations exceed INT8 dynamic range, and per-tensor / per-channel naive quantization has nowhere to put them. The German "Klimawandel" sentence in our probe set was the worst case (cos≈0.64 on F2LLM, ≈0.64 on Octen).
SmoothQuant (Xiao et al. 2023) migrates these outliers from activations into weights via a per-channel scaling: Y = (X / s) · (s · W). After scaling, the outliers live in s · W, and the now-balanced X / s quantizes cleanly. α=0.8 is the recommendation for hard-to-quantize LLMs; larger α moves more outliers into weights at the cost of weight quantization quality.
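The migration can be illustrated with a toy NumPy example (synthetic matrices, not this model's tensors; the scale formula s = max|X|^α / max|W|^(1−α) follows the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50.0                       # one outlier activation channel
W = rng.normal(size=(8, 8))

alpha = 0.8
act_max = np.abs(X).max(axis=0)       # per-input-channel activation max
w_max = np.abs(W).max(axis=1)         # per-input-channel weight max
s = act_max**alpha / w_max**(1 - alpha)

X_s = X / s                           # balanced activations: quantize cleanly
W_s = s[:, None] * W                  # outliers absorbed into the weights

assert np.allclose(X_s @ W_s, X @ W)        # Y is mathematically unchanged
assert np.abs(X_s).max() < np.abs(X).max()  # activation outliers flattened
```

The identity holds exactly because each input channel j is divided by s_j on one side and multiplied by s_j on the other; only the *quantization* behaviour of the two factors changes.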
The fastembed-rs cosine-parity CI harness asserts cos_min ≥ 0.90 against this artifact.
## Files
| File | Description |
|---|---|
| `model.int8.onnx` | Current SmoothQuant α=0.8 INT8 weights graph (use this). |
| `model.int8.onnx.data` | External data sidecar for the above (`use_external_data_format=True`). |
| `model.int8.vanilla.onnx` | Archived original vanilla `quantize_dynamic` INT8 — DO NOT use for retrieval; kept only for reproducibility of historical reports. |
| `model.int8.vanilla.onnx.data` | External data sidecar for the archived vanilla artifact. |
| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `added_tokens.json`, `config.json`, `merges.txt`, `vocab.json` | Tokenizer + config copied from the upstream PyTorch repo. |
## Quantization recipe (reproducible)
```bash
# 1. SmoothQuant pre-processing (migrate outliers into weights)
python smoothquant_onnx.py \
  --fp32 model.onnx --output model.smoothed.fp32.onnx \
  --tokenizer <upstream snapshot> --alpha 0.8

# 2. Standard per-channel dynamic INT8 quantize on the smoothed FP32
python -c "from onnxruntime.quantization import quantize_dynamic, QuantType; \
quantize_dynamic('model.smoothed.fp32.onnx', 'model.smoothed.int8.onnx', \
per_channel=True, op_types_to_quantize=['MatMul'], \
weight_type=QuantType.QInt8, use_external_data_format=True)"
```
The full driver lives at github.com/CrispStrobe/fastembed-rs
under tools/dump_reference.py (validation) and the wip/validation branch's
/scripts/smoothquant_onnx.py + /scripts/quant_smoothed_int8.py.
## Usage
This artifact is consumed by fastembed-rs under the canonical model_code `cstr/Octen-Embedding-0.6B-ONNX-INT8` with `model_file = "model.int8.onnx"`. Direct ORT usage (ONNX Runtime ≥ 1.17) is straightforward: load the `.onnx` and ORT will discover the `.data` sidecar automatically as long as both files sit in the same directory.
## License
Apache 2.0, inherited from upstream Octen/Octen-Embedding-0.6B.
## Change history
- 2026-05-03 — Replaced `model.int8.onnx` with the SmoothQuant α=0.8 export; the original vanilla INT8 is archived as `model.int8.vanilla.onnx`. Reason: vanilla `quantize_dynamic` produced cos_min=0.64 on this Qwen3-class decoder LLM (catastrophic outlier collapse on multilingual inputs); SmoothQuant recovers cos_min≈0.99.
- Original upload — vanilla `quantize_dynamic` per-channel INT8 export. Now archived.
## Base model

Qwen/Qwen3-0.6B-Base