Octen-Embedding-0.6B — INT8 ONNX (SmoothQuant α=0.8, per-channel)

INT8-quantized ONNX export of Octen/Octen-Embedding-0.6B: ~1.06 GB, roughly 50% of the FP32 memory footprint, validated to preserve retrieval quality on a multilingual probe set.

Quality

Per-row cosine similarity vs the upstream PyTorch model on a 6-text multilingual probe set (English + German), computed with identical token IDs:

Variant of this repo                         cos_min   cos_mean
model.int8.onnx (SmoothQuant α=0.8)          0.987     0.992
model.int8.vanilla.onnx (kept for archive)   0.639     0.846
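
For reference, a minimal sketch of how the per-row comparison can be reproduced (illustrative; the actual harness is tools/dump_reference.py, and the .npy dump names here are hypothetical placeholders):

# Per-row cosine parity between two (n_texts, dim) embedding dumps
# produced from identical token IDs. File names are placeholders.
import numpy as np

ref  = np.load("reference_fp32.npy")   # upstream PyTorch embeddings
int8 = np.load("onnx_int8.npy")        # this repo's INT8 ONNX embeddings

ref  = ref  / np.linalg.norm(ref,  axis=1, keepdims=True)
int8 = int8 / np.linalg.norm(int8, axis=1, keepdims=True)
cos  = (ref * int8).sum(axis=1)

print(f"cos_min={cos.min():.3f}  cos_mean={cos.mean():.3f}")
assert cos.min() >= 0.90               # same gate as the fastembed-rs CI harness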

The previous artifact (vanilla quantize_dynamic) collapsed on Qwen3-class decoder LLMs because of activation outliers: matrix multiplies with a small number of large-magnitude activation channels exceed the INT8 dynamic range, and naive per-tensor or per-channel quantization has nowhere to put them. The German "Klimawandel" sentence in our probe set was the worst case (cos ≈ 0.64 on F2LLM, cos ≈ 0.64 on Octen).
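
To see why a single outlier is fatal, consider a toy per-tensor INT8 example (numbers are illustrative, not taken from the model):

# One large activation forces the quantization scale so wide that every
# normal-range value rounds to zero.
import numpy as np

x = np.array([0.02, -0.05, 0.03, 60.0])    # one outlier channel
scale = np.abs(x).max() / 127.0            # symmetric per-tensor INT8 scale
q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
print(q)           # [0 0 0 127]   -- all small activations collapse to zero
print(q * scale)   # [0. 0. 0. 60.] -- only the outlier survives dequantization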

SmoothQuant (Xiao et al. 2023) migrates these outliers from activations into weights via per-channel scaling: Y = (X / s) · (s · W), with s_j = max|X_j|^α / max|W_j|^(1−α). After scaling, the outliers live in s · W, and the now-balanced X / s quantizes cleanly. α=0.8 was the recommendation for LLM-class models with strong outliers; larger α migrates more outlier magnitude into the weights, at the cost of weight quantization quality.
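
A minimal sketch of the migration on a single linear layer (the shipped script rewrites the ONNX graph; the shapes and data here are synthetic):

# SmoothQuant scale migration: (X / s) @ (diag(s) @ W) == X @ W.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 64))             # calibration activations
X[:, 7] *= 80.0                            # inject one outlier channel
W = rng.normal(scale=0.02, size=(64, 64))  # layer weights

alpha = 0.8
act_max = np.abs(X).max(axis=0)            # per-channel activation range
w_max   = np.abs(W).max(axis=1)            # per-channel weight range
s = act_max**alpha / w_max**(1.0 - alpha)  # migration strength per channel

X_smooth = X / s                           # balanced: quantizes cleanly
W_smooth = W * s[:, None]                  # outliers absorbed into weights
assert np.allclose(X @ W, X_smooth @ W_smooth)   # numerically identical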

The fastembed-rs cosine-parity CI harness asserts cos_min ≥ 0.90 against this artifact.

Files

File                          Description
model.int8.onnx               Current SmoothQuant α=0.8 INT8 weights graph (use this).
model.int8.onnx.data          External data sidecar for the above (exported with use_external_data_format=True).
model.int8.vanilla.onnx       Archived original vanilla quantize_dynamic INT8. DO NOT use for retrieval; kept only for reproducibility of historical reports.
model.int8.vanilla.onnx.data  External data sidecar for the archived vanilla artifact.
tokenizer.json, tokenizer_config.json, special_tokens_map.json, added_tokens.json, config.json, merges.txt, vocab.json    Tokenizer + config copied from the upstream PyTorch repo.

Quantization recipe (reproducible)

# 1. SmoothQuant pre-processing (migrate outliers into weights)
python smoothquant_onnx.py \
    --fp32 model.onnx --output model.smoothed.fp32.onnx \
    --tokenizer <upstream snapshot> --alpha 0.8

# 2. Standard per-channel dynamic INT8 quantize on the smoothed FP32
python -c "from onnxruntime.quantization import quantize_dynamic, QuantType; \
    quantize_dynamic('model.smoothed.fp32.onnx', 'model.smoothed.int8.onnx', \
        per_channel=True, op_types_to_quantize=['MatMul'], \
        weight_type=QuantType.QInt8, use_external_data_format=True)"

The full driver lives in github.com/CrispStrobe/fastembed-rs: tools/dump_reference.py (validation), plus /scripts/smoothquant_onnx.py and /scripts/quant_smoothed_int8.py on the wip/validation branch.

Usage

This artifact is consumed by fastembed-rs under the canonical model_code cstr/Octen-Embedding-0.6B-ONNX-INT8 with model_file = "model.int8.onnx". Direct ORT usage (ONNX Runtime ≥ 1.17) is straightforward: load the .onnx, and ORT discovers the .data sidecar automatically as long as both files sit in the same directory.
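
A minimal Python sketch of direct usage (the input/output names and the last-token pooling below are assumptions typical of Qwen3-class embedding exports, not confirmed by this repo; inspect sess.get_inputs()/get_outputs() and the upstream docs to verify):

# Direct ONNX Runtime inference; model.int8.onnx.data must sit next to the graph.
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tok  = Tokenizer.from_file("tokenizer.json")
sess = ort.InferenceSession("model.int8.onnx")   # sidecar resolved automatically

enc = tok.encode("Wie wirkt sich der Klimawandel auf die Landwirtschaft aus?")
feeds = {
    "input_ids":      np.array([enc.ids], dtype=np.int64),
    "attention_mask": np.array([enc.attention_mask], dtype=np.int64),
}
(hidden,) = sess.run(None, feeds)   # assumed single output: (batch, seq, dim)

emb = hidden[0, -1]                 # last-token pooling (assumed for this model)
emb = emb / np.linalg.norm(emb)     # L2-normalize for cosine retrieval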

License

Apache 2.0, inherited from upstream Octen/Octen-Embedding-0.6B.


Change history

  • 2026-05-03 — Replaced model.int8.onnx with the SmoothQuant α=0.8 export. Original vanilla INT8 archived as model.int8.vanilla.onnx. Reason: vanilla quantize_dynamic produced cos_min=0.64 on this Qwen3-class decoder LLM (catastrophic outlier collapse on multilingual inputs); SmoothQuant recovers cos_min=0.99.
  • Original upload — vanilla quantize_dynamic per-channel INT8 export. Now archived.