# Octen-Embedding-0.6B — INT8 ONNX (SmoothQuant α=0.8, per-channel)

INT8-quantized ONNX export of Octen/Octen-Embedding-0.6B. ~1.06 GB, roughly half the FP32 memory footprint, validated to preserve retrieval quality on a multilingual probe set.
## Quality
Per-row cosine similarity vs the upstream PyTorch model on a 6-text multilingual probe set (English + German), computed with identical token IDs:
| Variant (this repo) | cos_min | cos_mean |
|---|---|---|
| `model.int8.onnx` (SmoothQuant α=0.8) | 0.987 | 0.992 |
| `model.int8.vanilla.onnx` (archived) | 0.639 | 0.846 |
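As a sketch, the parity numbers above come from a row-wise cosine metric like the following. This is illustrative NumPy only: `cosine_parity` is a hypothetical helper with synthetic stand-ins for the real probe-set embeddings, not the actual harness code.

```python
import numpy as np

def cosine_parity(ref: np.ndarray, quant: np.ndarray):
    """Row-wise cosine similarity between two (n_texts, dim) embedding matrices."""
    ref_n = ref / np.linalg.norm(ref, axis=1, keepdims=True)
    quant_n = quant / np.linalg.norm(quant, axis=1, keepdims=True)
    cos = np.sum(ref_n * quant_n, axis=1)
    return cos.min(), cos.mean()

rng = np.random.default_rng(0)
ref = rng.normal(size=(6, 1024))                 # 6 probe texts, toy dimension
quant = ref + 0.01 * rng.normal(size=ref.shape)  # mild quantization noise
cos_min, cos_mean = cosine_parity(ref, quant)
assert cos_min > 0.99  # well-behaved quantization keeps every row aligned
```

A collapsed quantization shows up here exactly as in the table: a single badly-reconstructed probe row drags `cos_min` far below `cos_mean`.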
The previous (vanilla quantize_dynamic) artifact collapsed on Qwen3-class decoder LLMs because of activation outliers: matrix multiplies with a small number of large-magnitude activations exceed INT8 dynamic range, and per-tensor / per-channel naive quantization has nowhere to put them. The German "Klimawandel" sentence in our probe set was the worst case (cos≈0.64 on F2LLM, ≈0.64 on Octen).
SmoothQuant (Xiao et al. 2023) migrates these outliers from activations into weights via a per-channel scaling: Y = (X / s) · (s · W). After scaling, the outliers live in s · W, and the now-balanced X / s quantizes cleanly. α=0.8 is the recommendation for hard-to-quantize LLMs; larger α moves more outliers into weights at the cost of weight quantization quality.
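The migration can be illustrated with a toy NumPy example (synthetic matrices, not this model's tensors; the scale formula s = max|X|^α / max|W|^(1−α) follows the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
X[:, 3] *= 50.0                       # one outlier activation channel
W = rng.normal(size=(8, 8))

alpha = 0.8
act_max = np.abs(X).max(axis=0)       # per-input-channel activation max
w_max = np.abs(W).max(axis=1)         # per-input-channel weight max
s = act_max**alpha / w_max**(1 - alpha)

X_s = X / s                           # balanced activations: quantize cleanly
W_s = s[:, None] * W                  # outliers absorbed into the weights

assert np.allclose(X_s @ W_s, X @ W)        # Y is mathematically unchanged
assert np.abs(X_s).max() < np.abs(X).max()  # activation outliers flattened
```

The identity holds exactly because each input channel j is divided by s_j on one side and multiplied by s_j on the other; only the *quantization* behaviour of the two factors changes.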
The fastembed-rs cosine-parity CI harness asserts cos_min ≥ 0.90 against this artifact.
## Files
| File | Description |
|---|---|
| `model.int8.onnx` | Current SmoothQuant α=0.8 INT8 weights graph (use this). |
| `model.int8.onnx.data` | External data sidecar for the above (`use_external_data_format=True`). |
| `model.int8.vanilla.onnx` | Archived original vanilla `quantize_dynamic` INT8 — DO NOT use for retrieval; kept only for reproducibility of historical reports. |
| `model.int8.vanilla.onnx.data` | External data sidecar for the archived vanilla artifact. |
| `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `added_tokens.json`, `config.json`, `merges.txt`, `vocab.json` | Tokenizer + config copied from the upstream PyTorch repo. |
## Quantization recipe (reproducible)
```bash
# 1. SmoothQuant pre-processing (migrate outliers into weights)
python smoothquant_onnx.py \
  --fp32 model.onnx --output model.smoothed.fp32.onnx \
  --tokenizer <upstream snapshot> --alpha 0.8

# 2. Standard per-channel dynamic INT8 quantize on the smoothed FP32
python -c "from onnxruntime.quantization import quantize_dynamic, QuantType; \
quantize_dynamic('model.smoothed.fp32.onnx', 'model.smoothed.int8.onnx', \
per_channel=True, op_types_to_quantize=['MatMul'], \
weight_type=QuantType.QInt8, use_external_data_format=True)"
```
The full driver lives at github.com/CrispStrobe/fastembed-rs
under tools/dump_reference.py (validation) and the wip/validation branch's
/scripts/smoothquant_onnx.py + /scripts/quant_smoothed_int8.py.
## Usage
This artifact is consumed by fastembed-rs under the canonical model_code `cstr/Octen-Embedding-0.6B-ONNX-INT8` with `model_file = "model.int8.onnx"`. Direct ORT usage (ONNX Runtime ≥ 1.17) is straightforward: load the `.onnx` and ORT will discover the `.data` sidecar automatically as long as both files sit in the same directory.
## License
Apache 2.0, inherited from upstream Octen/Octen-Embedding-0.6B.
## Change history
- 2026-05-03 — Replaced `model.int8.onnx` with the SmoothQuant α=0.8 export; the original vanilla INT8 is archived as `model.int8.vanilla.onnx`. Reason: vanilla `quantize_dynamic` produced cos_min=0.64 on this Qwen3-class decoder LLM (catastrophic outlier collapse on multilingual inputs); SmoothQuant recovers cos_min≈0.99.
- Original upload — vanilla `quantize_dynamic` per-channel INT8 export. Now archived.
## Base model

Qwen/Qwen3-0.6B-Base