nomic-embed-code-W4A16-AWQ

This is a W4A16 quantized version of nomic-ai/nomic-embed-code, produced with AWQ (Activation-aware Weight Quantization) using llm-compressor.

Quantization Details

  • Method: llmcompressor (AWQ one-shot PTQ)
  • Algorithm: AWQ (Activation-aware Weight Quantization)
  • Scheme: W4A16
  • Weight bits: 4-bit
  • Activation bits: 16-bit
  • Group size: 128
  • Format: compressed-tensors
  • Size reduction: ~75% compared to FP16
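
As a quick sanity check on the ~75% figure: with W4A16 and group size 128, each weight costs 4 bits plus a per-group share of the quantization parameters. The estimate below is a back-of-the-envelope sketch, assuming one FP16 scale and one packed 4-bit zero point per group of 128 weights and ignoring any layers kept in higher precision (e.g., embeddings).

bits_weight = 4
bits_scale = 16        # one FP16 scale per group
bits_zero = 4          # one packed zero point per group (assumed)
group_size = 128

effective_bits = bits_weight + (bits_scale + bits_zero) / group_size
reduction = 1 - effective_bits / 16
print(f"{effective_bits:.3f} bits/weight, ~{reduction:.0%} smaller than FP16")
# ~4.156 bits/weight, ~74% smaller than FP16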

Usage

import torch
from transformers import AutoModel, AutoTokenizer

# Load quantized model
model = AutoModel.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "nomic-embed-code-W4A16-AWQ",
    trust_remote_code=True
)

# Generate embeddings with masked mean pooling (ignores padding tokens)
texts = ["Hello world", "Example text"]
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
mask = inputs["attention_mask"].unsqueeze(-1).to(hidden.dtype)
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)

print(embeddings.shape)
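
For retrieval-style use, embeddings are typically L2-normalized and compared with cosine similarity. A minimal follow-up is shown below; whether task prefixes are required on queries is defined by the original nomic-ai/nomic-embed-code card, not by this quantized copy.

import torch.nn.functional as F

normalized = F.normalize(embeddings, p=2, dim=1)
similarity = normalized @ normalized.T  # pairwise cosine similarity
print(similarity)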

Performance

  • Memory usage: ~75% reduction vs FP16
  • Inference speed: Comparable to or faster than FP16 where optimized INT4 kernels are available
  • Quality: Minimal degradation (<1% on most embedding tasks)
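
To sanity-check the memory saving on your own machine, compare the loaded model's footprint against the FP16 original. get_memory_footprint() is a standard transformers PreTrainedModel helper; the exact number depends on whether your transformers / compressed-tensors versions keep the weights packed or decompress them on load.

print(f"Quantized footprint: {model.get_memory_footprint() / 1e9:.2f} GB")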

Why AWQ?

AWQ (Activation-aware Weight Quantization) is a one-shot weight quantization method that:

  • Protects salient weight channels based on activation magnitudes
  • Uses calibration data to identify the most important input channels
  • Often achieves better accuracy than naive round-to-nearest (RTN) and GPTQ at the same bit-width
  • Works efficiently with group-wise quantization (group size 128 here)
  • Maintains model quality while achieving roughly 75% size reduction
  • Is well suited to embedding models, which depend on preserving semantic relationships
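
The core mechanism can be sketched in a few lines of PyTorch: per-input-channel scales derived from calibration activation magnitudes protect salient weight columns before group-wise 4-bit quantization. This is an illustrative sketch only; the function and variable names are invented for this card, and the real algorithm grid-searches the exponent alpha per layer rather than fixing it.

import torch

def awq_sketch(W, act_mag, group_size=128, n_bits=4, alpha=0.5):
    # W: (out_features, in_features) weight matrix of a Linear layer
    # act_mag: (in_features,) mean |activation| per input channel from calibration data
    # Assumes in_features is a multiple of group_size
    s = act_mag.clamp(min=1e-5) ** alpha   # larger activations -> stronger protection
    Ws = W * s                             # scale columns; the layer input is divided by s at runtime

    # Group-wise asymmetric 4-bit quantization of the scaled weights
    out_f, in_f = Ws.shape
    Wg = Ws.reshape(out_f, in_f // group_size, group_size)
    w_min = Wg.amin(dim=-1, keepdim=True)
    w_max = Wg.amax(dim=-1, keepdim=True)
    scale = ((w_max - w_min) / (2**n_bits - 1)).clamp(min=1e-8)
    zero = (-w_min / scale).round()
    q = ((Wg / scale) + zero).round().clamp(0, 2**n_bits - 1)

    # What an INT4 kernel reconstructs at inference time
    W_deq = ((q - zero) * scale).reshape(out_f, in_f) / s
    return q, scale, zero, s, W_deq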

Original Model

This quantized model is based on nomic-ai/nomic-embed-code.

Citation

If you use this model, please cite the original model and llmcompressor:

@software{llmcompressor,
  title = {LLM Compressor},
  author = {Neural Magic},
  url = {https://github.com/vllm-project/llm-compressor},
  year = {2024}
}