---
inference: false
base_model: c4ai/command-a-03-2025
pipeline_tag: text-generation
model_type: command-a
tags:
- quantization
- onebit
- compression
- command-a
- text-generation
library_name: transformers
language:
- en
- ja
license: cc-by-nc-4.0
extra_gated_prompt: "By submitting this form, you agree to the [License Agreement](https://cohere.com/c4ai-cc-by-nc-license) and acknowledge that the information you provide will be collected, used, and shared in accordance with Cohere’s [Privacy Policy](https://cohere.com/privacy). You’ll receive email updates about Cohere Labs and Cohere research, events, products and services. You can unsubscribe at any time."
extra_gated_fields:
  Name: text
  Affiliation: text
  Country: country
  I agree to use this model for non-commercial use ONLY: checkbox
---

# **Model Card for qep/qep-1bit-extreme**

🚨 **This model is a 1-bit quantized version of Cohere Labs Command A using QEP.** You can find the unquantized version of Cohere Labs Command A [here](https://huggingface.co/CohereLabs/c4ai-command-a-03-2025).

## **Model Summary**

An optimized 1-bit quantized version of [c4ai/command-a-03-2025](https://huggingface.co/CohereLabs/c4ai-command-a-03-2025) achieving **6.7× compression** with enhanced performance through advanced quantization optimization techniques.

## Key Features

- **Extreme Compression**: 6.7× smaller (207 GB → 30.2 GB, -85%); runs on a single GPU (30.2 GB fits on an A100 80 GB).
- **Enhanced Performance**: [OneBit](https://arxiv.org/abs/2402.11295) quantization, improved by Fujitsu [QEP](https://arxiv.org/abs/2504.09629) & [QQA](https://iclr.cc/virtual/2025/poster/30713).
- **Inference Speed-Up**: Faster inference via bitlinear computation.

## Model Details

- **Base Model**: c4ai/command-a-03-2025
- **Quantization Method**: [OneBit](https://openreview.net/forum?id=ZwiG9KjfHV) with Fujitsu [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713) optimization
- **Quantization Bits**: 1-bit for layers 0-61, FP16 for the last two layers
- **Optimization Techniques**: Fujitsu [QEP](https://arxiv.org/abs/2504.09629), [QQA](https://iclr.cc/virtual/2025/poster/30713)
- **Compatible Hardware**: Single GPU (recommended: >= 40 GB VRAM)

* Developed by: [Fujitsu](https://www.fujitsu.com/), [Cohere](https://cohere.com/) and [Cohere Labs](https://cohere.for.ai/)
* Point of Contact: [Contact form](https://contactline.jp.fujitsu.com/customform/csque04802/873532/) or [Email](mailto:fj-qep@dl.jp.fujitsu.com)
* License: [CC-BY-NC](https://cohere.com/cohere-labs-cc-by-nc-license); use also requires adherence to [Cohere Labs' Acceptable Use Policy](https://docs.cohere.com/docs/cohere-labs-acceptable-use-policy)

For more details on how this model was developed, check out our [Press Release (English)](https://global.fujitsu/-/media/Project/Fujitsu/Fujitsu-HQ/pr/news/2025/09/08-01-en.pdf), [Press Release (Japanese)](https://global.fujitsu/ja-jp/pr/news/2025/09/08-01), Fujitsu's [Tech Report](https://arxiv.org/abs/2504.09629), and Cohere's [Tech Report](https://arxiv.org/abs/2504.00698).

## Usage

The base architecture of this model is **Command-A**. To load and use the model, please use the **CommandA model class** and follow the steps below; two hedged sketches of this flow appear after the note.

1. Load `model.safetensors`, which contains the quantized weights.
2. Replace all layers **except the last two** with **bitlinear implementations**.
3. Keep the **last two layers with non-quantized weights** for optimal performance.
4. Use the included `onebit_linear.py` for the quantized layer implementation. The weights contain the **OneBit-specific `a`, `S`, and `b` components** necessary for reconstruction.
5. Depending on the level of performance you wish to maintain, you may keep additional layers near the output unquantized.

**Note:** Direct loading support as an extension of the `transformers` package is planned for future releases.
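For intuition, here is a minimal sketch of what a bitlinear layer could look like. It is **not** the implementation shipped in `onebit_linear.py`; it assumes, following the OneBit paper, that `S` stores the ±1 sign matrix and that `a` and `b` are FP16 value vectors applied to the input and output features. The actual tensor names, shapes, and packing used in `model.safetensors` may differ.

```python
import torch
import torch.nn as nn


class OneBitLinearSketch(nn.Module):
    """Illustrative OneBit-style bitlinear layer (names assumed, not the shipped code)."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # S: the {-1, +1} sign matrix, stored unpacked as int8 for clarity.
        # A real implementation would bit-pack it (~1 bit per weight).
        self.register_buffer("S", torch.ones(out_features, in_features, dtype=torch.int8))
        # a, b: FP16 value vectors that restore the scale lost by binarization.
        self.a = nn.Parameter(torch.ones(in_features, dtype=torch.float16))
        self.b = nn.Parameter(torch.ones(out_features, dtype=torch.float16))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W is reconstructed implicitly as diag(b) @ S @ diag(a), so
        # y = ((x * a) @ S^T) * b: one sign-only matmul plus two cheap rescalings.
        return ((x * self.a) @ self.S.to(x.dtype).t()) * self.b
```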
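And a hedged end-to-end sketch of steps 1-4, reusing the `OneBitLinearSketch` class above. The decoder layout (`model.model.layers`), the checkpoint key names, and the layer-swapping loop are assumptions for illustration; treat the shipped `onebit_linear.py` as authoritative.

```python
import torch
from safetensors.torch import load_file
from transformers import AutoConfig, AutoModelForCausalLM

NUM_QUANTIZED = 62  # decoder layers 0-61 are 1-bit; layers 62-63 stay FP16

config = AutoConfig.from_pretrained("qep/qep-1bit-extreme")
with torch.device("meta"):  # build the skeleton without allocating real weights
    model = AutoModelForCausalLM.from_config(config)

# Steps 2/3: swap every nn.Linear inside the first 62 decoder layers for a
# bitlinear module, leaving the last two decoder layers untouched (FP16).
for layer in model.model.layers[:NUM_QUANTIZED]:
    for name, module in list(layer.named_modules()):
        if isinstance(module, torch.nn.Linear):
            parent = layer.get_submodule(name.rsplit(".", 1)[0]) if "." in name else layer
            setattr(parent, name.rsplit(".", 1)[-1],
                    OneBitLinearSketch(module.in_features, module.out_features))

# Steps 1/4: load the quantized checkpoint (the a/S/b tensors plus the FP16
# tail layers) into the patched skeleton. assign=True (torch >= 2.1)
# materializes tensors onto the meta skeleton; strict=False tolerates naming
# differences between this sketch and the real onebit_linear.py modules.
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict, assign=True, strict=False)
model.eval()
```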
## Requirements

```
torch>=2.0.0
transformers>=4.35.0
safetensors>=0.4.0
```

## Performance

- **Memory Usage**: 6.7× reduction overall (207 GB → 30.2 GB)
- **Inference Speed**: Optimized for fast generation on a single GPU
- **Quality**: Enhanced performance through [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713) optimization
- **Compatibility**: Deployable on a single GPU

## Technical Specifications

- **Original Model**: Command-A (c4ai/command-a-03-2025)
- **Quantized Layers**: 62 layers (0-61) with 1-bit precision
- **Preserved Layers**: 2 layers (62-63) with FP16 precision
- **Compression Technique**: [OneBit](https://openreview.net/forum?id=ZwiG9KjfHV) + Fujitsu [QEP](https://arxiv.org/abs/2504.09629)/[QQA](https://iclr.cc/virtual/2025/poster/30713)
- **Model Size**: 30.2 GB (from the original 207 GB)

## Future Plans

- **Global and Block-wise Fine-tuning**: Explore fine-tuning strategies, including block-wise methods, to further improve accuracy and robustness.
- **Complete Usage Examples**: Provide detailed implementation guides for efficient single-GPU deployment.
- **Optimization Updates**: Enhance performance with next-generation quantization techniques and improved reconstruction methods.

Currently, the quantization process keeps the last two layers in **non-quantized weights** to maintain output quality, while applying aggressive **1-bit quantization** to the remaining layers. Future releases will integrate **block-wise fine-tuning** for additional performance gains.

## Ethical Considerations

This model inherits the capabilities and limitations of the base Command A model. Please refer to the original model's documentation for ethical guidelines and potential biases.

## **Model Card Contact**

For errors or additional questions about details in this model card, contact fj-qep@dl.jp.fujitsu.com.

## **Terms of Use**

We hope that the release of this model will make community-based research efforts more accessible by releasing the weights of a highly performant model to researchers all over the world. This model is governed by a [CC-BY-NC](https://cohere.com/cohere-labs-cc-by-nc-license) license and also requires adherence to [Cohere Labs' Acceptable Use Policy](https://docs.cohere.com/docs/cohere-labs-acceptable-use-policy).

## **Citation**

If you use this model, please cite:

```bibtex
@misc{command-a-onebit-hybrid,
  title={Command-A 111B with QEP-Optimized OneBit Extreme Quantization},
  author={Yuma Ichikawa and Yusei Kawakami and Yoshiyuki Ishii and Keiji Kimura and Akira Sakai},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/qep/qep-1bit-extreme}
}
```

## License

This quantized model is released under the same license as the base Command A model (CC-BY-NC-4.0).

---