---
license: apache-2.0
pipeline_tag: text-generation
library_name: transformers
tags:
- multilingual
- compliant
- swiss-ai
- apertus
- int4
- vllm
- compressed-tensors
- llm-compressor
base_model:
- swiss-ai/Apertus-70B-Instruct-2509
---

## Model Overview
- **Model Architecture:** ApertusForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
- **Release Date:** 9/22/2025
- **Version:** 1.0
- **Model Developers:** Red Hat

Quantized version of [swiss-ai/Apertus-70B-Instruct-2509](https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509).

### Model Optimizations

This model was obtained by quantizing the weights of [swiss-ai/Apertus-70B-Instruct-2509](https://huggingface.co/swiss-ai/Apertus-70B-Instruct-2509) to the INT4 data type. This optimization reduces the number of bits per weight from 16 to 4, reducing the disk size and GPU memory requirements by approximately 75%. Only the weights of the linear operators within transformer blocks are quantized.

## Deployment

### Use with vLLM

1. Initialize vLLM server:
```
vllm serve RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16
```

2. Send requests to the server:

```python
from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"
client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16"

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
```
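vLLM also supports offline batched inference. The snippet below is a minimal sketch of that path; the `tensor_parallel_size`, `max_model_len`, and sampling settings are assumptions to adapt to your hardware, not part of the deployment instructions above.

```python
from vllm import LLM, SamplingParams

# Offline-inference sketch (assumed settings: 2 GPUs, 4K context).
llm = LLM(
    model="RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16",
    tensor_parallel_size=2,
    max_model_len=4096,
)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language models."},
]

# Apply the model's chat template and generate a single response.
outputs = llm.chat(messages, sampling_params)
print(outputs[0].outputs[0].text)
```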
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

**Model Creation Code**

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model_stub = "swiss-ai/Apertus-70B-Instruct-2509"
model_name = model_stub.split("/")[-1]

model = AutoModelForCausalLM.from_pretrained(model_stub, dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_stub)

# Configure the quantization algorithm and scheme:
# weight-only INT4 with 16-bit activations (W4A16), skipping the lm_head
recipe = QuantizationModifier(
    ignore=["lm_head"],
    targets="Linear",
    scheme="W4A16",
)

# Apply quantization
oneshot(
    model=model,
    recipe=recipe,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w4a16"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
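As a quick sanity check (not part of the original recipe), you can confirm that the saved directory carries a compressed-tensors quantization config; the path below assumes the `save_path` produced by the snippet above.

```python
from transformers import AutoConfig

# Sanity-check sketch: a checkpoint saved in compressed-tensors format
# records its quantization settings in config.json.
save_path = "Apertus-70B-Instruct-2509-quantized.w4a16"
config = AutoConfig.from_pretrained(save_path)
print(getattr(config, "quantization_config", "no quantization_config found"))
```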
## Evaluation

The model was evaluated on the OpenLLM Leaderboard [V1](https://huggingface.co/spaces/open-llm-leaderboard-old/open_llm_leaderboard) benchmarks, using the following command:
**Evaluation Commands**

OpenLLM Leaderboard V1:
```
lm_eval \
  --model vllm \
  --model_args pretrained="RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16",dtype=auto,add_bos_token=True,max_model_len=4096,tensor_parallel_size=2,gpu_memory_utilization=0.2,enable_chunked_prefill=True \
  --tasks openllm \
  --write_out \
  --batch_size auto \
  --output_path output_dir \
  --show_config
```
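The harness writes its results as JSON under `output_dir`. The helper below is a hedged sketch for pulling out the per-task metrics; the file layout and metric key names vary between lm-eval versions, so treat the glob pattern and keys as assumptions.

```python
import glob
import json

# Sketch: locate the most recent lm-eval results file under output_dir and
# print each task's numeric metrics. Adjust the pattern/keys to your version.
results_path = sorted(glob.glob("output_dir/**/results*.json", recursive=True))[-1]
with open(results_path) as f:
    results = json.load(f)["results"]

for task, metrics in results.items():
    numeric = {name: value for name, value in metrics.items() if isinstance(value, (int, float))}
    print(task, numeric)
```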
### Accuracy
| Category | Metric | swiss-ai/Apertus-70B-Instruct-2509 | RedHatAI/Apertus-70B-Instruct-2509-quantized.w4a16 | Recovery (%) |
| --- | --- | --- | --- | --- |
| **OpenLLM V1** | ARC-Challenge (Acc-Norm, 25-shot) | 70.82 | 70.65 | 99.8 |
| | GSM8K (Strict-Match, 5-shot) | 73.69 | 73.45 | 99.7 |
| | HellaSwag (Acc-Norm, 10-shot) | 86.23 | 85.67 | 99.4 |
| | MMLU (Acc, 5-shot) | 69.21 | 68.25 | 98.6 |
| | TruthfulQA (MC2, 0-shot) | 60.31 | 60.55 | 100.4 |
| | Winogrande (Acc, 5-shot) | 80.74 | 80.03 | 99.1 |
| | **Average Score** | **73.50** | **73.10** | **99.5** |
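Recovery is the quantized score expressed as a percentage of the baseline score. A short check of the reported averages:

```python
# Per-task scores from the table above (baseline vs. quantized).
baseline = [70.82, 73.69, 86.23, 69.21, 60.31, 80.74]
quantized = [70.65, 73.45, 85.67, 68.25, 60.55, 80.03]

avg_base = sum(baseline) / len(baseline)
avg_quant = sum(quantized) / len(quantized)

print(f"Average baseline:  {avg_base:.2f}")                    # 73.50
print(f"Average quantized: {avg_quant:.2f}")                   # 73.10
print(f"Recovery:          {avg_quant / avg_base * 100:.1f}%")  # 99.5%
```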