---
base_model: cerebras/GLM-4.5-Air-REAP-82B-A12B
tags:
- text-generation-inference
- transformers
- unsloth
- glm4_moe
- pruning
- REAP
- MOE
- glm
- zirel
license: mit
language:
- en
library_name: transformers
---

# Daemontatox/Zirel-3

## Model Description

**Zirel-3** is a specialized finetune of [cerebras/GLM-4.5-Air-REAP-82B-A12B](https://huggingface.co/cerebras/GLM-4.5-Air-REAP-82B-A12B), a memory-efficient 82B-parameter Mixture-of-Experts (MoE) model (~12B active per forward pass) compressed using the REAP (Router-weighted Expert Activation Pruning) technique.

### Base Model: GLM-4.5-Air-REAP-82B-A12B

The base model is a compressed variant of GLM-4.5-Air that:

- Maintains **near-identical performance** while being **25% lighter** (compressed from 110B to 82B total parameters)
- Uses **82B total parameters** (~12B activated per forward pass)
- Employs the **REAP pruning method**, which outperforms expert merging, especially on generative tasks
- Retains full capabilities: code generation, agentic workflows, repository-scale understanding, and function calling
- Achieves **drop-in compatibility** with vanilla vLLM (no custom patches required)

### REAP Technology

REAP (Router-weighted Expert Activation Pruning) is a one-shot MoE compression method that:

- Prunes low-impact experts based on router gate values and expert activation norms (see the sketch below)
- Preserves the router's independent control over the remaining experts
- Significantly outperforms expert merging on generative benchmarks (code, creative writing, math)
- Maintains 95-97% of baseline model quality even at high compression ratios

**Paper**: [REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression](https://arxiv.org/abs/2510.13999) (Lasby et al., 2025)
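To make the pruning criterion concrete, here is a minimal, illustrative sketch of router-weighted expert saliency scoring for a single MoE layer. The function name and the exact form of the score are assumptions based on the description above (average router gate value combined with the average expert output norm over a calibration set); this is not the reference implementation from the paper.

```python
import torch

def reap_expert_saliency(gate_weights: torch.Tensor,
                         expert_out_norms: torch.Tensor) -> torch.Tensor:
    """Hypothetical per-expert saliency score for one MoE layer.

    gate_weights:     [num_tokens, num_experts] router gate values on a
                      calibration batch (zero where an expert is not routed).
    expert_out_norms: [num_tokens, num_experts] L2 norms of each expert's
                      output for the tokens routed to it (zero elsewhere).
    Returns a [num_experts] score; the lowest-scoring experts are pruned.
    """
    # Average gate value and activation norm per expert over the calibration
    # tokens, combined multiplicatively into a single saliency score.
    mean_gate = gate_weights.mean(dim=0)
    mean_norm = expert_out_norms.mean(dim=0)
    return mean_gate * mean_norm

# Toy example: 8 experts scored on 1024 calibration tokens, pruning the
# 2 lowest-saliency experts.
torch.manual_seed(0)
gates = torch.rand(1024, 8)
norms = torch.rand(1024, 8)
scores = reap_expert_saliency(gates, norms)
pruned = torch.topk(scores, k=2, largest=False).indices
print("Experts selected for pruning:", sorted(pruned.tolist()))
```

Because only whole experts are removed and the gate weights for the surviving experts are left untouched, the router keeps independent control over them, which is what distinguishes pruning from expert merging in this setting.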
### Zirel-3 Finetune

This finetune was trained on a custom curated dataset designed to enhance the model's capabilities across multiple domains, including instruction following, reasoning, and domain-specific knowledge. The training builds on the strong foundation of the REAP-compressed GLM-4.5-Air base model.

## Model Specifications

- **Total Parameters**: 82B (~12B active per forward pass)
- **Architecture**: Sparse Mixture-of-Experts (SMoE)
- **Context Length**: 128K tokens
- **Precision**: BF16/FP16 compatible
- **License**: MIT

## Usage

### Installation

```bash
pip install transformers torch vllm
```

### Inference with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Prepare conversation
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the REAP pruning method in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```

### Inference with vLLM (Recommended for Production)

vLLM provides significantly faster inference with built-in optimizations for MoE models:

```bash
# Serve the model
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-num-seqs 64 \
  --dtype bfloat16
```

**Python Client:**

```python
from openai import OpenAI

# Connect to the vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Create a chat completion
response = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a Python function to implement binary search."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
```

### Streaming Response

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Explain quantum computing"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Stream decoded tokens as they are generated, skipping the echoed prompt
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)

generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Run generation in a background thread so the stream can be consumed here
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end='', flush=True)
```

### vLLM Advanced Configuration

```bash
# Multi-GPU setup with expert parallelism
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-num-seqs 64 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.9 \
  --swap-space 16 \
  --disable-log-requests

# For low-memory situations
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --max-num-seqs 32 \
  --max-model-len 16384 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85
```
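### Streaming with the vLLM Server

The OpenAI-compatible endpoint exposed by the vLLM server also supports streaming. The snippet below is a small sketch assuming a server started with one of the commands above (same base URL and model name as in the Python client example); the prompt text is only illustrative.

```python
from openai import OpenAI

# Connect to the vLLM OpenAI-compatible server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Request a streamed chat completion and print chunks as they arrive
stream = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[{"role": "user", "content": "Summarize the REAP pruning method."}],
    temperature=0.7,
    max_tokens=512,
    stream=True,
)

for chunk in stream:
    # Some chunks (e.g. the final one) may carry no content delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```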
### Batch Processing Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Decoder-only models should be left-padded for batched generation
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Batch of prompts
prompts = [
    "Explain machine learning",
    "Write a sorting algorithm",
    "What is the capital of France?"
]

# Convert to chat format
conversations = [
    [{"role": "user", "content": prompt}]
    for prompt in prompts
]

# Apply chat template to all
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
    for conv in conversations
]

# Tokenize with padding
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=2048
).to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id
)

# Decode only the newly generated tokens for each prompt
responses = tokenizer.batch_decode(
    outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True
)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n{'-'*50}")
```

## Limitations

- This is a large MoE model requiring substantial compute resources
- Performance may vary based on hardware and optimization settings
- May inherit biases present in training data
- Requires careful prompt engineering for optimal results

## Citation

If you use this model, please cite both the base model and the REAP paper:

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{zirel3,
  title={Zirel-3: A Specialized Finetune of GLM-4.5-Air-REAP},
  author={Daemontatox},
  year={2025},
  howpublished={\url{https://huggingface.co/Daemontatox/Zirel-3}}
}
```

## Acknowledgments

This model builds upon:

- **Cerebras Research** for the REAP compression method and the GLM-4.5-Air-REAP base model
- The original **GLM-4.5-Air** model by Zhipu AI
- The open-source AI community for tooling and infrastructure

## License

MIT License, the same as the base model.