---
base_model: cerebras/GLM-4.5-Air-REAP-82B-A12B
tags:
- text-generation-inference
- transformers
- unsloth
- glm4_moe
- pruning
- REAP
- MOE
- glm
- zirel
license: mit
language:
- en
library_name: transformers
---

# Daemontatox/Zirel-3

## Model Description

**Zirel-3** is a specialized finetune of [cerebras/GLM-4.5-Air-REAP-82B-A12B](https://huggingface.co/cerebras/GLM-4.5-Air-REAP-82B-A12B), a memory-efficient 82B-parameter Mixture-of-Experts (MoE) model (~12B active per forward pass) compressed using the REAP (Router-weighted Expert Activation Pruning) technique.

### Base Model: GLM-4.5-Air-REAP-82B-A12B

The base model is a compressed variant of GLM-4.5-Air that:

- Maintains **near-identical performance** while being **25% lighter** (compressed from 110B to 82B total parameters)
- Uses **82B total parameters** (~12B activated per forward pass)
- Employs the **REAP pruning method**, which outperforms expert merging, especially on generative tasks
- Retains full capabilities: code generation, agentic workflows, repository-scale understanding, and function calling
- Achieves **drop-in compatibility** with vanilla vLLM (no custom patches required)

### REAP Technology

REAP (Router-weighted Expert Activation Pruning) is a one-shot MoE compression method that:

- Prunes low-impact experts based on router gate values and expert activation norms (see the sketch below)
- Preserves the router's independent control over the remaining experts
- Significantly outperforms expert merging on generative benchmarks (code, creative writing, math)
- Maintains 95-97% of baseline model quality even at high compression ratios

**Paper**: [REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression](https://arxiv.org/abs/2510.13999) (Lasby et al., 2025)
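To make the pruning criterion concrete, here is a minimal, illustrative sketch of router-weighted expert saliency scoring for a single MoE layer. The function name and the exact form of the score are assumptions based on the description above (average router gate value combined with the average expert output norm over a calibration set); this is not the reference implementation from the paper.

```python
import torch

def reap_expert_saliency(gate_weights: torch.Tensor,
                         expert_out_norms: torch.Tensor) -> torch.Tensor:
    """Hypothetical per-expert saliency score for one MoE layer.

    gate_weights:     [num_tokens, num_experts] router gate values on a
                      calibration batch (zero where an expert is not routed).
    expert_out_norms: [num_tokens, num_experts] L2 norms of each expert's
                      output for the tokens routed to it (zero elsewhere).
    Returns a [num_experts] score; the lowest-scoring experts are pruned.
    """
    # Average gate value and activation norm per expert over the calibration
    # tokens, combined multiplicatively into a single saliency score.
    mean_gate = gate_weights.mean(dim=0)
    mean_norm = expert_out_norms.mean(dim=0)
    return mean_gate * mean_norm

# Toy example: 8 experts scored on 1024 calibration tokens, pruning the
# 2 lowest-saliency experts.
torch.manual_seed(0)
gates = torch.rand(1024, 8)
norms = torch.rand(1024, 8)
scores = reap_expert_saliency(gates, norms)
pruned = torch.topk(scores, k=2, largest=False).indices
print("Experts selected for pruning:", sorted(pruned.tolist()))
```

Because only whole experts are removed and the gate weights for the surviving experts are left untouched, the router keeps independent control over them, which is what distinguishes pruning from expert merging in this setting.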
### Zirel-3 Finetune

This finetune was trained on a custom curated dataset designed to enhance the model's capabilities across multiple domains, including instruction following, reasoning, and domain-specific knowledge. The training builds on the strong foundation of the REAP-compressed GLM-4.5-Air base model.

## Model Specifications

- **Total Parameters**: 82B (~12B active per forward pass)
- **Architecture**: Sparse Mixture-of-Experts (SMoE)
- **Context Length**: 128K tokens
- **Precision**: BF16/FP16 compatible
- **License**: MIT

## Usage

### Installation

```bash
pip install transformers torch vllm
```

### Inference with Transformers

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load model and tokenizer
model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Prepare conversation
messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the REAP pruning method in simple terms."}
]

# Apply chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

# Tokenize
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    repetition_penalty=1.1
)

# Decode only the newly generated tokens
response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(response)
```

### Inference with vLLM (Recommended for Production)

vLLM provides significantly faster inference with built-in optimizations for MoE models:

```bash
# Serve the model
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-num-seqs 64 \
  --dtype bfloat16
```

**Python Client:**

```python
from openai import OpenAI

# Connect to the vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Create a chat completion
response = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "Write a Python function to implement binary search."}
    ],
    temperature=0.7,
    max_tokens=512
)

print(response.choices[0].message.content)
```

### Streaming Response

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
import torch
from threading import Thread

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

messages = [
    {"role": "user", "content": "Explain quantum computing"}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Stream decoded tokens as they are generated, skipping the echoed prompt
streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)

generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Run generation in a background thread so the stream can be consumed here
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end='', flush=True)
```

### vLLM Advanced Configuration

```bash
# Multi-GPU setup with expert parallelism
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 8 \
  --enable-expert-parallel \
  --max-num-seqs 64 \
  --max-model-len 32768 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.9 \
  --swap-space 16 \
  --disable-log-requests

# For low-memory situations
vllm serve Daemontatox/Zirel-3 \
  --tensor-parallel-size 4 \
  --enable-expert-parallel \
  --max-num-seqs 32 \
  --max-model-len 16384 \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.85
```
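### Streaming with the vLLM Server

The OpenAI-compatible endpoint exposed by the vLLM server also supports streaming. The snippet below is a small sketch assuming a server started with one of the commands above (same base URL and model name as in the Python client example); the prompt text is only illustrative.

```python
from openai import OpenAI

# Connect to the vLLM OpenAI-compatible server started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Request a streamed chat completion and print chunks as they arrive
stream = client.chat.completions.create(
    model="Daemontatox/Zirel-3",
    messages=[{"role": "user", "content": "Summarize the REAP pruning method."}],
    temperature=0.7,
    max_tokens=512,
    stream=True,
)

for chunk in stream:
    # Some chunks (e.g. the final one) may carry no content delta
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```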
### Batch Processing Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "Daemontatox/Zirel-3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Decoder-only models should be left-padded for batched generation
tokenizer.padding_side = "left"
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Batch of prompts
prompts = [
    "Explain machine learning",
    "Write a sorting algorithm",
    "What is the capital of France?"
]

# Convert to chat format
conversations = [
    [{"role": "user", "content": prompt}]
    for prompt in prompts
]

# Apply chat template to all
texts = [
    tokenizer.apply_chat_template(conv, tokenize=False, add_generation_prompt=True)
    for conv in conversations
]

# Tokenize with padding
inputs = tokenizer(
    texts,
    return_tensors="pt",
    padding=True,
    truncation=True,
    max_length=2048
).to(model.device)

# Generate
outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    pad_token_id=tokenizer.pad_token_id
)

# Decode only the newly generated tokens for each prompt
responses = tokenizer.batch_decode(
    outputs[:, inputs['input_ids'].shape[1]:], skip_special_tokens=True
)
for prompt, response in zip(prompts, responses):
    print(f"Q: {prompt}\nA: {response}\n{'-'*50}")
```

## Limitations

- This is a large MoE model requiring substantial compute resources
- Performance may vary based on hardware and optimization settings
- May inherit biases present in training data
- Requires careful prompt engineering for optimal results

## Citation

If you use this model, please cite both the base model and the REAP paper:

```bibtex
@article{lasby2025reap,
  title={REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author={Lasby, Mike and Lazarevich, Ivan and Sinnadurai, Nish and Lie, Sean and Ioannou, Yani and Thangarasa, Vithursan},
  journal={arXiv preprint arXiv:2510.13999},
  year={2025}
}

@misc{zirel3,
  title={Zirel-3: A Specialized Finetune of GLM-4.5-Air-REAP},
  author={Daemontatox},
  year={2025},
  howpublished={\url{https://huggingface.co/Daemontatox/Zirel-3}}
}
```

## Acknowledgments

This model builds upon:

- **Cerebras Research** for the REAP compression method and the GLM-4.5-Air-REAP base model
- The original **GLM-4.5-Air** model by Zhipu AI
- The open-source AI community for tooling and infrastructure

## License

MIT License, the same as the base model.