
Qwen3-21B Pruned from 30B (90 Experts)

A pruned version of Qwen3-30B-A3B-Instruct with 38 experts removed per layer through expert pruning, reducing the model from ~30B to approximately 21B parameters and cutting the checkpoint size from 56.9 GB to 40.8 GB.

Model Details

  • Base Model: Qwen/Qwen3-30B-A3B-Instruct-2507-FP8
  • Architecture: Mixture of Experts (MoE) Transformer
  • Original Parameters: ~30B
  • Pruned Parameters: ~21B
  • Original Experts: 128 per layer
  • Pruned Experts: 90 per layer (38 removed)
  • Size Reduction: 28.2% parameter reduction
  • Quality Impact: +7.36% evaluation loss relative to the base model (quality degradation)

Pruning Methodology

Expert Usage Analysis

Used real-time router logit analysis to identify the least-utilized experts across the model (a usage-counting sketch follows the list below):

  • Analyzed expert routing patterns with output_router_logits=True
  • Tracked expert selection frequency across multiple inference samples
  • Identified 38 least-used experts for removal based on actual usage statistics
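A minimal sketch of the usage-counting step described above, assuming the Hugging Face transformers Qwen3-MoE implementation (`output_router_logits`, `config.num_experts`, `config.num_experts_per_tok`); the prompts and the aggregation across layers are illustrative simplifications, not the exact analysis pipeline.

```python
import torch
from collections import Counter
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B-Instruct-2507-FP8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Start every expert at zero so never-selected experts still appear in the ranking.
usage = Counter({i: 0 for i in range(model.config.num_experts)})

prompts = [
    "Explain mixture-of-experts routing in two sentences.",
    "Write a short haiku about autumn.",
]  # illustrative; the real analysis used many more inference samples

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Ask the model to return per-layer router logits alongside the normal output.
        out = model(**inputs, output_router_logits=True)
    for layer_logits in out.router_logits:  # one (tokens, num_experts) tensor per MoE layer
        top_k = layer_logits.topk(model.config.num_experts_per_tok, dim=-1).indices
        usage.update(top_k.flatten().tolist())  # aggregated across layers here for brevity

# The 38 least-selected experts are the removal candidates.
least_used = [idx for idx, _ in usage.most_common()[-38:]]
print("Candidate experts for removal:", sorted(least_used))
```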

True Architectural Pruning

Unlike weight masking approaches, this model features genuine architectural changes (sketched in code after this list):

  • In-place expert removal: deleted unused expert modules
  • Router adjustment: Reduced router dimensions from 128 → 90 outputs
  • Weight remapping: Preserved routing weights for remaining experts
  • Config updates: Model configuration reflects new expert count
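A minimal sketch of the removal, router slicing, and config update, assuming the transformers Qwen3-MoE module layout (`model.model.layers[i].mlp.experts` as an `nn.ModuleList` and `mlp.gate` as an `nn.Linear` over the expert count); `keep_experts` is the list of retained expert indices from the analysis above, and a single shared keep list is used here for brevity.

```python
import torch
from torch import nn

def prune_experts(model, keep_experts):
    """Drop unused experts in place and shrink the router to match.

    keep_experts: sorted indices of the experts to retain (e.g. 90 of 128).
    """
    keep = sorted(keep_experts)
    for layer in model.model.layers:
        mlp = layer.mlp
        # 1. In-place expert removal: keep only the selected expert modules.
        mlp.experts = nn.ModuleList(mlp.experts[i] for i in keep)
        # 2. Router adjustment: shrink the gate from 128 to 90 outputs while
        #    preserving the routing weight rows of the remaining experts.
        old_gate = mlp.gate
        new_gate = nn.Linear(
            old_gate.in_features, len(keep), bias=False,
            dtype=old_gate.weight.dtype, device=old_gate.weight.device,
        )
        with torch.no_grad():
            new_gate.weight.copy_(old_gate.weight[keep])
        mlp.gate = new_gate
        mlp.num_experts = len(keep)
    # 3. Config update: make the saved checkpoint advertise the new expert count.
    model.config.num_experts = len(keep)
    return model
```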

Quality Impact

  • Performance Impact: 7.36% higher evaluation loss than the base model
  • Note: Performance may vary across different task types
  • Efficiency Gains: Faster inference due to reduced expert overhead

Technical Specifications

Architecture:
  - Layers: 48
  - Hidden Size: 2048
  - Attention Heads: 32
  - Experts per Layer: 90 (reduced from 128)
  - Active Experts per Token: 8
  - Context Length: 128K
  - Effective Parameters: ~21B (reduced from ~30B)

Optimizations:
  - FP8 quantization preserved
  - SafeTensors format
  - Flash Attention compatible
  - Efficient expert routing
  - True architectural pruning
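A hedged loading example consistent with the specifications above, assuming the pruned checkpoint loads through the standard transformers path; the repository id and generation settings are illustrative placeholders, not confirmed values.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "your-username/Qwen3-21B-Pruned-90-Experts"  # illustrative repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",                        # keeps the stored tensor types
    device_map="auto",
    attn_implementation="flash_attention_2",   # optional; the model is Flash Attention compatible
)

messages = [{"role": "user", "content": "Summarize mixture-of-experts pruning in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```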


| Metric | Original | Pruned | Change |
|--------|----------|--------|--------|
| Total Parameters | ~30B | ~21B | -28.2% |
| Model Size | 56.9 GB | 40.8 GB | -16.1 GB |
| Experts per Layer | 128 | 90 | -38 |
| Evaluation Loss | Baseline | +7.36% | Quality degradation |

@misc{qwen3-pruned-90,
  title  = {Qwen3-21B Pruned Architecture with 90 Experts},
  author = {Expert Pruning},
  year   = {2025},
  note   = {Pruned version of Qwen3-30B-A3B-Instruct with 38 experts removed}
}
