German MoE GPT v8 - OPUS EDITION

A research-grade language model with state-of-the-art Mixture-of-Experts (MoE) architecture, trained on consumer hardware (RTX 4090). This implementation follows best practices from recent MoE research (ST-MoE, Switch Transformer) while maintaining full cross-platform compatibility.

Note: While this model was trained on German data, the architecture is language-agnostic and can be used for any language dataset. Simply replace the training corpus with your target language data.

Model Description

This is a 149.6M parameter Mixture-of-Experts (MoE) language model trained on high-quality German text data. The model uses a hybrid architecture combining dense and sparse (MoE) layers for optimal parameter efficiency.

Key Features

  • 🏗️ Hybrid Dense + MoE Architecture: Every 2nd layer uses MoE for efficiency
  • 🔬 Research-Backed: Implements ST-MoE and Switch Transformer best practices
  • Efficient: Only ~33% of parameters active per token
  • 🖥️ Cross-Platform: Pure PyTorch, runs on Windows/Linux/macOS
  • 🤗 HuggingFace Compatible: Full integration with transformers library

Model Specifications

| Specification | Value |
|---|---|
| Total Parameters | 149.6M |
| Active Parameters per Token | 49.9M (33%) |
| Vocabulary Size | 128,256 (Llama 3.2 Tokenizer) |
| Context Length | 2048 tokens |
| Architecture | Hybrid Dense + MoE Transformer |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Experts per MoE Layer | 32 |
| Active Experts (Top-k) | 2 |
| Position Embeddings | RoPE (Rotary Position Embeddings) |
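
For orientation, the specifications above map onto a configuration roughly like the following. This is a hypothetical dataclass for illustration only; the field names are assumptions, not the repository's actual config class.

from dataclasses import dataclass

@dataclass
class MoEGPTConfig:
    # Illustrative only: mirrors the specification table above.
    vocab_size: int = 128_256          # Llama 3.2 tokenizer
    max_seq_len: int = 2048            # context length
    num_layers: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12
    num_experts: int = 32              # experts per MoE layer
    top_k: int = 2                     # active experts per token
    moe_layer_interval: int = 2        # every 2nd layer is an MoE layer

config = MoEGPTConfig()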

Training Data

The model was trained on a 17.4 GB curated German corpus consisting of:

  • Clean German Wikipedia (~11 GB): Encyclopedic knowledge
  • OpenSubtitles (German): Natural dialog and conversational language
  • Belletristik: German literature for style and creativity

Data Quality: The corpus was deduplicated and filtered for SEO spam to ensure a high-quality training signal.

Adapting to other languages: The architecture is language-agnostic. Replace the dataset with your target language corpus and retrain.

Training Details

Training Hyperparameters

  • Steps: 300,000
  • Batch Size: 32 (with gradient accumulation)
  • Learning Rate: 3e-4 (max)
  • Hardware: Single RTX 4090 (24GB VRAM)
  • Training Time: ~120 hours
  • Precision: Mixed (BF16)
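
A minimal sketch of one optimizer step under these settings, assuming BF16 autocast and gradient accumulation. Here model, train_loader, and the accumulation factor are placeholders, and learning-rate warmup is omitted; this is not the exact training script of the original run.

import torch

accum_steps = 8                                   # placeholder accumulation factor
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300_000)

for step, (input_ids, labels) in enumerate(train_loader):
    # BF16 autocast: no GradScaler needed, unlike FP16.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(input_ids.cuda(), labels=labels.cuda()).loss / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)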

Results

| Metric | Initial | Final | Improvement |
|---|---|---|---|
| Training Loss | 12.0 | 2.55 | 79% ↓ |
| Validation Loss | 4.58 | 2.40 | 48% ↓ |
| Perplexity | - | 11.0 | - |

Usage

Installation

pip install transformers torch

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
tokenizer = AutoTokenizer.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")

# Generate text
prompt = "Die Hauptstadt von Deutschland ist"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Advanced Usage

# Generate with custom parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,          # Lower = more deterministic
    top_k=40,                 # Top-k sampling
    top_p=0.95,               # Nucleus sampling
    repetition_penalty=1.1,   # Reduce repetition
    do_sample=True
)

Technical Architecture

MoE Layer Design

The model uses a Noisy Top-k Router with the following components:

  1. Gate Computation: Learned routing weights per expert
  2. Noise Injection: Adds controlled noise during training for exploration
  3. Top-k Selection: Routes each token to the 2 best experts
  4. Capacity Management: Prevents expert overload with dynamic capacity limits
  5. Load Balancing: Auxiliary loss ensures uniform expert utilization
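
A minimal sketch of such a noisy top-k router, illustrating steps 1-3 and the logits that feed steps 4-5. Dimensions and names are assumptions, not the repository's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    def __init__(self, hidden_size=768, num_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)    # 1. gate logits
        self.noise = nn.Linear(hidden_size, num_experts, bias=False)   # 2. learned noise scale

    def forward(self, x):                          # x: (num_tokens, hidden_size)
        logits = self.gate(x)
        if self.training:                          # 2. noise injection for exploration
            logits = logits + torch.randn_like(logits) * F.softplus(self.noise(x))
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)            # 3. top-k selection
        weights = F.softmax(top_vals, dim=-1)      # combination weights for chosen experts
        # Capacity limits (4.) and the auxiliary losses (5.) are applied
        # downstream using `logits` and `top_idx`.
        return weights, top_idx, logits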

Loss Functions

The training loss combines three components:

L_total = L_ce + α * L_aux + β * L_z

where:

  • L_ce: Cross-entropy language modeling loss
  • L_aux: Load balance loss (α = 0.01) for uniform expert utilization
  • L_z: Router z-loss (β = 0.001) for numerical stability
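
A sketch of the two auxiliary terms, following the Switch Transformer load-balance loss and the ST-MoE router z-loss. The shapes and the top-1 approximation for token counts are assumptions made for illustration.

import torch
import torch.nn.functional as F

def aux_losses(router_logits, top_idx, num_experts=32, alpha=0.01, beta=0.001):
    # router_logits: (num_tokens, num_experts); top_idx: (num_tokens, top_k)
    probs = F.softmax(router_logits, dim=-1)
    # Load balance loss: fraction of tokens assigned to each expert (top-1 here)
    # times the mean routing probability per expert, scaled by num_experts.
    token_frac = F.one_hot(top_idx[:, 0], num_experts).float().mean(dim=0)
    prob_frac = probs.mean(dim=0)
    l_aux = num_experts * torch.sum(token_frac * prob_frac)
    # Router z-loss: penalizes large routing logits for numerical stability.
    l_z = torch.logsumexp(router_logits, dim=-1).pow(2).mean()
    return alpha * l_aux + beta * l_z

# L_total = L_ce + aux_losses(router_logits, top_idx)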

Attention Mechanism

  • RoPE (Rotary Position Embeddings) for position encoding
  • PyTorch SDPA with automatic backend selection (Flash Attention when available)
  • Causal masking for autoregressive generation
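
Illustratively, the attention call reduces to PyTorch's scaled_dot_product_attention with causal masking; RoPE is applied to the query and key tensors beforehand (the rotation itself is omitted here for brevity).

import torch
import torch.nn.functional as F

# q, k, v: (batch, num_heads, seq_len, head_dim) = (1, 12, 2048, 64),
# with RoPE already applied to q and k.
q = torch.randn(1, 12, 2048, 64)
k = torch.randn_like(q)
v = torch.randn_like(q)

# SDPA selects the best available backend (Flash Attention, memory-efficient,
# or math) automatically; is_causal=True applies the autoregressive mask.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)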

Optimizations

  • Gradient Checkpointing: ~40% VRAM reduction
  • Mixed Precision (BF16): 2x faster training
  • Weight Tying: LM head shares embeddings
  • Batch Expert Processing: Parallel computation for all experts
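
Two of these optimizations in sketch form, weight tying and gradient checkpointing. The module names and shapes are illustrative, not the repository's actual modules.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Weight tying: the LM head reuses the embedding matrix instead of its own weights.
embed = nn.Embedding(128_256, 768)
lm_head = nn.Linear(768, 128_256, bias=False)
lm_head.weight = embed.weight

# Gradient checkpointing: activations inside the block are recomputed during
# the backward pass instead of being stored, trading compute for VRAM.
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
x = torch.randn(1, 2048, 768, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)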

Limitations and Biases

  • Language: Primarily trained on German text
  • Domain: General domain (Wikipedia, literature, subtitles)
  • Biases: May reflect biases present in training data
  • Context: Limited to 2048 tokens
  • Compute: Requires GPU for efficient inference

Ethical Considerations

This model is a language model and can generate text that may be:

  • Factually incorrect
  • Biased or stereotypical
  • Inappropriate or offensive

Users should:

  • Verify generated content for factual accuracy
  • Be aware of potential biases
  • Use appropriate content filtering for production applications

Citation

If you use this model in your research, please cite:

@misc{german-moe-gpt-v8,
  title={German MoE GPT v8: A Research-Grade Mixture-of-Experts Language Model},
  author={[Your Name]},
  year={2025},
  howpublished={\url{https://huggingface.co/arnomatic/german-moe-gpt-v8-pretrained}}
}

References

This implementation is based on:

  • ST-MoE: Zoph et al. (2022), "ST-MoE: Designing Stable and Transferable Sparse Expert Models"
  • Switch Transformer: Fedus et al. (2021), "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity"
  • RoPE: Su et al. (2021), "RoFormer: Enhanced Transformer with Rotary Position Embedding"

License

MIT License - See LICENSE file for details

Acknowledgments

  • HuggingFace Transformers team for the excellent framework
  • PyTorch team for SDPA and optimized operations
  • nanoGPT/nanoMoE community for inspiration

Model Card Contact

For questions or feedback, please open an issue in the GitHub repository.
