German MoE GPT v8 - OPUS EDITION

A research-grade language model with state-of-the-art Mixture-of-Experts (MoE) architecture, trained on consumer hardware (RTX 4090). This implementation follows best practices from recent MoE research (ST-MoE, Switch Transformer) while maintaining full cross-platform compatibility.

Note: While this model was trained on German data, the architecture is language-agnostic and can be used for any language dataset. Simply replace the training corpus with your target language data.

Model Description

This is a 149.6M parameter Mixture-of-Experts (MoE) language model trained on high-quality German text data. The model uses a hybrid architecture combining dense and sparse (MoE) layers for optimal parameter efficiency.

Key Features

  • 🏗️ Hybrid Dense + MoE Architecture: Every 2nd layer uses MoE for efficiency
  • 🔬 Research-Backed: Implements ST-MoE and Switch Transformer best practices
  • Efficient: Only ~33% of parameters active per token
  • 🖥️ Cross-Platform: Pure PyTorch, runs on Windows/Linux/macOS
  • 🤗 HuggingFace Compatible: Full integration with transformers library

Model Specifications

| Specification | Value |
|---|---|
| Total Parameters | 149.6M |
| Active Parameters per Token | 49.9M (33%) |
| Vocabulary Size | 128,256 (Llama 3.2 Tokenizer) |
| Context Length | 2048 tokens |
| Architecture | Hybrid Dense + MoE Transformer |
| Layers | 12 |
| Hidden Size | 768 |
| Attention Heads | 12 |
| Experts per MoE Layer | 32 |
| Active Experts (Top-k) | 2 |
| Position Embeddings | RoPE (Rotary Position Embeddings) |
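
For orientation, the specifications above map onto a configuration roughly like the following. This is a hypothetical dataclass for illustration only; the field names are assumptions, not the repository's actual config class.

from dataclasses import dataclass

@dataclass
class MoEGPTConfig:
    # Illustrative only: mirrors the specification table above.
    vocab_size: int = 128_256          # Llama 3.2 tokenizer
    max_seq_len: int = 2048            # context length
    num_layers: int = 12
    hidden_size: int = 768
    num_attention_heads: int = 12
    num_experts: int = 32              # experts per MoE layer
    top_k: int = 2                     # active experts per token
    moe_layer_interval: int = 2        # every 2nd layer is an MoE layer

config = MoEGPTConfig()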

Training Data

The model was trained on a 17.4 GB curated German corpus consisting of:

  • Clean German Wikipedia (~11 GB): Encyclopedic knowledge
  • OpenSubtitles (German): Natural dialog and conversational language
  • Belletristik: German literature for style and creativity

Data Quality: The corpus was deduplicated and filtered for SEO spam to ensure a high-quality training signal.

Adapting to other languages: The architecture is language-agnostic. Replace the dataset with your target language corpus and retrain.

Training Details

Training Hyperparameters

  • Steps: 300,000
  • Batch Size: 32 (with gradient accumulation)
  • Learning Rate: 3e-4 (max)
  • Hardware: Single RTX 4090 (24GB VRAM)
  • Training Time: ~120 hours
  • Precision: Mixed (BF16)
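
A minimal sketch of one optimizer step under these settings, assuming BF16 autocast and gradient accumulation. Here model, train_loader, and the accumulation factor are placeholders, and learning-rate warmup is omitted; this is not the exact training script of the original run.

import torch

accum_steps = 8                                   # placeholder accumulation factor
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300_000)

for step, (input_ids, labels) in enumerate(train_loader):
    # BF16 autocast: no GradScaler needed, unlike FP16.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(input_ids.cuda(), labels=labels.cuda()).loss / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)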

Results

| Metric | Initial | Final | Improvement |
|---|---|---|---|
| Training Loss | 12.0 | 2.55 | 79% ↓ |
| Validation Loss | 4.58 | 2.40 | 48% ↓ |
| Perplexity | - | 11.0 | - |

Usage

Installation

pip install transformers torch

Quick Start

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")
tokenizer = AutoTokenizer.from_pretrained("arnomatic/german-moe-gpt-v8-pretrained")

# Generate text
prompt = "Die Hauptstadt von Deutschland ist"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    temperature=0.8,
    top_k=50,
    top_p=0.9,
    do_sample=True
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Advanced Usage

# Generate with custom parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.7,          # Lower = more deterministic
    top_k=40,                 # Top-k sampling
    top_p=0.95,               # Nucleus sampling
    repetition_penalty=1.1,   # Reduce repetition
    do_sample=True
)

Technical Architecture

MoE Layer Design

The model uses a Noisy Top-k Router with the following components:

  1. Gate Computation: Learned routing weights per expert
  2. Noise Injection: Adds controlled noise during training for exploration
  3. Top-k Selection: Routes each token to the 2 best experts
  4. Capacity Management: Prevents expert overload with dynamic capacity limits
  5. Load Balancing: Auxiliary loss ensures uniform expert utilization
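
A minimal sketch of such a noisy top-k router, illustrating steps 1-3 and the logits that feed steps 4-5. Dimensions and names are assumptions, not the repository's exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    def __init__(self, hidden_size=768, num_experts=32, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)    # 1. gate logits
        self.noise = nn.Linear(hidden_size, num_experts, bias=False)   # 2. learned noise scale

    def forward(self, x):                          # x: (num_tokens, hidden_size)
        logits = self.gate(x)
        if self.training:                          # 2. noise injection for exploration
            logits = logits + torch.randn_like(logits) * F.softplus(self.noise(x))
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)            # 3. top-k selection
        weights = F.softmax(top_vals, dim=-1)      # combination weights for chosen experts
        # Capacity limits (4.) and the auxiliary losses (5.) are applied
        # downstream using `logits` and `top_idx`.
        return weights, top_idx, logits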

Loss Functions

The training loss combines three components:

L_total = L_ce + α * L_aux + β * L_z

where:

  • L_ce: Cross-entropy language modeling loss
  • L_aux: Load balance loss (α = 0.01) for uniform expert utilization
  • L_z: Router z-loss (β = 0.001) for numerical stability
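
A sketch of the two auxiliary terms, following the Switch Transformer load-balance loss and the ST-MoE router z-loss. The shapes and the top-1 approximation for token counts are assumptions made for illustration.

import torch
import torch.nn.functional as F

def aux_losses(router_logits, top_idx, num_experts=32, alpha=0.01, beta=0.001):
    # router_logits: (num_tokens, num_experts); top_idx: (num_tokens, top_k)
    probs = F.softmax(router_logits, dim=-1)
    # Load balance loss: fraction of tokens assigned to each expert (top-1 here)
    # times the mean routing probability per expert, scaled by num_experts.
    token_frac = F.one_hot(top_idx[:, 0], num_experts).float().mean(dim=0)
    prob_frac = probs.mean(dim=0)
    l_aux = num_experts * torch.sum(token_frac * prob_frac)
    # Router z-loss: penalizes large routing logits for numerical stability.
    l_z = torch.logsumexp(router_logits, dim=-1).pow(2).mean()
    return alpha * l_aux + beta * l_z

# L_total = L_ce + aux_losses(router_logits, top_idx)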

Attention Mechanism

  • RoPE (Rotary Position Embeddings) for position encoding
  • PyTorch SDPA with automatic backend selection (Flash Attention when available)
  • Causal masking for autoregressive generation
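
Illustratively, the attention call reduces to PyTorch's scaled_dot_product_attention with causal masking; RoPE is applied to the query and key tensors beforehand (the rotation itself is omitted here for brevity).

import torch
import torch.nn.functional as F

# q, k, v: (batch, num_heads, seq_len, head_dim) = (1, 12, 2048, 64),
# with RoPE already applied to q and k.
q = torch.randn(1, 12, 2048, 64)
k = torch.randn_like(q)
v = torch.randn_like(q)

# SDPA selects the best available backend (Flash Attention, memory-efficient,
# or math) automatically; is_causal=True applies the autoregressive mask.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)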

Optimizations

  • Gradient Checkpointing: ~40% VRAM reduction
  • Mixed Precision (BF16): 2x faster training
  • Weight Tying: LM head shares embeddings
  • Batch Expert Processing: Parallel computation for all experts
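
Two of these optimizations in sketch form, weight tying and gradient checkpointing. The module names and shapes are illustrative, not the repository's actual modules.

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Weight tying: the LM head reuses the embedding matrix instead of its own weights.
embed = nn.Embedding(128_256, 768)
lm_head = nn.Linear(768, 128_256, bias=False)
lm_head.weight = embed.weight

# Gradient checkpointing: activations inside the block are recomputed during
# the backward pass instead of being stored, trading compute for VRAM.
block = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
x = torch.randn(1, 2048, 768, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)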

Limitations and Biases

  • Language: Primarily trained on German text
  • Domain: General domain (Wikipedia, literature, subtitles)
  • Biases: May reflect biases present in training data
  • Context: Limited to 2048 tokens
  • Compute: Requires GPU for efficient inference

Ethical Considerations

This model is a language model and can generate text that may be:

  • Factually incorrect
  • Biased or stereotypical
  • Inappropriate or offensive

Users should:

  • Verify generated content for factual accuracy
  • Be aware of potential biases
  • Use appropriate content filtering for production applications

Citation

If you use this model in your research, please cite:

@misc{german-moe-gpt-v8,
  title={German MoE GPT v8: A Research-Grade Mixture-of-Experts Language Model},
  author={[Your Name]},
  year={2025},
  howpublished={\url{https://huggingface.co/arnomatic/german-moe-gpt-v8-pretrained}}
}

References

This implementation is based on:

  • ST-MoE: Zoph et al. (2022), "ST-MoE: Designing Stable and Transferable Sparse Expert Models"
  • Switch Transformer: Fedus et al. (2021), "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity"
  • RoPE: Su et al. (2021), "RoFormer: Enhanced Transformer with Rotary Position Embedding"

License

MIT License - See LICENSE file for details

Acknowledgments

  • HuggingFace Transformers team for the excellent framework
  • PyTorch team for SDPA and optimized operations
  • nanoGPT/nanoMoE community for inspiration

Model Card Contact

For questions or feedback, please open an issue in the GitHub repository.
