# t5-moe-xsum_hash

## Model Description
T5 with MoE (Hash Routing) fine-tuned on the XSUM dataset for abstractive summarization.
## Architecture
This model uses a sparse Mixture of Experts with deterministic, hash-based routing; a minimal sketch of the routing idea follows the feature list.
Key Features:
- Hash-based routing (deterministic, no learned gating)
- 8 expert networks per layer
- Efficient and reproducible routing
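Hash routing assigns each token to an expert with a fixed function of its token id, so the routing decision is reproducible across runs and needs no gating parameters. The following is a minimal sketch of the idea, not the exact implementation used here: the class name, the modulo "hash", and the single-expert routing are assumptions (the actual model uses top-k = 2).

```python
import torch
import torch.nn as nn

class HashRoutedMoE(nn.Module):
    """Minimal sketch of a hash-routed MoE feed-forward block."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 8):
        super().__init__()
        self.num_experts = num_experts
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, hidden_states: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, d_model); token_ids: (batch, seq_len)
        # Deterministic "hash": the token id modulo the number of experts.
        expert_ids = token_ids % self.num_experts
        output = torch.zeros_like(hidden_states)
        for i, expert in enumerate(self.experts):
            mask = expert_ids == i          # tokens routed to expert i
            if mask.any():
                output[mask] = expert(hidden_states[mask])
        return output
```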
## Model Configuration
- Base Model: google-t5/t5-small
- Total Parameters: 236,667,392
- Trainable Parameters: 236,667,392
- Number of Experts: 8
- Top-k: 2
- Routing Strategy: hash
- Load Balancing: Enabled
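For reference, the same settings as a plain Python dictionary; this layout is hypothetical, and the actual configuration object in the repository may use different field names.

```python
moe_config = {
    "base_model": "google-t5/t5-small",
    "num_experts": 8,
    "top_k": 2,
    "routing": "hash",        # deterministic, no learned gate
    "load_balancing": True,
}
```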
## Training Data
The model was trained on the XSUM dataset, which contains:
- ~204k training examples
- ~11k validation examples
- ~11k test examples
Each example consists of a BBC news article and a one-sentence summary.
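XSUM is available on the Hugging Face Hub; a minimal loading sketch follows (recent `datasets` versions may additionally require `trust_remote_code=True` for this dataset).

```python
from datasets import load_dataset

# Load the three XSUM splits
dataset = load_dataset("EdinburghNLP/xsum")

example = dataset["train"][0]
print(example["document"][:200])  # BBC article text
print(example["summary"])         # one-sentence reference summary
```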
## Usage
```python
from transformers import T5Tokenizer

# Load the tokenizer
tokenizer = T5Tokenizer.from_pretrained("YOUR_USERNAME/t5-moe-xsum_hash")

# Note: for MoE models you need to reconstruct the architecture;
# see the model repository for detailed loading instructions.
```
## Evaluation
Evaluate using standard ROUGE metrics and SummaC consistency scores.
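ROUGE can be computed with the `evaluate` library; SummaC consistency scoring requires the separate `summac` package. A minimal ROUGE sketch with placeholder strings:

```python
import evaluate

rouge = evaluate.load("rouge")
scores = rouge.compute(
    predictions=["Model-generated summary goes here."],
    references=["Reference one-sentence summary from XSUM."],
)
print(scores)  # rouge1 / rouge2 / rougeL / rougeLsum F-measures
```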
## Training Procedure
The model was trained using:
- AdamW optimizer with weight decay
- Learning rate: 5e-5
- Warmup steps: 500
- Mixed precision (FP16) training
- Gradient accumulation for larger effective batch size
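As a rough illustration, these settings map onto `Seq2SeqTrainingArguments` as sketched below; the batch size, accumulation steps, epoch count, and weight-decay value are assumptions, not the exact values used.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="t5-moe-xsum_hash",
    learning_rate=5e-5,
    warmup_steps=500,
    weight_decay=0.01,                 # assumed value
    fp16=True,                         # mixed precision
    per_device_train_batch_size=8,     # assumed value
    gradient_accumulation_steps=4,     # larger effective batch size (assumed)
    num_train_epochs=3,                # assumed value
)
```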
## Limitations
- Trained only on English news articles
- May not generalize well to other domains
- MoE models require custom loading code
## Citation
If you use this model, please cite the XSUM dataset:
```bibtex
@inproceedings{narayan-etal-2018-dont,
    title = "Don{'}t Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization",
    author = "Narayan, Shashi and Cohen, Shay B. and Lapata, Mirella",
    booktitle = "Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
    year = "2018",
}
```