EAGLE3-Apertus-8B-Instruct-2509

An Eagle3 draft model for speculative decoding with swiss-ai/Apertus-8B-Instruct-2509.

Model Description

This is a lightweight draft model trained to accelerate inference of Apertus-8B-Instruct-2509 through speculative decoding. Eagle3 uses a single-decoder-layer draft architecture that predicts upcoming tokens by reusing the target model's hidden states; the target model then verifies the drafted tokens in parallel.

Architecture: LlamaForCausalLMEagle3
Hidden Size: 4096
Intermediate Size: 21504
Attention Heads: 32
KV Heads: 8
Layers: 1
Vocab Size: 131,072
Draft Vocab Size: 32,000
Precision: bfloat16
Parameters: ~513M
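
For orientation, these values correspond roughly to the following draft-model configuration. This is an illustrative Python sketch using common Hugging Face Llama config field names, not a copy of the repository's actual config.json:

# Illustrative only: draft-model configuration reconstructed from the table above.
# Field names follow common Hugging Face Llama conventions and may differ from the
# actual config.json shipped with this repository.
draft_config = {
    "architectures": ["LlamaForCausalLMEagle3"],
    "hidden_size": 4096,
    "intermediate_size": 21504,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "num_hidden_layers": 1,
    "vocab_size": 131072,        # full target vocabulary
    "draft_vocab_size": 32000,   # reduced vocabulary used by the draft head
    "torch_dtype": "bfloat16",
}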

Training Details

  • Framework: SpecForge
  • Target Model: swiss-ai/Apertus-8B-Instruct-2509
  • Epochs: 10
  • Batch Size: 1 per GPU
  • Learning Rate: 1e-4
  • Max Sequence Length: 4096
  • Hardware: 64 GPUs (16 nodes × 4 GPUs)
  • Precision: bfloat16

Training Data

The model was trained on ~375k samples of regenerated conversation data. The responses were regenerated using Apertus-8B-Instruct-2509 so that the draft model learns from the target model's own output distribution.

See: thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data
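
The regeneration pipeline itself is not part of this repository, but conceptually it can be reproduced with vLLM along the following lines. The file paths, field names, and sampling settings in this sketch are assumptions, not the exact pipeline that was used:

import json
from vllm import LLM, SamplingParams

# Illustrative sketch: regenerate assistant responses with the target model so that
# the draft model trains on the target's own output distribution.
# Paths, field names, and sampling settings are assumptions, not the exact pipeline used.
llm = LLM(model="swiss-ai/Apertus-8B-Instruct-2509")
sampling = SamplingParams(temperature=0.7, max_tokens=2048)

with open("prompts.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f]

# llm.chat applies the model's chat template before generating.
outputs = llm.chat([[{"role": "user", "content": p}] for p in prompts], sampling)

with open("regenerated.jsonl", "w") as f:
    for prompt, output in zip(prompts, outputs):
        record = {"conversations": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": output.outputs[0].text},
        ]}
        f.write(json.dumps(record) + "\n")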

Usage

With vLLM

VLLM_USE_V1=1 vllm serve swiss-ai/Apertus-8B-Instruct-2509 \
    --speculative-config '{"model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509", "num_speculative_tokens": 3, "method": "eagle3"}'
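
The server exposes vLLM's OpenAI-compatible API, so speculative decoding is transparent to clients. For example, assuming the default port 8000:

from openai import OpenAI

# Query the server started above; speculative decoding happens server-side.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)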

Or, using the offline Python API directly:

from vllm import LLM, SamplingParams

llm = LLM(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    speculative_config={
        "model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509",
        "num_speculative_tokens": 3,
        "method": "eagle3",
    },
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)

With SGLang

python -m sglang.launch_server \
    --model swiss-ai/Apertus-8B-Instruct-2509 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 32
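
SGLang likewise serves an OpenAI-compatible endpoint (port 30000 by default; adjust if you pass --port), so it can be queried the same way:

from openai import OpenAI

# SGLang's OpenAI-compatible endpoint; the model name should match the served model path.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    messages=[{"role": "user", "content": "Summarize speculative decoding in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)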

Continue Training

To resume training from this checkpoint:

  1. Clone SpecForge
  2. Download the training dataset from thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data
  3. Download this checkpoint and place it in a subdirectory of your output directory (e.g., outputs/apertus-8b-eagle3/epoch_9_step_55000/)
  4. Run with --resume (it will automatically find the last checkpoint in --output-dir):
NUM_GPUS=4
TP_SIZE=1

torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    scripts/train_eagle3.py \
    --target-model-path swiss-ai/Apertus-8B-Instruct-2509 \
    --draft-model-config /path/to/configs/apertus-8b-eagle3.json \
    --train-data-path /path/to/merged_train_regen.jsonl \
    --output-dir /path/to/outputs/apertus-8b-eagle3 \
    --num-epochs 15 \
    --batch-size 1 \
    --tp-size $TP_SIZE \
    --learning-rate 1e-4 \
    --max-length 4096 \
    --chat-template apertus \
    --cache-dir /path/to/cache \
    --target-model-backend sglang \
    --resume

The --resume flag uses get_last_checkpoint() to automatically find the most recent checkpoint in the output directory.
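
For reference, checkpoint discovery of this kind typically amounts to scanning the output directory for checkpoint subdirectories and picking the newest one. The sketch below only illustrates the idea and is not SpecForge's actual implementation:

import os
import re

# Illustrative sketch, not SpecForge's code: assumes checkpoint directories are
# named like "epoch_9_step_55000" inside the output directory.
def get_last_checkpoint(output_dir: str) -> str | None:
    pattern = re.compile(r"epoch_(\d+)_step_(\d+)")
    candidates = []
    for name in os.listdir(output_dir):
        match = pattern.fullmatch(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), int(match.group(2)), name))
    if not candidates:
        return None
    # The most recent checkpoint is the one with the highest (epoch, step).
    candidates.sort()
    return os.path.join(output_dir, candidates[-1][2])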

License

Apache 2.0

Citation

If you use this model, please cite Eagle3:

@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}

Acknowledgments

Trained on the Alps supercomputer at CSCS (Swiss National Supercomputing Centre).
