EAGLE3-Apertus-8B-Instruct-2509

An Eagle3 draft model for speculative decoding with swiss-ai/Apertus-8B-Instruct-2509.

Model Description

This is a lightweight draft model trained to accelerate inference of Apertus-8B-Instruct-2509 through speculative decoding. Eagle3 uses a single-decoder-layer draft architecture that predicts upcoming tokens by reusing the target model's hidden states; the target model then verifies the drafted tokens in parallel.

Architecture: LlamaForCausalLMEagle3
Hidden Size: 4096
Intermediate Size: 21504
Attention Heads: 32
KV Heads: 8
Layers: 1
Vocab Size: 131,072
Draft Vocab Size: 32,000
Precision: bfloat16
Parameters: ~513M
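
For orientation, these values correspond roughly to the following draft-model configuration. This is an illustrative Python sketch using common Hugging Face Llama config field names, not a copy of the repository's actual config.json:

# Illustrative only: draft-model configuration reconstructed from the table above.
# Field names follow common Hugging Face Llama conventions and may differ from the
# actual config.json shipped with this repository.
draft_config = {
    "architectures": ["LlamaForCausalLMEagle3"],
    "hidden_size": 4096,
    "intermediate_size": 21504,
    "num_attention_heads": 32,
    "num_key_value_heads": 8,
    "num_hidden_layers": 1,
    "vocab_size": 131072,        # full target vocabulary
    "draft_vocab_size": 32000,   # reduced vocabulary used by the draft head
    "torch_dtype": "bfloat16",
}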

Training Details

  • Framework: SpecForge
  • Target Model: swiss-ai/Apertus-8B-Instruct-2509
  • Epochs: 10
  • Batch Size: 1 per GPU
  • Learning Rate: 1e-4
  • Max Sequence Length: 4096
  • Hardware: 64 GPUs (16 nodes × 4 GPUs)
  • Precision: bfloat16

Training Data

The model was trained on ~375k samples of regenerated conversation data. The responses were regenerated using Apertus-8B-Instruct-2509 so that the draft model learns from the target model's own output distribution.

See: thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data
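
The regeneration pipeline itself is not part of this repository, but conceptually it can be reproduced with vLLM along the following lines. The file paths, field names, and sampling settings in this sketch are assumptions, not the exact pipeline that was used:

import json
from vllm import LLM, SamplingParams

# Illustrative sketch: regenerate assistant responses with the target model so that
# the draft model trains on the target's own output distribution.
# Paths, field names, and sampling settings are assumptions, not the exact pipeline used.
llm = LLM(model="swiss-ai/Apertus-8B-Instruct-2509")
sampling = SamplingParams(temperature=0.7, max_tokens=2048)

with open("prompts.jsonl") as f:
    prompts = [json.loads(line)["prompt"] for line in f]

# llm.chat applies the model's chat template before generating.
outputs = llm.chat([[{"role": "user", "content": p}] for p in prompts], sampling)

with open("regenerated.jsonl", "w") as f:
    for prompt, output in zip(prompts, outputs):
        record = {"conversations": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": output.outputs[0].text},
        ]}
        f.write(json.dumps(record) + "\n")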

Usage

With vLLM

VLLM_USE_V1=1 vllm serve swiss-ai/Apertus-8B-Instruct-2509 \
    --speculative-config '{"model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509", "num_speculative_tokens": 3, "method": "eagle3"}'
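
The server exposes vLLM's OpenAI-compatible API, so speculative decoding is transparent to clients. For example, assuming the default port 8000:

from openai import OpenAI

# Query the server started above; speculative decoding happens server-side.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)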

Or, using the offline Python API directly:

from vllm import LLM, SamplingParams

llm = LLM(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    speculative_config={
        "model": "thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509",
        "num_speculative_tokens": 3,
        "method": "eagle3",
    },
)

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello, how are you?"], sampling_params)
print(outputs[0].outputs[0].text)

With SGLang

python -m sglang.launch_server \
    --model swiss-ai/Apertus-8B-Instruct-2509 \
    --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509 \
    --speculative-num-steps 5 \
    --speculative-eagle-topk 8 \
    --speculative-num-draft-tokens 32
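
SGLang likewise serves an OpenAI-compatible endpoint (port 30000 by default; adjust if you pass --port), so it can be queried the same way:

from openai import OpenAI

# SGLang's OpenAI-compatible endpoint; the model name should match the served model path.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="swiss-ai/Apertus-8B-Instruct-2509",
    messages=[{"role": "user", "content": "Summarize speculative decoding in one sentence."}],
    max_tokens=128,
)
print(response.choices[0].message.content)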

Continue Training

To resume training from this checkpoint:

  1. Clone SpecForge
  2. Download the training dataset from thomaskiefer/EAGLE3-Apertus-8B-Instruct-2509-Data
  3. Download this checkpoint and place it in a subdirectory of your output directory (e.g., outputs/apertus-8b-eagle3/epoch_9_step_55000/)
  4. Run with --resume (it will automatically find the last checkpoint in --output-dir):
NUM_GPUS=4
TP_SIZE=1

torchrun \
    --standalone \
    --nproc_per_node $NUM_GPUS \
    scripts/train_eagle3.py \
    --target-model-path swiss-ai/Apertus-8B-Instruct-2509 \
    --draft-model-config /path/to/configs/apertus-8b-eagle3.json \
    --train-data-path /path/to/merged_train_regen.jsonl \
    --output-dir /path/to/outputs/apertus-8b-eagle3 \
    --num-epochs 15 \
    --batch-size 1 \
    --tp-size $TP_SIZE \
    --learning-rate 1e-4 \
    --max-length 4096 \
    --chat-template apertus \
    --cache-dir /path/to/cache \
    --target-model-backend sglang \
    --resume

The --resume flag uses get_last_checkpoint() to automatically find the most recent checkpoint in the output directory.
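
For reference, checkpoint discovery of this kind typically amounts to scanning the output directory for checkpoint subdirectories and picking the newest one. The sketch below only illustrates the idea and is not SpecForge's actual implementation:

import os
import re

# Illustrative sketch, not SpecForge's code: assumes checkpoint directories are
# named like "epoch_9_step_55000" inside the output directory.
def get_last_checkpoint(output_dir: str) -> str | None:
    pattern = re.compile(r"epoch_(\d+)_step_(\d+)")
    candidates = []
    for name in os.listdir(output_dir):
        match = pattern.fullmatch(name)
        if match and os.path.isdir(os.path.join(output_dir, name)):
            candidates.append((int(match.group(1)), int(match.group(2)), name))
    if not candidates:
        return None
    # The most recent checkpoint is the one with the highest (epoch, step).
    candidates.sort()
    return os.path.join(output_dir, candidates[-1][2])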

License

Apache 2.0

Citation

If you use this model, please cite Eagle3:

@article{li2025eagle3,
  title={EAGLE-3: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  author={Li, Yuhui and Wei, Fangyun and Zhang, Chao and Zhang, Hongyang},
  journal={arXiv preprint arXiv:2503.01840},
  year={2025}
}

Acknowledgments

Trained on the Alps supercomputer at CSCS (Swiss National Supercomputing Centre).
