Assisted decoding

Assisted decoding speeds up text generation by allowing a helper to propose candidate tokens before the main model commits to them. The main model verifies the candidate tokens in one forward pass. Because the helper is fast and cheap, a single verification pass can replace dozens of more expensive forward passes by the main model.

This guide covers assisted decoding methods in Transformers.

Speculative decoding

Speculative decoding uses a smaller assistant model to draft candidate tokens. The main model checks these tokens in one pass. Validated tokens enter the final output and rejected tokens trigger standard sampling. Generation is faster because the main model runs fewer expensive forward passes.

The method works best when the assistant model is significantly smaller than the main model and uses the same tokenizer. Speculative decoding supports greedy search and sampling but not batched inputs.
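The core accept/reject loop can be illustrated with a minimal greedy sketch. This is a toy version of the idea, not the library's implementation (which also manages KV caches, sampling, and dynamic draft lengths): the small model drafts a few tokens, one forward pass of the large model scores them, and the longest agreeing prefix is accepted along with one token from the main model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B", dtype="auto")
assistant_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M", dtype="auto")
ids = tokenizer("Hugging Face is an open-source company", return_tensors="pt").input_ids

num_draft = 5
for _ in range(4):  # a few draft-and-verify rounds
    # 1. The cheap assistant drafts up to `num_draft` tokens autoregressively.
    candidate = assistant_model.generate(ids, max_new_tokens=num_draft, do_sample=False)
    drafted = candidate[:, ids.shape[1]:]
    # 2. One forward pass of the main model scores every drafted position.
    with torch.no_grad():
        logits = model(candidate).logits
    preds = logits[:, ids.shape[1] - 1:].argmax(-1)  # main model's greedy pick per position
    # 3. Accept the longest prefix where both models agree, plus one main-model token.
    agree = (preds[:, :-1] == drafted)[0].long()
    n_ok = int(agree.cumprod(0).sum())
    ids = torch.cat([ids, drafted[:, :n_ok], preds[:, n_ok:n_ok + 1]], dim=-1)

print(tokenizer.decode(ids[0], skip_special_tokens=True))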

Pass assistant_model to generate(). Set do_sample=True to resample if token validation fails.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B", dtype="auto")
assistant_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M", dtype="auto")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt")

outputs = model.generate(**inputs, assistant_model=assistant_model)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Hugging Face is an open-source company that provides a platform for developers to build and deploy machine']
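
The example above uses greedy search. For the sampling variant, enable do_sample (the temperature value below is illustrative):

outputs = model.generate(**inputs, assistant_model=assistant_model, do_sample=True, temperature=0.7)
tokenizer.batch_decode(outputs, skip_special_tokens=True)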

The assistant_model argument is also available in the Pipeline API.

import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B",
    assistant_model="meta-llama/Llama-3.2-1B",
    dtype="auto",
)
pipe("Hugging Face is an open-source company, ", max_new_tokens=50, do_sample=False)

Prompt lookup decoding

Prompt lookup decoding doesn’t need an assistant model. It finds overlapping n-grams in the prompt to propose candidate tokens. If no match exists, it falls back to normal autoregressive decoding. This suits input-grounded tasks like summarization and translation because candidate tokens often mirror local patterns in the source text.
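
The proposal step can be sketched in a few lines. This toy function is a deliberate simplification, not the Transformers implementation: it matches the trailing n-gram against earlier tokens and proposes whatever followed the match.

def propose_candidates(input_ids, ngram_size=2, num_tokens=5):
    """Propose a continuation by matching the trailing n-gram earlier in the input."""
    tail = input_ids[-ngram_size:]
    # Scan backwards for the most recent earlier occurrence of the trailing n-gram.
    for start in range(len(input_ids) - ngram_size - 1, -1, -1):
        if input_ids[start:start + ngram_size] == tail:
            # Propose the tokens that followed that earlier occurrence.
            return input_ids[start + ngram_size:start + ngram_size + num_tokens]
    return []  # no match: fall back to normal autoregressive decoding

print(propose_candidates([5, 8, 3, 9, 2, 7, 8, 3]))  # [9, 2, 7, 8, 3]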

Pass prompt_lookup_num_tokens to generate(). This sets how many tokens the algorithm tries to copy from earlier in the prompt when it detects a repeated pattern.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B", dtype="auto")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt")

outputs = model.generate(**inputs, prompt_lookup_num_tokens=5)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Hugging Face is an open-source company that provides a platform for developers to build and deploy machine learning models. It offers a variety of tools']

Self-speculative decoding

Self-speculative decoding uses a model’s own intermediate layers as the assistant: the model exits early at an intermediate layer to draft candidate tokens cheaply, and the remaining layers then verify or correct those tokens in the full forward pass.

Because it’s all one model, weights and caches are shared, which boosts speed without extra memory overhead. This technique only works for models trained to support early-exit logits from intermediate layers.

Pass assistant_early_exit to generate() to set the exit layer.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/layerskip-llama3.2-1B")
model = AutoModelForCausalLM.from_pretrained("facebook/layerskip-llama3.2-1B", dtype="auto")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt")

outputs = model.generate(**inputs, assistant_early_exit=4, do_sample=False, max_new_tokens=20)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Universal assisted decoding

Universal assisted decoding (UAD) makes speculative decoding possible even when the main and assistant models have different tokenizers, letting you pair any small assistant model with the main model. The assistant’s candidate tokens are decoded back to text and re-encoded with the main model’s tokenizer, and the algorithm computes the longest common subsequence between the two encodings so the continuation stays aligned.
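
The re-encoding step UAD relies on can be sketched with plain tokenizer calls. This is illustrative only, not the internal API:

from transformers import AutoTokenizer

main_tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
assistant_tokenizer = AutoTokenizer.from_pretrained("double7/vicuna-68m")

# The assistant drafts tokens in its own vocabulary ...
draft_ids = assistant_tokenizer("Hugging Face is an open-source company").input_ids
# ... UAD decodes them back to text ...
text = assistant_tokenizer.decode(draft_ids, skip_special_tokens=True)
# ... and re-encodes that text with the main tokenizer before verification.
main_ids = main_tokenizer(text).input_ids
print(len(draft_ids), len(main_ids))  # the two vocabularies split the same text differently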

Pass tokenizer, assistant_tokenizer, and assistant_model to generate() to enable UAD.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

assistant_tokenizer = AutoTokenizer.from_pretrained("double7/vicuna-68m")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b", dtype="auto")
assistant_model = AutoModelForCausalLM.from_pretrained("double7/vicuna-68m", dtype="auto")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt")

outputs = model.generate(**inputs, assistant_model=assistant_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Hugging Face is an open-source company that is dedicated to creating a better world through technology.']
