# LLaDA2.0-flash-CAP
LLaDA2.0-flash-CAP is an enhanced version of LLaDA2.0-flash that incorporates Confidence-Aware Parallel (CAP) Training for significantly improved inference efficiency. Built upon the 100B-A6B Mixture-of-Experts (MoE) diffusion architecture, this model achieves faster parallel decoding while maintaining strong performance across diverse benchmarks.
## Performance Comparison

### Efficiency vs. Quality Trade-off
| Model | Average Score | Tokens/Forward (TPF) | Speedup |
|---|---|---|---|
| LLaDA2.0-flash | 78.57 | 3.19 | 1.0× |
| LLaDA2.0-flash-CAP | 76.85 | 4.65 | 1.46× |
Evaluated on 12 diverse benchmarks covering knowledge, reasoning, coding, and mathematics.
### Key Insights
- 1.46× faster generation with only a 1.72-point drop in average score
- Ideal for latency-sensitive applications requiring real-time responses
- Maintains competitive accuracy across all task categories
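The speedup figure follows directly from the tokens-per-forward (TPF) column; a quick sanity check of the numbers above:

```python
# Quick check of the reported numbers (values copied from the table above).
tpf_base, tpf_cap = 3.19, 4.65
score_base, score_cap = 78.57, 76.85

print(f"speedup   ≈ {tpf_cap / tpf_base:.2f}x")         # ≈ 1.46x
print(f"score gap = {score_base - score_cap:.2f} pts")  # 1.72 points
```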
## What is CAP Training?
Confidence-Aware Parallel (CAP) Training is a novel training technique designed to enhance parallel decoding efficiency in diffusion language models.
### Technical Overview
The training objective combines two complementary losses:
L(θ) = L_SFT(θ) + λ · L_conf(θ)
Where:
- L_SFT: Supervised fine-tuning loss ensuring prediction correctness
- L_conf: Confidence loss that minimizes entropy only for correctly predicted tokens
- λ: Hyperparameter balancing the two objectives
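The released training code is not reproduced here, but the objective can be illustrated with a minimal PyTorch sketch; the function name, masking convention, and default `lambda_conf` below are illustrative assumptions rather than the actual implementation:

```python
import torch
import torch.nn.functional as F


def cap_loss(logits, targets, mask_positions, lambda_conf=0.1):
    """Sketch of L(θ) = L_SFT(θ) + λ·L_conf(θ) for a masked diffusion LM.

    logits:         (batch, seq, vocab) predictions at every position
    targets:        (batch, seq) ground-truth token ids
    mask_positions: (batch, seq) bool, True where the token was masked out
    lambda_conf:    λ, weight of the confidence term (assumed value)
    """
    vocab = logits.size(-1)
    flat_logits = logits.reshape(-1, vocab)
    flat_targets = targets.reshape(-1)
    flat_mask = mask_positions.reshape(-1).float()

    # L_SFT: cross-entropy over the masked (to-be-predicted) positions.
    ce = F.cross_entropy(flat_logits, flat_targets, reduction="none")
    l_sft = (ce * flat_mask).sum() / flat_mask.sum().clamp(min=1.0)

    # L_conf: entropy of the predictive distribution, applied only where the
    # argmax prediction is already correct, so confidence is sharpened
    # without penalizing genuinely uncertain positions.
    probs = flat_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp(min=1e-9).log()).sum(dim=-1)
    correct = (flat_logits.argmax(dim=-1) == flat_targets).float() * flat_mask
    l_conf = (entropy * correct).sum() / correct.sum().clamp(min=1.0)

    return l_sft + lambda_conf * l_conf
```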
### Why CAP Works
- Sharpens Correct Predictions: While standard training ensures correctness, it provides diminishing incentive to increase confidence on already-correct tokens. CAP explicitly optimizes for high-confidence predictions.
- Enables Aggressive Parallelism: Higher confidence allows the model to decode multiple tokens simultaneously with greater reliability, reducing the total number of forward passes needed.
- Selective Optimization: By focusing only on correct predictions, CAP avoids penalizing the model's exploration of uncertain outputs.
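The decoding side of this idea can be sketched as confidence-thresholded parallel unmasking; the threshold value and helper below are illustrative assumptions (the released `generate` method handles this internally):

```python
import torch


@torch.no_grad()
def parallel_unmask_step(logits, token_ids, mask_token_id, threshold=0.9):
    """One illustrative decoding step: commit every masked position whose
    top-1 probability clears `threshold`, falling back to the single most
    confident masked position so each forward pass always makes progress."""
    probs = logits.softmax(dim=-1)         # (seq, vocab)
    conf, pred = probs.max(dim=-1)         # per-position confidence / argmax
    masked = token_ids == mask_token_id

    accept = masked & (conf >= threshold)  # positions decoded in parallel
    if masked.any() and not accept.any():
        # Nothing cleared the threshold: commit the single best masked slot.
        best = torch.where(masked, conf, torch.full_like(conf, -1.0)).argmax()
        accept[best] = True

    return torch.where(accept, pred, token_ids)
```

The sharper the model's correct predictions, the more positions clear the threshold in each pass, which is how CAP translates into the higher tokens-per-forward reported above.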
## Model Variants
| Model ID | Description | Hugging Face Link |
|---|---|---|
| `inclusionAI/LLaDA2.0-flash-CAP` | CAP-enhanced model optimized for fast inference | Model Card |
| `inclusionAI/LLaDA2.0-flash` | Base instruction-tuned model | Model Card |
## Model Overview
LLaDA2.0-flash-CAP inherits the architecture of LLaDA2.0-flash:
- Type: Mixture-of-Experts (MoE) Diffusion Language Model
- Total Parameters (Non-Embedding): 100B
- Number of Layers: 32
- Attention Heads: 32
- Context Length: 32,768 tokens
- Position Embedding: Rotary (RoPE)
- Vocabulary Size: 157,184
- Training Enhancement: Confidence-Aware Parallel (CAP) Training
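These figures can be cross-checked against the checkpoint's configuration; the field names below follow common `transformers` conventions and are assumptions that may differ in this custom MoE architecture:

```python
from transformers import AutoConfig

# Loading only the config is cheap and does not download the 100B weights.
config = AutoConfig.from_pretrained(
    "inclusionAI/LLaDA2.0-flash-CAP", trust_remote_code=True
)

# Field names assumed from standard transformers configs; custom architectures
# may expose different keys.
for key in ("num_hidden_layers", "num_attention_heads",
            "vocab_size", "max_position_embeddings"):
    print(key, getattr(config, key, "n/a"))
```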
## Usage
### Hugging Face Transformers
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "/path/to/LLaDA2.0-flash-CAP"
device = "cuda:0"
model = AutoModelForCausalLM.from_pretrained(
model_path, trust_remote_code=True, device_map=device
)
model = model.to(torch.bfloat16)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
prompt = "Why does Camus think that Sisyphus is happy?"
input_ids = tokenizer.apply_chat_template(
[{"role": "user", "content": prompt}],
add_generation_prompt=True,
tokenize=True,
return_tensors="pt",
).to(device)  # move the prompt tokens onto the same device as the model
generated_tokens = model.generate(
inputs=input_ids,
eos_early_stop=True,
gen_length=512,
block_length=32,
steps=32,
temperature=0.0,
)
generated_answer = tokenizer.decode(
generated_tokens[0],
skip_special_tokens=True,
)
print(generated_answer)
```
### Best Practices
To achieve optimal performance, we recommend the following settings:
- Sampling Parameters:
  We suggest using `temperature=0.0`, `block_length=32`, and `steps=32`. Using a higher temperature value may occasionally result in language mixing and a slight decrease in model performance.
- Adequate Output Length:
  We recommend using an output length of 32,768 tokens for most queries.
## License
This project is licensed under the terms of the Apache License 2.0.
## Contact & Collaboration
For questions, collaborations, or feedback, please reach out via Hugging Face or open an issue in the repository.
Join us in advancing open, efficient, and intelligent language models!