LLaDA2.0-flash-CAP

LLaDA2.0-flash-CAP is a variant of LLaDA2.0-flash trained with Confidence-Aware Parallel (CAP) Training to improve inference efficiency. Built on the 100B-A6B Mixture-of-Experts (MoE) diffusion architecture, it decodes more tokens per forward pass (4.65 vs. 3.19) while staying within 1.72 points of the base model's average score across 12 benchmarks.

📊 Performance Comparison

Efficiency vs. Quality Trade-off

| Model | Average Score | Tokens/Forward (TPF) | Speedup |
|---|---|---|---|
| LLaDA2.0-flash | 78.57 | 3.19 | 1.0× |
| LLaDA2.0-flash-CAP | 76.85 | 4.65 | 1.46× |

Evaluated on 12 diverse benchmarks covering knowledge, reasoning, coding, and mathematics.

Key Insights

1.46× more tokens per forward pass, at a cost of 1.72 points in average score (about 2.2% relative to the base model)
Ideal for latency-sensitive applications requiring real-time responses
Maintains competitive accuracy across all task categories
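
The figures in the comparison table can be checked with a few lines of arithmetic. The sketch below recomputes the speedup as the ratio of the two TPF values and reports the quality gap both in absolute points and relative to the base model. The variable names are illustrative only.

```python
# Values copied from the comparison table above.
base_score, cap_score = 78.57, 76.85
base_tpf, cap_tpf = 3.19, 4.65

# Speedup is the ratio of tokens decoded per forward pass.
speedup = cap_tpf / base_tpf

# Quality gap, in absolute points and relative to the base model.
abs_drop = base_score - cap_score
rel_drop = abs_drop / base_score * 100

print(f"speedup: {speedup:.2f}x")
print(f"score drop: {abs_drop:.2f} points ({rel_drop:.2f}% relative)")
```

Note that the speedup is measured in tokens per forward pass; wall-clock gains depend on the serving setup.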

🔬 What is CAP Training?

Confidence-Aware Parallel (CAP) Training is a novel training technique designed to enhance parallel decoding efficiency in diffusion language models.

Technical Overview

The training objective combines two complementary losses:

L(θ) = L_SFT(θ) + λL_conf(θ)

Where:

L_SFT: Supervised fine-tuning loss ensuring prediction correctness
L_conf: Confidence loss that minimizes entropy only for correctly predicted tokens
λ: Hyperparameter balancing the two objectives
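
To make the objective concrete, here is a minimal, framework-free sketch of how the two terms could be combined for a handful of masked positions. It is not the training code used for this model: the function name `cap_loss`, the averaging of each term, and the use of the argmax to decide which tokens count as "correctly predicted" are all assumptions made for illustration.

```python
import math

def cap_loss(probs, targets, lam=1.0):
    """Toy CAP objective (illustrative, not the authors' implementation).

    probs:   per-position probability distributions (each list sums to 1)
    targets: ground-truth token index for each position
    lam:     weight on the confidence term (the λ in the formula)
    """
    sft_terms, conf_terms = [], []
    for p, y in zip(probs, targets):
        # L_SFT: cross-entropy of the ground-truth token.
        sft_terms.append(-math.log(p[y]))
        # L_conf: entropy, counted only where the argmax is already correct.
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted == y:
            conf_terms.append(-sum(q * math.log(q) for q in p if q > 0))
    l_sft = sum(sft_terms) / len(sft_terms)
    # Assumption: average over correct positions; zero if there are none.
    l_conf = sum(conf_terms) / len(conf_terms) if conf_terms else 0.0
    return l_sft + lam * l_conf, l_sft, l_conf

# Position 0 is predicted correctly, position 1 is not.
probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
targets = [0, 2]
total, l_sft, l_conf = cap_loss(probs, targets, lam=0.5)
print(f"L_SFT={l_sft:.4f}  L_conf={l_conf:.4f}  L={total:.4f}")
```

Only position 0 contributes to the confidence term here, so the entropy penalty pushes its distribution to become sharper without touching the position the model still gets wrong.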

Why CAP Works

Sharpens Correct Predictions: While standard training ensures correctness, it provides diminishing incentive to increase confidence on already-correct tokens. CAP explicitly optimizes for high-confidence predictions.
Enables Aggressive Parallelism: Higher confidence allows the model to decode multiple tokens simultaneously with greater reliability, reducing the total number of forward passes needed.
Selective Optimization: By focusing only on correct predictions, CAP avoids penalizing the model's exploration of uncertain outputs.
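
The link between confidence and forward passes can be illustrated with a toy simulation. The sketch below assumes a simple threshold rule: in each forward pass, every still-masked position whose confidence meets the threshold is committed, and at least one token is always committed. This rule and the helper names are assumptions for illustration and are not taken from the model's `generate` implementation; a real decoder would also recompute confidences after every pass, whereas here they are held fixed.

```python
def select_tokens(confidences, masked, threshold):
    """Pick the masked positions to commit in one forward pass."""
    confident = [i for i in masked if confidences[i] >= threshold]
    if confident:
        return confident
    # Always commit the single most confident token so decoding progresses.
    return [max(masked, key=lambda i: confidences[i])]

def count_forward_passes(confidences, threshold=0.9):
    """Count passes needed to fill a block under the threshold rule."""
    masked = set(range(len(confidences)))
    passes = 0
    while masked:
        for i in select_tokens(confidences, masked, threshold):
            masked.discard(i)
        passes += 1
    return passes

# Same block length, different confidence profiles (made-up numbers).
hesitant = [0.55, 0.92, 0.60, 0.70]
sharp = [0.95, 0.97, 0.91, 0.70]

for name, conf in [("hesitant", hesitant), ("sharp", sharp)]:
    passes = count_forward_passes(conf, threshold=0.9)
    print(f"{name}: {passes} passes, TPF = {len(conf) / passes:.2f}")
```

With sharper predictions more positions clear the threshold in the same pass, so the block finishes in fewer passes and tokens per forward pass goes up.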

📦 Model Variants

| Model ID | Description | Hugging Face Link |
|---|---|---|
| `inclusionAI/LLaDA2.0-flash-CAP` | CAP-enhanced model optimized for fast inference | 🤗 Model Card |
| `inclusionAI/LLaDA2.0-flash` | Base instruction-tuned model | 🤗 Model Card |

🔍 Model Overview

LLaDA2.0-flash-CAP inherits the architecture of LLaDA2.0-flash:

Type: Mixture-of-Experts (MoE) Diffusion Language Model
Total Parameters (Non-Embedding): 100B
Number of Layers: 32
Attention Heads: 32
Context Length: 32,768 tokens
Position Embedding: Rotary (RoPE)
Vocabulary Size: 157,184
Training Enhancement: Confidence-Aware Parallel (CAP) Training
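
For capacity planning, a rough lower bound on weight memory follows from the parameter count. The sketch below assumes the weights are held in BF16 (2 bytes per parameter), as in the usage example, and counts only the 100B non-embedding parameters; embeddings, activations, and any caches come on top, so treat the result as a floor rather than a requirement.

```python
non_embedding_params = 100e9  # from the overview above
bytes_per_param = 2           # BF16 stores each weight in 2 bytes

# Lower bound on weight memory, in GiB (2**30 bytes).
weight_gib = non_embedding_params * bytes_per_param / 2**30
print(f"weights alone: at least {weight_gib:.0f} GiB")
```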

💻 Usage

🤗 Hugging Face Transformers

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/LLaDA2.0-flash-CAP"
device = "cuda:0"

# trust_remote_code=True loads the model's custom code from the repository.
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, device_map=device
)
model = model.to(torch.bfloat16)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Format the prompt with the chat template and tokenize it.
prompt = "Why does Camus think that Sisyphus is happy?"
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
)
# Keep the inputs on the same device as the model.
input_ids = input_ids.to(device)

# Generate with the recommended settings (see Best Practices).
generated_tokens = model.generate(
    inputs=input_ids,
    eos_early_stop=True,
    gen_length=512,
    block_length=32,
    steps=32,
    temperature=0.0,
)
generated_answer = tokenizer.decode(
    generated_tokens[0],
    skip_special_tokens=True,
)
print(generated_answer)

Best Practices

To achieve optimal performance, we recommend the following settings:

Sampling Parameters: We suggest temperature=0.0, block_length=32, and steps=32. A higher temperature may occasionally cause language mixing and a slight drop in model performance.
Adequate Output Length: We recommend an output length of 32,768 tokens for most queries.

🌐 License

This project is licensed under the terms of the Apache License 2.0.

🤝 Contact & Collaboration

For questions, collaborations, or feedback, please reach out via Hugging Face or open an issue in the repository.

👉 Join us in advancing open, efficient, and intelligent language models!
