Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
Abstract
Any-Depth Alignment (ADA) is an inference-time defense that enhances the safety of Large Language Models (LLMs) by reintroducing alignment tokens mid-stream, ensuring robust protection against adversarial attacks without altering the model's parameters.
Large Language Models (LLMs) exhibit strong but shallow alignment: they readily refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway, whether induced by adversarial prompt attacks or by harmful assistant-prefill attacks. This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA builds on our observation that alignment is concentrated in the assistant-header tokens through their repeated use in shallow-refusal training, so these tokens carry the model's strongest alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model's parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving utility on benign tasks with minimal over-refusal. ADA maintains this resilience even after the base model undergoes subsequent instruction tuning (benign or adversarial).
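To make the mid-stream re-injection idea concrete, here is a minimal, hedged sketch using Hugging Face transformers: it reopens an assistant turn after a partial continuation and checks whether the model now refuses. The model name, header string, probe length, and keyword-based refusal test are illustrative assumptions for this sketch, not the authors' exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed chat model; any chat template works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# Assistant-header ("Safety") tokens for the Llama 3 chat template;
# other model families use their own header strings.
HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"
header_ids = tok(HEADER, add_special_tokens=False, return_tensors="pt").input_ids

@torch.no_grad()
def rethink_refuses(prompt_ids: torch.Tensor, partial_ids: torch.Tensor) -> bool:
    """Reopen an assistant turn after the partial continuation and check
    whether the model now produces a refusal (crude keyword test)."""
    probe = torch.cat([prompt_ids, partial_ids, header_ids], dim=-1)
    out = model.generate(probe, max_new_tokens=16, do_sample=False)
    text = tok.decode(out[0, probe.shape[-1]:], skip_special_tokens=True)
    return any(kw in text.lower() for kw in ("i can't", "i cannot", "sorry"))
```

A runtime defense in the spirit of ADA would run such a check periodically during generation and replace the remaining continuation with a refusal whenever it fires.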
Community
In this paper, we propose a new alignment technique, Any-Depth Alignment (ADA), which unlocks a model's innate safety alignment so that it holds at any depth of generation. Our study demonstrates:
(1) New Alignment Failure with Deep Prefills. We introduce deep prefill attacks to examine whether models possess a generalizable notion of harmfulness that extends beyond a fixed context depth. Existing alignment strategies fail this test, both in the recent open-source OpenAI gpt-oss models and in strongly deep-aligned systems such as Claude Sonnet 4, whose refusal rates collapse under even slightly deeper prefills.
(2) "Rethinking" Generation (ADA(RK)). Re-injecting Safety Tokens mid-stream triggers a robust rethinking behavior that restores refusals. This generative defense is training-free and performs on par with, and often better than, deep alignment and self-reflection baselines.
(3) Unlocking Deeper Innate Alignment (ADA(LP)). We trace the rethinking phenomenon to the Safety Tokens, whose hidden states are highly separable for harmful content; ADA(LP) exploits this with a lightweight linear probe (see the sketch after this list). It is: (a) Effective, achieving near-100% refusal against deep prefills and reducing adversarial success from over 50% to under 3%; (b) Precise, with minimal over-refusal on benign tasks; and (c) Robust, maintaining performance even when the base model is fine-tuned.
(4) A General Phenomenon Across Diverse LLMs. The unlocking effect is ubiquitous: Safety Tokens related to the assistant header consistently expose a strong, linearly separable harmfulness signal across model families (Llama, Qwen, Mistral, Gemma, DeepSeek variants, gpt-oss), parameter scales, and core designs (dense, Mixture-of-Experts, and reasoning-centric).
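As a rough illustration of the probing view behind ADA(LP), the sketch below reads out the hidden state at the re-injected assistant-header position and fits a linear classifier on labeled harmful/benign continuations. The layer choice, the use of scikit-learn's LogisticRegression, and pooling over the last header token are assumptions for this sketch, not the paper's exact recipe.

```python
import torch
from sklearn.linear_model import LogisticRegression

HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"  # Llama 3 style header

@torch.no_grad()
def safety_token_features(model, tok, contexts, layer=-1):
    """Hidden state at the last re-injected header token for each context string."""
    feats = []
    for ctx in contexts:
        ids = tok(ctx + HEADER, return_tensors="pt").input_ids
        hidden = model(ids, output_hidden_states=True).hidden_states[layer]
        feats.append(hidden[0, -1].float().cpu().numpy())
    return feats

# Hypothetical labeled sets of (prompt + partial continuation) strings:
# X = safety_token_features(model, tok, harmful_ctxs + benign_ctxs)
# y = [1] * len(harmful_ctxs) + [0] * len(benign_ctxs)
# probe = LogisticRegression(max_iter=1000).fit(X, y)
# At inference, the probe is evaluated mid-stream on the re-injected header
# state, and generation is diverted to a refusal when it flags harm.
```

The point of the sketch is that the harmfulness signal at the Safety Token positions is linearly separable, so a very small probe suffices; no base-model parameters are changed.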
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation (2025)
- A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models (2025)
- Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction (2025)
- Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection (2025)
- Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection (2025)
- Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift (2025)
- PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance (2025)