Any-Depth Alignment: Unlocking Innate Safety Alignment of LLMs to Any-Depth
Abstract
Any-Depth Alignment (ADA) is an inference-time defense that enhances the safety of Large Language Models (LLMs) by reintroducing alignment tokens mid-stream, ensuring robust protection against adversarial attacks without altering the model's parameters.
Large Language Models (LLMs) exhibit strong but shallow alignment: they readily refuse harmful queries when a refusal is expected at the very start of an assistant turn, yet this protection collapses once a harmful continuation is underway, whether induced by adversarial prompt attacks or by harmful assistant-prefill attacks. This raises a fundamental question: Can the innate shallow alignment in LLMs be unlocked to ensure safety at arbitrary generation depths? To achieve this goal, we propose Any-Depth Alignment (ADA), an effective inference-time defense with negligible overhead. ADA builds on our observation that alignment is concentrated in the assistant-header tokens through their repeated use in shallow-refusal training, so these tokens carry the model's strongest alignment priors. By reintroducing these tokens mid-stream, ADA induces the model to reassess harmfulness and recover refusals at any point in generation. Across diverse open-source model families (Llama, Gemma, Mistral, Qwen, DeepSeek, and gpt-oss), ADA achieves robust safety performance without requiring any changes to the base model's parameters. It secures a near-100% refusal rate against challenging adversarial prefill attacks ranging from dozens to thousands of tokens. Furthermore, ADA reduces the average success rate of prominent adversarial prompt attacks (such as GCG, AutoDAN, PAIR, and TAP) to below 3%. This is all accomplished while preserving utility on benign tasks with minimal over-refusal. ADA maintains this resilience even after the base model undergoes subsequent instruction tuning (benign or adversarial).
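To make the mid-stream re-injection idea concrete, here is a minimal, hedged sketch using Hugging Face transformers: it reopens an assistant turn after a partial continuation and checks whether the model now refuses. The model name, header string, probe length, and keyword-based refusal test are illustrative assumptions for this sketch, not the authors' exact implementation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed chat model; any chat template works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

# Assistant-header ("Safety") tokens for the Llama 3 chat template;
# other model families use their own header strings.
HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"
header_ids = tok(HEADER, add_special_tokens=False, return_tensors="pt").input_ids

@torch.no_grad()
def rethink_refuses(prompt_ids: torch.Tensor, partial_ids: torch.Tensor) -> bool:
    """Reopen an assistant turn after the partial continuation and check
    whether the model now produces a refusal (crude keyword test)."""
    probe = torch.cat([prompt_ids, partial_ids, header_ids], dim=-1)
    out = model.generate(probe, max_new_tokens=16, do_sample=False)
    text = tok.decode(out[0, probe.shape[-1]:], skip_special_tokens=True)
    return any(kw in text.lower() for kw in ("i can't", "i cannot", "sorry"))
```

A runtime defense in the spirit of ADA would run such a check periodically during generation and replace the remaining continuation with a refusal whenever it fires.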
Community
In this paper, we propose a new alignment technique, Any-Depth Alignment (ADA), which unlocks a model's innate safety alignment so that it holds at any depth of generation. Our study demonstrates:
(1) New Alignment Failure with Deep Prefills. We introduce deep prefill attacks to examine whether models possess a generalizable notion of harmfulness that extends beyond a fixed context depth. Existing alignment strategies fail this test, both in the recent open-source OpenAI gpt-oss models and in strongly deep-aligned systems such as Claude Sonnet 4, whose refusal rates collapse under even slightly deeper prefills.
(2) "Rethinking" Generation (ADA(RK)). Re-injecting Safety Tokens mid-stream triggers a robust rethinking behavior that restores refusals. This generative defense is training-free and performs on par with, and often better than, deep alignment and self-reflection baselines.
(3) Unlocking Deeper Innate Alignment (ADA(LP)). We trace the rethinking phenomenon to the Safety Tokens, whose hidden states are highly separable for harmful content; ADA(LP) exploits this with a lightweight linear probe (see the sketch after this list). It is: (a) Effective, achieving near-100% refusal against deep prefills and reducing adversarial success from over 50% to under 3%; (b) Precise, with minimal over-refusal on benign tasks; and (c) Robust, maintaining performance even when the base model is fine-tuned.
(4) A General Phenomenon Across Diverse LLMs. The unlocking effect is ubiquitous: Safety Tokens related to the assistant header consistently expose a strong, linearly separable harmfulness signal across model families (Llama, Qwen, Mistral, Gemma, DeepSeek variants, gpt-oss), parameter scales, and core designs (dense, Mixture-of-Experts, and reasoning-centric).
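As a rough illustration of the probing view behind ADA(LP), the sketch below reads out the hidden state at the re-injected assistant-header position and fits a linear classifier on labeled harmful/benign continuations. The layer choice, the use of scikit-learn's LogisticRegression, and pooling over the last header token are assumptions for this sketch, not the paper's exact recipe.

```python
import torch
from sklearn.linear_model import LogisticRegression

HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"  # Llama 3 style header

@torch.no_grad()
def safety_token_features(model, tok, contexts, layer=-1):
    """Hidden state at the last re-injected header token for each context string."""
    feats = []
    for ctx in contexts:
        ids = tok(ctx + HEADER, return_tensors="pt").input_ids
        hidden = model(ids, output_hidden_states=True).hidden_states[layer]
        feats.append(hidden[0, -1].float().cpu().numpy())
    return feats

# Hypothetical labeled sets of (prompt + partial continuation) strings:
# X = safety_token_features(model, tok, harmful_ctxs + benign_ctxs)
# y = [1] * len(harmful_ctxs) + [0] * len(benign_ctxs)
# probe = LogisticRegression(max_iter=1000).fit(X, y)
# At inference, the probe is evaluated mid-stream on the re-injected header
# state, and generation is diverted to a refusal when it flags harm.
```

The point of the sketch is that the harmfulness signal at the Safety Token positions is linearly separable, so a very small probe suffices; no base-model parameters are changed.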
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MetaDefense: Defending Finetuning-based Jailbreak Attack Before and During Generation (2025)
- A2D: Any-Order, Any-Step Safety Alignment for Diffusion Language Models (2025)
- Beyond Surface Alignment: Rebuilding LLMs Safety Mechanism via Probabilistically Ablating Refusal Direction (2025)
- Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection (2025)
- Turning the Spell Around: Lightweight Alignment Amplification via Rank-One Safety Injection (2025)
- Embedding Poisoning: Bypassing Safety Alignment via Embedding Semantic Shift (2025)
- PromptSleuth: Detecting Prompt Injection via Semantic Intent Invariance (2025)