RAMP: Reinforcement Adaptive Mixed-Precision Quantization for Efficient On-Device LLM Inference
Abstract
A reinforcement learning-based mixed-precision quantization method achieves superior compression efficiency and model performance for large language models through adaptive bit-width assignment and a novel scale-folding technique.
Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit-widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama-2-7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1–3% in quality. Critically, a policy trained only on Llama-2-7B generalizes zero-shot to Llama-2-13B and Mistral-7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
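The quality-prioritized reward with asymmetric penalties and a budget cliff described above might be sketched as follows; the coefficients, the 5% overshoot threshold, and the function shape are illustrative assumptions, not the paper's exact formulation:

```python
def ramp_reward(ppl: float, ppl_fp16: float, size_gb: float, budget_gb: float,
                quality_weight: float = 1.0, over_penalty: float = 10.0,
                cliff: float = -100.0) -> float:
    """Illustrative quality-prioritized reward (all constants are assumptions).

    Quality loss (perplexity increase over FP16) is penalized linearly;
    exceeding the memory budget is penalized asymmetrically, with a hard
    'cliff' reward once the overshoot becomes large.
    """
    if size_gb > budget_gb * 1.05:            # budget cliff beyond 5% overshoot
        return cliff
    quality = -quality_weight * (ppl - ppl_fp16)   # reward preserving quality
    over = max(size_gb - budget_gb, 0.0)           # only overshoot is penalized
    return quality - over_penalty * over           # asymmetric budget penalty
```

Under-budget allocations incur no size penalty at all, which biases the policy toward spending the full budget on quality rather than over-compressing.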
Community
We introduce RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework for learning per-layer bit-width allocations in LLM quantization.
RAMP treats quantization as a sequential decision problem: a policy assigns bit-widths under a global memory budget using an 11-dimensional state capturing activation statistics, weight properties, and structural features. This enables a single learned policy to generalize across models.
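A minimal sketch of how such a per-layer state embedding might be assembled; the post does not enumerate the 11 features, so the specific choices below (outlier percentile, normalized depth, etc.) are illustrative assumptions:

```python
import numpy as np

def layer_state(weights: np.ndarray, activations: np.ndarray,
                layer_idx: int, num_layers: int) -> np.ndarray:
    """Illustrative 11-dimensional per-layer state: activation statistics,
    weight properties, and structural descriptors (feature names are
    assumptions, not the paper's exact feature set)."""
    w, a = weights.ravel(), activations.ravel()
    feats = [
        a.mean(), a.std(), np.abs(a).max(),        # activation statistics
        np.percentile(np.abs(a), 99.9),            # activation outlier magnitude
        w.std(), np.abs(w).max(),                  # weight properties
        np.abs(w).max() / (w.std() + 1e-8),        # weight outlier ratio
        float(weights.size),                       # parameter count
        layer_idx / max(num_layers - 1, 1),        # normalized depth
        float(weights.shape[0]),                   # output dimension
        float(weights.shape[1]),                   # input dimension
    ]
    return np.array(feats, dtype=np.float32)

rng = np.random.default_rng(0)
s = layer_state(rng.normal(size=(64, 64)), rng.normal(size=(16, 64)), 3, 32)
print(s.shape)  # → (11,)
```

Because every feature is either a calibration statistic or a structural descriptor, rather than an index tied to one architecture, the same policy input is well defined for any transformer layer, which is what makes cross-model transfer plausible.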
Key results:
• Llama-2-7B: 5.54 perplexity at 3.68 GB (3.65 effective bits)
• Outperforms uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ in both size (6%) and quality (1–3%)
• Zero-shot transfer from Llama-2-7B to Llama-2-13B and Mistral-7B
We also introduce Scale Folding, a preconditioning method that improves stability in sub-4-bit regimes by redistributing activation outliers into weights.
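The core mechanism can be sketched as below. The α-controlled scale formula is an assumption in the style of activation-outlier migration (SmoothQuant-like); the post only states that outliers are moved into weights via per-channel scaling with normalization-layer compensation:

```python
import numpy as np

def scale_fold(weight: np.ndarray, act_absmax: np.ndarray,
               norm_gamma: np.ndarray, alpha: float = 0.5):
    """Migrate activation outliers into weights via per-channel scaling.

    weight:     (out, in) linear-layer weight
    act_absmax: (in,) per-channel activation |max| from a calibration set
    norm_gamma: (in,) gain of the preceding normalization layer
    The scale rule below is an illustrative assumption.
    """
    w_absmax = np.abs(weight).max(axis=0) + 1e-8
    s = np.maximum((act_absmax ** alpha) / (w_absmax ** (1 - alpha)), 1e-8)
    folded_weight = weight * s        # outlier energy absorbed into weights
    folded_gamma = norm_gamma / s     # compensated in the preceding norm layer
    return folded_weight, folded_gamma

# The transform is exact in full precision: (x / s) @ (W * s).T == x @ W.T,
# so only the quantization error distribution changes, not the function.
rng = np.random.default_rng(1)
W, x = rng.normal(size=(8, 4)), rng.normal(size=(2, 4))
Wf, gf = scale_fold(W, np.abs(x).max(axis=0), np.ones(4))
assert np.allclose(x @ W.T, (x * gf) @ Wf.T)
```

Flattening the activation range this way is what makes aggressive sub-4-bit weight grids viable: the quantizer no longer has to cover rare activation-driven outlier channels.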
Finally, the HALO pipeline exports learned allocations directly to GGUF, enabling kernel-free deployment across CPUs, GPUs, and edge devices while retaining ~99.5% of FP16 commonsense performance.
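Kernel-free deployment works because GGUF already ships kernels for a fixed menu of quant types, so exporting only requires rounding each learned allocation to an available type. The type names below are real ggml/GGUF quantization types, but the mapping and rounding rule are assumptions about how HALO might do this, not its documented behavior:

```python
# Illustrative mapping from learned per-layer bit-widths to GGUF quant types.
BIT_TO_GGUF = {2: "Q2_K", 3: "Q3_K", 4: "Q4_K", 5: "Q5_K", 6: "Q6_K", 8: "Q8_0"}

def export_allocation(bits_per_layer):
    """Round each learned (possibly fractional) bit-width UP to the nearest
    available GGUF type, so no layer is allocated less than the policy asked for."""
    avail = sorted(BIT_TO_GGUF)
    return [BIT_TO_GGUF[min(a for a in avail if a >= b)] for b in bits_per_layer]

print(export_allocation([2.8, 3.65, 4.0, 5.2]))
# → ['Q3_K', 'Q4_K', 'Q4_K', 'Q6_K']
```

Rounding up preserves quality at a small size cost; a budget-aware exporter could instead round some layers down to stay under the global target.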
Overall, results suggest that quantization sensitivity can be learned and transferred, rather than tuned per model.
Would be interested in feedback, especially on RL-based approaches to model compression and cross-model generalization.
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- D2Quant: Accurate Low-bit Post-Training Weight Quantization for LLMs (2026)
- TernaryLM: Memory-Efficient Language Modeling via Native 1-Bit Quantization with Adaptive Layer-wise Scaling (2026)
- NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models (2026)
- ScaleBITS: Scalable Bitwidth Search for Hardware-Aligned Mixed-Precision LLMs (2026)
- SERQ: Saliency-Aware Low-Rank Error Reconstruction for LLM Quantization (2026)
- 1-Bit Wonder: Improving QAT Performance in the Low-Bit Regime through K-Means Quantization (2026)
- BPDQ: Bit-Plane Decomposition Quantization on a Variable Grid for Large Language Models (2026)