arxiv:2509.02753

LExI: Layer-Adaptive Active Experts for Efficient MoE Model Inference

Published on Sep 2

Authors:

Abstract

LExI, a data-free optimization technique, improves inference efficiency in MoE models by adaptively assigning the number of active experts per layer, outperforming traditional pruning methods with minimal accuracy loss.

AI-generated summary

Mixture-of-Experts (MoE) models scale efficiently by activating only a subset of experts per token, offering a computationally sparse alternative to dense architectures. While prior post-training optimizations, such as inter- and intra-expert pruning, reduce memory usage they provide limited gains in inference-time compute efficiency. Moreover, existing MoE architectures typically activate a fixed number of experts uniformly across all layers, resulting in redundant computation and suboptimal performance. In this work, we first demonstrate that MoE pruning strategies improve only the memory footprint but do not significantly improve inference performance on GPU using optimized frameworks such as vLLM. To address this, we introduce LExI, a data-free optimization technique that determines the optimal number of active experts per layer in a pretrained MoE model. LExI leverages only the model weights to estimate the relative importance of each layer and adaptively assigns the number of active experts accordingly per layer. Experiments on state-of-the-art language and vision MoE benchmarks demonstrate that LExI significantly outperforms traditional MoE pruning approaches in terms of inference efficiency with negligible accuracy loss. For example, using LExI, Qwen1.5-MoE achieves the same throughput on Nvidia H100 GPU with 10% better accuracy than traditional expert pruning.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2509.02753 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2509.02753 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2509.02753 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.