moonshotai/Kimi-Linear-48B-A3B-Base

@@ -1,17 +1,26 @@
 ---
 license: mit
 ---
 <div align="center">
-  <a href="https://github.com/MoonshotAI/Kimi-Linear/blob/master/tech_report.pdf"><img width="80%" src="figures/banner.png"></a>
 </div>
 <div align="center">
-  <a href="https://github.com/MoonshotAI/Kimi-Linear/blob/master/tech_report.pdf" ><img src="figures/logo.png" height="16" width="16" style="display: inline-block; vertical-align: middle; margin: 2px;"><b style="display: inline-block;"> Tech Report</b></a>  |
   <a href="https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Base"><img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" height="16" width="16" style="display: inline-block; vertical-align: middle; margin: 2px;"><b style="display: inline-block;"> HuggingFace</b></a>
 </div>
 <div align="center">
-  <img width="90%" src="figures/perf_speed.png">
   <p><em><b>(a)</b> On MMLU-Pro (4k context length), Kimi Linear achieves 51.0 performance with similar speed as full attention. On RULER (128k context length), it shows Pareto-optimal performance (84.3) and 3.98x speedup. <b>(b)</b> Kimi Linear achieves 6.3x faster TPOT compared to MLA, offering significant speedups at long sequence lengths (1M tokens).</em></p>
 </div>
@@ -38,7 +47,7 @@ We open-source the KDA kernel in [FLA](https://github.com/fla-org/flash-linear-a
 - **High Throughput:** Achieves up to $6\times$ faster decoding and significantly reduces time per output token (TPOT).
 <div align="center">
-  <img width="60%" src="figures/arch.png">
 </div>
 ## Usage
@@ -94,14 +103,16 @@ vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
   --trust-remote-code
 ```
-### Citation
-If you found our work useful, please cite
 ```bibtex
-@article{kimi2025kda,
-  title  = {Kimi Linear: An Expressive, Efficient Attention Architecture},
-  author = {kimi Team},
-  year   = {2025},
-  url    = {https://github.com/MoonshotAI/Kimi-Linear/blob/master/tech_report.pdf}
 }
 ```
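
The image changes in this diff all follow one pattern: a repo-relative `figures/...` path becomes an absolute `resolve/main` URL on the Instruct repository. A minimal sketch of that rewrite is below. The helper name `absolutize_figures` and the regex over `src="figures/..."` attributes are assumptions for illustration, not part of the PR.

```python
import re

# Base URL copied from the "+" lines of this PR (assumed to stay stable).
BASE = "https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct/resolve/main/"


def absolutize_figures(readme: str, base: str = BASE) -> str:
    """Rewrite src="figures/..." attributes to absolute URLs under `base`.

    Hypothetical helper: only src values that start with the relative
    "figures/" prefix are touched, so already-absolute URLs stay unchanged.
    """
    return re.sub(
        r'src="(figures/[^"]+)"',
        lambda m: f'src="{base}{m.group(1)}"',
        readme,
    )


before = '<img width="60%" src="figures/arch.png">'
print(absolutize_figures(before))
```

Because the pattern requires `figures/` immediately after the opening quote, running the helper a second time on its own output leaves the text unchanged.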

 ---
 license: mit
+pipeline_tag: text-generation
+library_name: transformers
 ---
+# Kimi Linear: An Expressive, Efficient Attention Architecture
+This model is presented in the paper [Kimi Linear: An Expressive, Efficient Attention Architecture](https://huggingface.co/papers/2510.26692).
+The official code can be found at: [https://github.com/MoonshotAI/Kimi-Linear](https://github.com/MoonshotAI/Kimi-Linear)
 <div align="center">
+  <a href="https://huggingface.co/papers/2510.26692"><img width="80%" src="https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct/resolve/main/figures/banner.png"></a>
 </div>
 <div align="center">
+  <a href="https://huggingface.co/papers/2510.26692" ><img src="https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct/resolve/main/figures/logo.png" height="16" width="16" style="display: inline-block; vertical-align: middle; margin: 2px;"><b style="display: inline-block;"> Paper</b></a>  |
+  <a href="https://github.com/MoonshotAI/Kimi-Linear"><img src="https://img.shields.io/badge/Github-Code-blue.svg?logo=github&style=flat-square" height="16" style="display: inline-block; vertical-align: middle; margin: 2px;"><b style="display: inline-block;"> Code</b></a> |
   <a href="https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Base"><img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg" height="16" width="16" style="display: inline-block; vertical-align: middle; margin: 2px;"><b style="display: inline-block;"> HuggingFace</b></a>
 </div>
 <div align="center">
+  <img width="90%" src="https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct/resolve/main/figures/perf_speed.png">
   <p><em><b>(a)</b> On MMLU-Pro (4k context length), Kimi Linear achieves 51.0 performance with similar speed as full attention. On RULER (128k context length), it shows Pareto-optimal performance (84.3) and 3.98x speedup. <b>(b)</b> Kimi Linear achieves 6.3x faster TPOT compared to MLA, offering significant speedups at long sequence lengths (1M tokens).</em></p>
 </div>
 - **High Throughput:** Achieves up to $6\times$ faster decoding and significantly reduces time per output token (TPOT).
 <div align="center">
+  <img width="60%" src="https://huggingface.co/moonshotai/Kimi-Linear-48B-A3B-Instruct/resolve/main/figures/arch.png">
 </div>
 ## Usage
   --trust-remote-code
 ```
+## Citation
+If you found our work useful, please cite:
 ```bibtex
+@misc{team2025kimi,
+    title         = {Kimi Linear: An Expressive, Efficient Attention Architecture},
+    author        = {Zhang, Yu  and Lin, Zongyu  and Yao, Xingcheng  and Hu, Jiaxi  and Meng, Fanqing  and Liu, Chengyin  and Men, Xin  and Yang, Songlin  and Li, Zhiyuan  and Li, Wentao  and Lu, Enzhe  and Liu, Weizhou  and Chen, Yanru  and Xu, Weixin  and Yu, Longhui  and Wang, Yejie  and Fan, Yu  and Zhong, Longguang  and Yuan, Enming  and Zhang, Dehao  and Zhang, Yizhi  and T. Liu, Y.  and Wang, Haiming  and Fang, Shengjun  and He, Weiran  and Liu, Shaowei  and Li, Yiwei  and Su, Jianlin  and Qiu, Jiezhong  and Pang, Bo  and Yan, Junjie  and Jiang, Zhejun  and Huang, Weixiao  and Yin, Bohong  and You, Jiacheng  and Wei, Chu  and Wang, Zhengtao  and Hong, Chao  and Chen, Yutian  and Chen, Guanduo  and Wang, Yucheng  and Zheng, Huabin  and Wang, Feng  and Liu, Yibo  and Dong, Mengnan  and Zhang, Zheng  and Pan, Siyuan  and Wu, Wenhao  and Wu, Yuhao  and Guan, Longyu  and Tao, Jiawen  and Fu, Guohong  and Xu, Xinran  and Wang, Yuzhi  and Lai, Guokun  and Wu, Yuxin  and Zhou, Xinyu  and Yang, Zhilin  and Du, Yulun},
+    year          = {2025},
+    eprint        = {2510.26692},
+    archivePrefix = {arXiv},
+    primaryClass  = {cs.CL}
 }
 ```
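
The front-matter additions (`pipeline_tag`, `library_name`) can be sanity-checked without any Hub tooling. The sketch below assumes flat `key: value` front matter, as in this card. `read_front_matter` is a hypothetical helper, not a Hugging Face API, and it does not handle nested YAML.

```python
def read_front_matter(card: str) -> dict[str, str]:
    """Parse flat `key: value` pairs between the leading '---' fences."""
    lines = card.strip().splitlines()
    if not lines or lines[0].strip() != "---":
        return {}  # no front matter at the top of the card
    meta: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing fence ends the metadata block
            break
        key, sep, value = line.partition(":")
        if sep:  # skip lines that are not key: value pairs
            meta[key.strip()] = value.strip()
    return meta


# Front matter as it appears on the new side of this PR.
card = """---
license: mit
pipeline_tag: text-generation
library_name: transformers
---
# Kimi Linear: An Expressive, Efficient Attention Architecture
"""
meta = read_front_matter(card)
print(meta)
```

A reviewer could use a check like this to confirm that both new keys are present before merging. A real YAML parser would be the safer choice for cards with lists or nested fields.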