dongguanting nielsr HF Staff committed on
Commit ea3d670 · verified · 1 Parent(s): 270178b

Enhance model card with metadata and details (#1)


- Enhance model card with metadata and details (262cf614c65f25bac2e99a1e8267ff8a82de8f3c)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +102 -4
README.md CHANGED
@@ -1,8 +1,106 @@
- The model checkpoint of ARPO:
- Arxiv: https://arxiv.org/abs/2507.19849
- HF paper: https://huggingface.co/papers/2507.19849
- Github: https://github.com/dongguanting/ARPO
+ ---
+ license: mit
+ pipeline_tag: text-generation
+ library_name: transformers
+ ---
+
+ <div align="center">
+ <img src="https://github.com/dongguanting/ARPO/blob/main/logo1.png" width="150px">
+ </div>
+
+ <h1 align="center" style="margin-top: -50px;">✨ Agentic Reinforced Policy Optimization (ARPO)</h1>
+
+ <div align="center">
+
+ [![Paper](https://img.shields.io/badge/Paper-arXiv-b5212f.svg?logo=arxiv)](https://arxiv.org/abs/2507.19849)
+ [![Paper](https://img.shields.io/badge/Paper-Hugging%20Face-yellow?logo=huggingface)](https://huggingface.co/papers/2507.19849)
+ [![Model](https://img.shields.io/badge/Model-Hugging%20Face-blue?logo=huggingface)](https://huggingface.co/collections/dongguanting/arpo-688229ff8a6143fe5b4ad8ae)
+ [![Dataset](https://img.shields.io/badge/Dataset-Hugging%20Face-blue?logo=huggingface)](https://huggingface.co/collections/dongguanting/arpo-688229ff8a6143fe5b4ad8ae)
+ [![License](https://img.shields.io/badge/LICENSE-MIT-green.svg)](https://opensource.org/licenses/MIT)
+ [![Python 3.10+](https://img.shields.io/badge/Python-3.10+-blue.svg)](https://www.python.org/downloads/release/python-390/)
+ [![X (formerly Twitter) URL](https://img.shields.io/twitter/url?url=https%3A%2F%2Fx.com%2FKevin_GuoweiXu%2Fstatus%2F1858338565463421244)]()
+ </div>
+
+ **Agentic Reinforced Policy Optimization (ARPO)** is a novel agentic RL algorithm tailored for training multi-turn Large Language Model (LLM)-based agents. It addresses the challenge of balancing LLMs' intrinsic long-horizon reasoning capabilities with their proficiency in multi-turn tool interactions.
+
+ ## 💡 Overview
+
+ We propose **Agentic Reinforced Policy Optimization (ARPO)**, **an agentic RL algorithm tailored for training multi-turn LLM-based agents**. The core principle of ARPO is to encourage the policy model to adaptively branch its sampling during high-entropy tool-call rounds, thereby efficiently aligning step-level tool-use behaviors (a minimal illustrative sketch of this branching rule follows the figure notes below).
+
+ <img width="1686" height="866" alt="intro" src="https://github.com/user-attachments/assets/8b9daf54-c4ba-4e79-bf79-f98b5a893edd" />
+
+ - In the figure (left), the initial tokens generated by the LLM after receiving **each round of tool-call feedback consistently exhibit high entropy**. This indicates that external tool calls **introduce significant uncertainty into the LLM’s reasoning process**.
+
+ - In the figure (right), we validate ARPO's performance **across 13 datasets**. Notably, Qwen3-14B with ARPO excelled in Pass@5, **achieving 61.2% on GAIA and 24.0% on HLE**, while requiring only about **half the tool calls** used by GRPO during training.
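+
+ To make the core idea concrete, here is a minimal, illustrative sketch of the entropy-based branching rule described above. It is not the official ARPO training code: the entropy threshold, branch count, and input format are assumptions chosen purely for illustration.
+
+ ```python
+ import math
+
+ def token_entropy(probs):
+     """Shannon entropy (in nats) of a single next-token probability distribution."""
+     return -sum(p * math.log(p) for p in probs if p > 0)
+
+ def branch_factor(post_tool_token_probs, entropy_threshold=1.0, num_branches=4):
+     """Decide how many rollout branches to spawn after a tool-call round.
+
+     post_tool_token_probs: probability distributions for the first few tokens
+     generated right after the tool response (illustrative input format).
+     """
+     mean_entropy = sum(token_entropy(p) for p in post_tool_token_probs) / len(post_tool_token_probs)
+     # High uncertainty after the tool call -> expand into several partial rollouts;
+     # otherwise keep a single trajectory, as in standard trajectory-level sampling.
+     return num_branches if mean_entropy > entropy_threshold else 1
+
+ # A peaked (confident) distribution keeps one trajectory; a flat (uncertain) one branches.
+ print(branch_factor([[0.97, 0.01, 0.01, 0.01]]))  # -> 1
+ print(branch_factor([[0.25, 0.25, 0.25, 0.25]]))  # -> 4
+ ```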
+
+ ## 🏃 Quick Start
+
+ This section provides a basic example of how to perform inference with an ARPO-trained model using the `transformers` library. For more detailed instructions on training and evaluation, please refer to the [official GitHub repository](https://github.com/dongguanting/ARPO).
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
+
+ # Load the model and tokenizer
+ # Replace "dongguanting/Llama3.1-8B-ARPO" with the specific ARPO checkpoint you want to use.
+ model_name = "dongguanting/Llama3.1-8B-ARPO"  # Example ARPO model
+ model = AutoModelForCausalLM.from_pretrained(
+     model_name,
+     torch_dtype=torch.bfloat16,  # Use bfloat16 for better performance on compatible hardware
+     device_map="auto",
+     trust_remote_code=True  # Required for custom modeling code, if applicable
+ ).eval()
+ tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
+
+ # Set the generation configuration based on the model's generation_config.json
+ model.generation_config = GenerationConfig.from_pretrained(
+     model_name,
+     temperature=0.6,
+     top_p=0.9,
+     do_sample=True,
+     eos_token_id=[128001, 128008, 128009],  # From special_tokens_map.json and generation_config.json
+     pad_token_id=tokenizer.eos_token_id,  # Common practice for LLMs
+ )
+
+ # Prepare messages using the chat template (e.g., Llama 3.1 or similar)
+ messages = [
+     {"role": "system", "content": "You are a helpful AI assistant."},
+     {"role": "user", "content": "What is the capital of France?"}
+ ]
+
+ # Apply the chat template and tokenize the input
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ input_ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
+
+ # Generate a response
+ with torch.no_grad():
+     output_ids = model.generate(input_ids, max_new_tokens=256)
+
+ # Decode and print the generated text, excluding the input prompt
+ response = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True).strip()
+
+ print(f"Assistant: {response}")
+ ```
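+
+ ARPO checkpoints are trained for multi-turn tool use, so a full agentic rollout alternates between model generation and tool execution. The loop below is only a schematic of that pattern, reusing `model`, `tokenizer`, and `text` from the snippet above: the `<tool_call>`/`<result>` tags and the `run_tool` helper are placeholders rather than the actual tool-calling format, which is defined in the [official GitHub repository](https://github.com/dongguanting/ARPO).
+
+ ```python
+ import re
+
+ def run_tool(query: str) -> str:
+     # Placeholder: a real agent would call a search engine or code interpreter here.
+     return f"[tool output for: {query}]"
+
+ prompt = text  # chat-templated prompt from the snippet above
+ for _ in range(4):  # cap the number of tool-call rounds
+     step_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
+     with torch.no_grad():
+         out = model.generate(step_ids, max_new_tokens=512)
+     step = tokenizer.decode(out[0][step_ids.shape[1]:], skip_special_tokens=True)
+     match = re.search(r"<tool_call>(.*?)</tool_call>", step, re.DOTALL)  # placeholder tag
+     if match is None:
+         print(step)  # no tool call detected: treat this as the final answer
+         break
+     # Append the model's step and the tool result, then let the model keep reasoning.
+     prompt += step + f"\n<result>{run_tool(match.group(1).strip())}</result>\n"
+ ```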
+
+ ## 📄 Citation
+
+ If you find this work helpful, please cite our paper:
+ ```bibtex
+ @misc{dong2025arpo,
+       title={Agentic Reinforced Policy Optimization},
+       author={Guanting Dong and Hangyu Mao and Kai Ma and Licheng Bao and Yifei Chen and Zhongyuan Wang and Zhongxia Chen and Jiazhen Du and Huiyang Wang and Fuzheng Zhang and Guorui Zhou and Yutao Zhu and Ji-Rong Wen and Zhicheng Dou},
+       year={2025},
+       eprint={2507.19849},
+       archivePrefix={arXiv},
+       primaryClass={cs.LG},
+       url={https://arxiv.org/abs/2507.19849},
+ }
+ ```
+
+ ## 🤝 Acknowledgements
+
+ This training implementation builds upon [Tool-Star](https://github.com/dongguanting/Tool-Star), [Llama Factory](https://github.com/hiyouga/LLaMA-Factory), [verl](https://github.com/volcengine/verl), and [ReCall](https://github.com/Agent-RL/ReCall). For evaluation, we rely on [WebThinker](https://github.com/RUC-NLPIR/WebThinker), [HIRA](https://github.com/RUC-NLPIR/HiRA), [WebSailor](https://github.com/Alibaba-NLP/WebAgent), [Search-o1](https://github.com/sunnynexus/Search-o1), and [FlashRAG](https://github.com/RUC-NLPIR/FlashRAG). The Python interpreter design references [ToRA](https://github.com/microsoft/ToRA) and [ToRL](https://github.com/GAIR-NLP/ToRL), while our models are trained using [Qwen2.5](https://qwenlm.github.io/blog/qwen2.5/). We express our sincere gratitude to these projects for their invaluable contributions to the open-source community.
+
+ ## 📞 Contact
+
+ For any questions or feedback, please reach out to us at [[email protected]]([email protected]).