---
license: apache-2.0
datasets:
- Agent-Ark/Toucan-1.5M
language:
- en
base_model:
- Qwen/Qwen2.5-7B-Instruct
tags:
- agent
---

# 🦤 Toucan-1.5M

Toucan-1.5M is the largest fully synthetic tool-agent dataset to date, designed to advance tool use in agentic LLMs. It comprises over 1.5 million trajectories synthesized from 495 real-world Model Context Protocols (MCPs) spanning 2,000+ tools. By leveraging authentic MCP environments, Toucan-1.5M generates diverse, realistic, and challenging tasks that require using multiple tools, with trajectories involving real tool executions across multi-round, multi-turn, sequential, and parallel tool calls. Models fine-tuned on Toucan-1.5M outperform much larger closed-source counterparts on the BFCL V3 benchmark and extend the Pareto frontier on the MCP-Universe benchmark.

- 📄 [Technical Report](https://arxiv.org/abs/2510.01179) - Discover the methodology and technical details behind Toucan-1.5M
- 💾 [GitHub Repo](https://github.com/TheAgentArk/Toucan) - Access the complete pipeline used to produce Toucan-1.5M
- 🤗 [HF Dataset](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) - Full dataset
- 🤖 Model Checkpoints - [Qwen2.5-7B](https://huggingface.co/Agent-Ark/Toucan-Qwen2.5-7B-Instruct-v0.1) | [Qwen2.5-14B](https://huggingface.co/Agent-Ark/Toucan-Qwen2.5-14B-Instruct-v0.1) | [Qwen2.5-32B](https://huggingface.co/Agent-Ark/Toucan-Qwen2.5-32B-Instruct-v0.1)

![Toucan-Pipeline](https://cdn-uploads.huggingface.co/production/uploads/653df1323479e9ebbe3eb6cc/Dcz-NP1tfcJriku8FP2OT.jpeg)

## About This Model

This model is a fine-tuned variant of **Qwen2.5-7B-Instruct**, trained on a curated subset of the [Toucan-1.5M](https://huggingface.co/datasets/Agent-Ark/Toucan-1.5M) dataset. The supervised fine-tuning (SFT) subset consists of **119.3K instances** in total, including:

- **28.3K** from the original pipeline
- **40K** from Extension 1 (*Irrelevance*)
- **15.8K** from Extension 2 (*Diversify*)
- **35.2K** from Extension 3 (*Multi-Turn*)

We adopt the `Hermes` prompt template for fine-tuning. For a detailed description of the training setup and hyperparameters, please refer to our [technical report](https://arxiv.org/abs/2510.01179).

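In the Hermes convention, the model emits tool invocations as JSON wrapped in `<tool_call>` tags. The sketch below shows one way to extract such calls from a completion; it assumes the common `<tool_call>`/`</tool_call>` tag names and a `{"name": ..., "arguments": ...}` payload, so check the tokenizer's chat template for the exact format this checkpoint uses:

```python
import json
import re

# Matches a JSON object wrapped in Hermes-style <tool_call> tags.
# DOTALL lets the payload span multiple lines.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_tool_calls(completion: str) -> list[dict]:
    """Extract all tool-call JSON objects from a model completion."""
    return [json.loads(m) for m in TOOL_CALL_RE.findall(completion)]

# Example completion in the assumed format:
completion = (
    "Let me check the weather.\n"
    "<tool_call>\n"
    '{"name": "get_weather", "arguments": {"city": "Seattle"}}\n'
    "</tool_call>"
)

calls = parse_tool_calls(completion)
print(calls[0]["name"])       # get_weather
print(calls[0]["arguments"])  # {'city': 'Seattle'}
```

The parsed `name`/`arguments` pair can then be dispatched to the matching MCP tool, with the tool's result fed back to the model for the next turn.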
## 📚 Citation

If you find the data or code useful, please cite:

```bibtex
@misc{xu2025toucan,
      title={TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments},
      author={Zhangchen Xu and Adriana Meza Soria and Shawn Tan and Anurag Roy and Ashish Sunil Agrawal and Radha Poovendran and Rameswar Panda},
      year={2025},
      eprint={2510.01179},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2510.01179},
}
```

**Contact**: For questions, please contact [Zhangchen](mailto:[email protected]) by email.