# openPangu-Ultra-MoE-718B-V1.1-Int8

English | [中文](README.md)

## 1. Introduction

The openPangu-Ultra-MoE-718B-V1.1 is a large-scale mixture-of-experts language model trained on Ascend NPUs, with a total parameter count of 718B and 39B parameters activated per token. It is equipped with the capability to switch between fast and slow thinking. Compared to the earlier [[openPangu-Ultra-MoE-718B-V1.0](https://ai.gitcode.com/ascend-tribe/openpangu-ultra-moe-718b-model/blob/main/README_EN.md)] version, V1.1 greatly improves agentic tool-use ability and lowers the hallucination rate, and it is also stronger on other general and reasoning tasks.

**openPangu-Ultra-MoE-718B-V1.1-Int8 is a quantized version of [[openPangu-Ultra-MoE-718B-V1.1](https://ai.gitcode.com/ascend-tribe/openPangu-Ultra-MoE-718B-V1.1/blob/main/README_EN.md)] using dynamic per-token quantization, which reduces device memory usage by approximately half, increases throughput by 20%, and keeps the accuracy loss under 1%.**
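To make the term concrete, below is a minimal, illustrative sketch of what symmetric dynamic per-token Int8 quantization means: each token's values get their own scale, computed on the fly from that row's maximum absolute value, so no offline calibration is required. This is a conceptual example only, not the actual quantization kernel used by the model.

```python
import numpy as np

def quantize_per_token(x: np.ndarray):
    """Symmetric dynamic per-token Int8 quantization (illustrative).

    x: float values of shape [num_tokens, hidden_dim].
    Each token row gets its own scale, computed at runtime
    (hence "dynamic"), mapping its max |value| to the int8 range.
    """
    scales = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    scales = np.maximum(scales, 1e-12)  # avoid division by zero on all-zero rows
    q = np.clip(np.round(x / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize_per_token(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    # Recover approximate float values from int8 codes and per-token scales.
    return q.astype(np.float32) * scales

x = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_per_token(x)
x_hat = dequantize_per_token(q, s)
print(np.abs(x - x_hat).max())  # small per-token rounding error
```

Because each Int8 value occupies one byte instead of the two bytes of FP16/BF16, storage drops to roughly half, which is consistent with the memory figure quoted above.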
## 2. Model Architecture

The architecture of openPangu-Ultra-MoE-718B-V1.1-Int8 adopts the mainstream Multi-head Latent Attention (MLA), Multi-Token Prediction (MTP), and high MoE sparsity, and features several distinctive designs:

- Depth-Scaled Sandwich-Norm and TinyInit: these techniques adjust the layer-normalization structure and parameter initialization for improved training stability.
- EP-Group load-balancing loss: this technique optimizes the load-balancing loss, achieving better expert specialization.

## 3. Inference Usage

For inference of openPangu-Ultra-MoE-718B-V1.1-Int8 with Omni-Infer, refer to the [[Omni-Infer User Guide](doc/omniinfer_for_openPangu-Ultra-MoE-718B-V1.1-Int8_EN.md)].

## 4. Function Call Usage Examples

So far, the open-sourced Omni-Infer engine supports Function Call; support for the vllm_ascend engine will be released soon.

```python
import requests

# Define tools as a list in JSON format; the MCP protocol is also supported.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "description": "Get the current weather information of the specified city, including temperature, humidity, wind speed and other data.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City name, for example: 'Beijing', 'Shenzhen'. Supports Chinese or Pinyin input."
                    },
                    "date": {
                        "type": "string",
                        "description": "Query date in the format YYYY-MM-DD (compliant with ISO 8601). For example: '2023-10-01'."
                    }
                },
                "required": ["location", "date"],
                "additionalProperties": False
            }
        }
    }
]

messages = [
    # Customize the system prompt; set it to empty when not needed.
    {"role": "system", "content": "You are the Pangu model developed by Huawei. \nIt is October 13, 2025"},
    {"role": "user", "content": "How is the weather in Shenzhen the day after tomorrow?"}
]

headers = {'Content-Type': 'application/json'}
api_url = "xxxxxxxx"  # endpoint of your inference service

payload = {
    "model": "pangu_ultra_moe",
    "messages": messages,
    "tools": tools,
    "chat_template_kwargs": {
        # Switch between fast and slow thinking: False means fast thinking;
        # it is True (slow thinking) by default.
        "think": False,
        # Whether to use the default system prompt for tool usage;
        # it is True by default.
        "mcp_prompt": True
    }
}

api_response = requests.post(api_url, headers=headers, json=payload)

# Parse the returned message.
choice = api_response.json()["choices"][0]
reasoning_response = choice['message']['reasoning_content']
response = choice['message']['content']
tool_calls = choice['message']['tool_calls']
```

**`chat_template_kwargs` parameters for switching between fast and slow thinking and for tool usage:**

- think: whether to turn on slow thinking; it is set to True by default.
- mcp_prompt: whether to use the default system prompt in Function Call mode; it is set to True by default. If it is True and a list of tools is passed, the customized system prompt is appended to the default system prompt for Function Call mode.

For an illustrative sketch of completing the tool-call round trip (executing the returned call and sending the result back to the model), see Section 8.

## 5. Model License

Unless otherwise noted, the openPangu-Ultra-MoE-718B-V1.1-Int8 model is licensed under the terms and conditions of the OPENPANGU MODEL LICENSE AGREEMENT VERSION 1.0, which is intended to be used permissively and to enable the further development of artificial intelligence technologies. Please refer to the [LICENSE](LICENSE) file located in the root directory of the model repository for details.

## 6. Disclaimer

Due to the technical limitations inherent in the technology on which openPangu-Ultra-MoE-718B-V1.1-Int8 ("Model") relies, and the fact that the artificial-intelligence-generated content is produced automatically by the Model, Huawei cannot make any guarantees regarding the following matters:

- The output of this Model is automatically generated via AI algorithms; it cannot be ruled out that some of the information may be flawed, unreasonable, or cause discomfort, and the generated content does not represent Huawei's attitude or standpoint.
- There is no guarantee that this Model is 100% accurate, reliable, functional, timely, secure, safe, error-free, uninterrupted, continuously stable, or free of any faults.
- The output of this Model does not constitute any advice or decision for you, and it does not guarantee the authenticity, completeness, accuracy, timeliness, legality, functionality, or practicality of the generated content. The generated content cannot replace professionals in medicine, law, and other fields in answering your questions. The generated content is for your reference only and does not represent any attitude, standpoint, or position of Huawei. You need to make independent judgments based on your actual situation, and Huawei does not assume any responsibilities.

## 7. Contact

If you have any questions, please raise an issue or contact us at [openPangu@huawei.com](mailto:openPangu@huawei.com).
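## 8. Appendix: Completing a Tool Call (Illustrative)

The example in Section 4 stops after parsing `tool_calls` from the first response. The sketch below shows the rest of a typical round trip, assuming the endpoint follows the common OpenAI-compatible chat format: the returned call is executed locally and its result is sent back as a `tool` message so the model can compose a final text answer. The `get_current_weather` implementation and the exact message fields here are illustrative assumptions, not part of the official API.

```python
import json
import requests

def get_current_weather(location: str, date: str) -> dict:
    # Hypothetical local implementation; replace with a real weather lookup.
    return {"location": location, "date": date, "weather": "sunny", "temperature_c": 26}

# `messages`, `headers`, `api_url`, `payload`, and `tool_calls`
# continue from the Section 4 example above.
for call in tool_calls:
    # OpenAI-compatible format (an assumption): arguments arrive as a JSON string.
    args = json.loads(call["function"]["arguments"])
    result = get_current_weather(**args)

    # Echo the assistant's tool call, then append the tool result.
    messages.append({"role": "assistant", "content": "", "tool_calls": [call]})
    messages.append({
        "role": "tool",
        "tool_call_id": call.get("id"),
        "content": json.dumps(result, ensure_ascii=False)
    })

# Second request: the model now sees the tool output and answers in plain text.
payload["messages"] = messages
final = requests.post(api_url, headers=headers, json=payload)
print(final.json()["choices"][0]["message"]["content"])
```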