Update README.md

README.md CHANGED
@@ -21,12 +21,25 @@ pipeline_tag: text-classification

## 🔥 Highlights

**Skywork-Reward-V2** is a series of eight reward models designed for versatility across a wide range of tasks, trained on a mixture of 26 million carefully curated preference pairs. While the Skywork-Reward-V2 series remains based on the Bradley-Terry (BT) model (see the sketch of the BT objective after the list below), we push the boundaries of training data scale and quality to achieve superior performance. Compared to the first generation of Skywork-Reward, the Skywork-Reward-V2 series offers the following major improvements:

- **Trained on a significantly larger and higher-quality preference data mixture**, consisting of **26 million preference pairs** curated via a large-scale human-LLM synergistic pipeline.
- **State-of-the-art performance on seven major reward model benchmarks**, including RewardBench v1, RewardBench v2, PPE Preference, PPE Correctness, RMB, RM-Bench, and JudgeBench.
- **Available in eight models across multiple sizes**, with the smallest 0.6B variant, *Skywork-Reward-V2-Qwen3-0.6B*, nearly matching the average performance of our previous best model, Skywork-Reward-Gemma-2-27B-v0.2. The largest 8B version, *Skywork-Reward-V2-Llama-3.1-8B*, surpasses all existing reward models across all benchmarks on average. Our top experimental model, *Skywork-Reward-V2-Llama-3.1-8B-40M*, **outperforms all existing reward models on every benchmark**.
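
For reference, this is the standard Bradley-Terry objective that BT reward models of this kind are typically trained with; the notation below is a generic sketch on our part, not an excerpt from the Skywork-Reward-V2 report. Given a prompt $x$ with chosen response $y^{+}$ and rejected response $y^{-}$, the scalar reward model $r_{\theta}$ minimizes

$$
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y^{+},\, y^{-}) \sim \mathcal{D}} \left[ \log \sigma\left( r_{\theta}(x, y^{+}) - r_{\theta}(x, y^{-}) \right) \right]
$$

where $\sigma$ is the logistic function. The implied probability of preferring $y^{+}$ over $y^{-}$ is $\sigma\left(r_{\theta}(x, y^{+}) - r_{\theta}(x, y^{-})\right)$, so at inference time it suffices to rank responses by their scalar rewards.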

| Model | Base Model | Link |
|:-----------------------------------|:--------------------------------------------------------------------------------------------|:------------------------------------------------------------------------------------:|
| Skywork-Reward-V2-Llama-3.1-8B     | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B)     |
| Skywork-Reward-V2-Llama-3.1-8B-40M | [meta-llama/Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.1-8B-40M) |
| Skywork-Reward-V2-Llama-3.2-1B     | [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.2-1B)     |
| Skywork-Reward-V2-Llama-3.2-3B     | [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Llama-3.2-3B)     |
| Skywork-Reward-V2-Qwen3-0.6B       | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)                                   | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-0.6B)       |
| Skywork-Reward-V2-Qwen3-1.7B       | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B)                                   | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-1.7B)       |
| Skywork-Reward-V2-Qwen3-4B         | [Qwen/Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)                                       | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-4B)         |
| Skywork-Reward-V2-Qwen3-8B         | [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B)                                       | [🤗 Hugging Face](https://huggingface.co/Skywork/Skywork-Reward-V2-Qwen3-8B)         |

For the complete collection of models, please refer to the [Skywork-Reward-V2](https://huggingface.co/collections/Skywork/skywork-reward-v2-685cc86ce5d9c9e4be500c84) collection.
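
As a quick way to try any of these checkpoints, here is a minimal scoring sketch. It is our own example rather than an official snippet: it assumes the models load as standard single-logit sequence classifiers via `AutoModelForSequenceClassification` (the usual setup for BT reward models), and the prompt and responses are invented.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any model from the table above should work; the 8B variant is shown here.
# device_map="auto" requires `accelerate`; adjust dtype/device for your hardware.
model_name = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "What is the capital of France?"
responses = [
    "The capital of France is Paris.",  # expected to score higher
    "The capital of France is Lyon.",   # expected to score lower
]

scores = []
for response in responses:
    conversation = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    # Render the conversation with the model's chat template, then score it.
    input_ids = tokenizer.apply_chat_template(
        conversation, tokenize=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        scores.append(model(input_ids).logits[0][0].item())

# A BT reward model only guarantees a meaningful *relative* ordering,
# so compare scores rather than reading them as absolute quality.
print(scores)
```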

## 📊 Evaluation

In the following table, we categorize the models into two types: Bradley-Terry (BT) reward models and Generative reward models. The Skywork-Reward-V2 series outperforms models in both categories despite its much smaller model sizes.
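
Before the results, a brief note on scoring: for BT models, a preference pair typically counts as correct when the chosen response receives the strictly higher reward. The helper below is a hypothetical illustration of that convention, not code from any benchmark's harness.

```python
# Hypothetical illustration of pairwise preference accuracy for BT reward
# models: a pair is correct when the chosen response outscores the rejected
# one. Real benchmarks layer per-category weighting and other details on top.
def pairwise_accuracy(chosen_scores: list[float], rejected_scores: list[float]) -> float:
    assert len(chosen_scores) == len(rejected_scores), "score lists must align"
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)


# Toy usage with invented scores: 2 of 3 pairs ranked correctly.
print(pairwise_accuracy([2.3, 0.7, 1.1], [1.9, 1.2, 0.4]))  # ~0.667
```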