Add descriptive tags to model card
This PR improves the model card by adding more descriptive tags to the metadata, such as `audio-question-answering`, `reinforcement-learning`, and `multimodal-llm`. These tags will help users discover the model more easily when searching on the Hub.

All existing content and links (to GitHub and the technical report on arXiv) remain unchanged, as they are already accurate and complete for this model.
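For reference, the metadata being edited is plain YAML front matter at the top of `README.md`, so the new tags can be pulled out with a few lines of standard-library Python. This is a minimal sketch: the `front_matter_tags` helper and the inline `CARD` string are illustrative, not part of the repository.

```python
# Minimal sketch: extract the `tags` list from a model card's YAML
# front matter using only the standard library (the structure is
# flat enough that no YAML parser is needed).
CARD = """\
---
library_name: transformers
license: apache-2.0
pipeline_tag: audio-text-to-text
tags:
- audio-question-answering
- reinforcement-learning
- multimodal-llm
---

# R1-AQA
"""

def front_matter_tags(card: str) -> list[str]:
    # The front matter sits between the first two '---' markers.
    body = card.split("---")[1]
    tags, in_tags = [], False
    for line in body.splitlines():
        if line.startswith("tags:"):
            in_tags = True
        elif in_tags and line.startswith("- "):
            tags.append(line[2:].strip())
        elif in_tags:
            break  # the tags block has ended
    return tags

print(front_matter_tags(CARD))
# ['audio-question-answering', 'reinforcement-learning', 'multimodal-llm']
```

On the Hub itself these tags become filterable facets, which is what makes the model easier to discover in search.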
README.md (CHANGED)

````diff
@@ -1,8 +1,11 @@
 ---
 library_name: transformers
 license: apache-2.0
-tags: []
 pipeline_tag: audio-text-to-text
+tags:
+- audio-question-answering
+- reinforcement-learning
+- multimodal-llm
 ---
 
 # R1-AQA --- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
@@ -38,16 +41,16 @@ Additional Notes:
 | Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
 | Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
 | SALAMONN | Direct Inference\* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
-| Qwen2-Audio-7B-Instruct | CoTA
-| Qwen2-Audio-7B-Instruct | Zero-Shot-CoT
+| Qwen2-Audio-7B-Instruct | CoTA [1] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
+| Qwen2-Audio-7B-Instruct | Zero-Shot-CoT [2] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
 | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 1️⃣** | 69.37 | - | 66.77 | - | 57.36 | - | 64.50 | - |
 | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 2️⃣** | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
 
 #### Notes
 
 \* The data are sourced from the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard).
-
-
+[1] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
+[2] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
 1️⃣ It is the original model, identical to the one on Hugging Face and described in our technical report.
 2️⃣ It is the model submitted to the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard), trained multiple times to achieve balanced results.
 
@@ -101,4 +104,4 @@ print(response)
 year={2025},
 url={https://github.com/xiaomi-research/r1-aqa; https://huggingface.co/mispeech/r1-aqa}
 }
-```
+```
````