Audio-Text-to-Text · Transformers · Safetensors · qwen2_audio · text2text-generation
nielsr (HF Staff) committed · Commit 5f4044d · verified · 1 Parent(s): e1068f8

Add descriptive tags to model card


This PR improves the model card by adding more descriptive tags to the metadata, such as `audio-question-answering`, `reinforcement-learning`, and `multimodal-llm`. These tags will help users discover the model more easily when searching on the Hub.

All existing content and links (to GitHub and the technical report on arXiv) remain unchanged as they are already accurate and complete for this model.

Files changed (1)
  1. README.md +9 -6
README.md CHANGED
````diff
@@ -1,8 +1,11 @@
 ---
 library_name: transformers
 license: apache-2.0
-tags: []
 pipeline_tag: audio-text-to-text
+tags:
+- audio-question-answering
+- reinforcement-learning
+- multimodal-llm
 ---
 
 # R1-AQA --- Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
@@ -38,16 +41,16 @@ Additional Notes:
 | Llama-3-8B-Instruct + Strong Cap. | Direct Inference\* | 50.75 | 49.10 | 48.93 | 48.93 | 55.25 | 62.70 | 52.10 | 53.57 |
 | Qwen2-Audio-7B-Instruct | Direct Inference\* | 54.95 | 45.90 | 50.98 | 53.26 | 42.04 | 45.90 | 49.20 | 52.50 |
 | SALAMONN | Direct Inference\* | 41.00 | 40.30 | 34.80 | 33.76 | 25.50 | 24.24 | 33.70 | 32.77 |
-| Qwen2-Audio-7B-Instruct | CoTA \[1\] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
-| Qwen2-Audio-7B-Instruct | Zero-Shot-CoT \[2\] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
+| Qwen2-Audio-7B-Instruct | CoTA [1] | 60.06 | - | 64.30 | - | 60.70 | - | 61.71 | - |
+| Qwen2-Audio-7B-Instruct | Zero-Shot-CoT [2] | 61.86 | - | 56.29 | - | 55.26 | - | 57.80 | - |
 | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 1️⃣** | 69.37 | - | 66.77 | - | 57.36 | - | 64.50 | - |
 | **Qwen2-Audio-7B-Instruct** | **GRPO (Ours) 2️⃣** | 68.77 | 69.76 | 64.37 | 61.40 | 63.66 | 62.70 | 65.60 | 64.36 |
 
 #### Notes
 
 \* The data are sourced from the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard).
-\[1\] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
-\[2\] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
+[1] Xie, Zhifei, et al. "Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models." arXiv preprint arXiv:2503.02318 (2025).
+[2] Ma, Ziyang, et al. "Audio-CoT: Exploring Chain-of-Thought Reasoning in Large Audio Language Model." arXiv preprint arXiv:2501.07246 (2025).
 1️⃣ It is the original model, identical to the one on Hugging Face and described in our technical report.
 2️⃣ It is the model submitted to the [MMAU leaderboard](https://sakshi113.github.io/mmau_homepage/#leaderboard), trained multiple times to achieve balanced results.
 
@@ -101,4 +104,4 @@ print(response)
 year={2025},
 url={https://github.com/xiaomi-research/r1-aqa; https://huggingface.co/mispeech/r1-aqa}
 }
-```
+```
````
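As a quick sanity check on the tag change above, the new tags can be pulled out of the card's YAML front matter. This is a minimal, dependency-free sketch (not the Hub's actual metadata parser) that only handles the flat list shape used in this card:

```python
# The updated front matter from this commit, embedded as a string for the check.
card = """---
library_name: transformers
license: apache-2.0
pipeline_tag: audio-text-to-text
tags:
- audio-question-answering
- reinforcement-learning
- multimodal-llm
---

# R1-AQA ...
"""

lines = card.splitlines()
assert lines[0] == "---"
end = lines.index("---", 1)           # closing delimiter of the front matter
meta_lines = lines[1:end]             # YAML body between the two "---" lines

# Collect the tag list entries ("- <tag>" lines).
tags = [line[2:] for line in meta_lines if line.startswith("- ")]
print(tags)  # → ['audio-question-answering', 'reinforcement-learning', 'multimodal-llm']
```

If any of these tags were malformed in the YAML, the Hub's search filters would not pick the model up, which is exactly what this PR is meant to enable.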