Update README.md (#6)
- Update README.md (8c1459b2c0c2384cf288f48ddbec31641a578178)
Co-authored-by: Gang Li <[email protected]>
README.md
CHANGED
@@ -15,6 +15,17 @@ R1-AQA is an audio question answering (AQA) model based on `Qwen2-Audio-7B-Instruct`
This implementation has achieved state-of-the-art performance on the MMAU *Test-mini* benchmark with only 38k post-training samples.
For more details, please refer to our [Github](https://github.com/xiaomi-research/r1-aqa) and [Technical Report](https://arxiv.org/abs/2503.11197).

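Since R1-AQA is based on `Qwen2-Audio-7B-Instruct`, inference presumably follows the standard Qwen2-Audio chat flow in Hugging Face `transformers`. The snippet below is only a minimal sketch under that assumption: the checkpoint id, audio file, and question are placeholders, and keyword names (e.g. `audios=`) may differ slightly between `transformers` versions.

```python
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "path/or/hub-id-of-r1-aqa"  # placeholder: point this at the released R1-AQA checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# One multiple-choice audio question in the Qwen2-Audio chat format (contents are illustrative).
conversation = [{
    "role": "user",
    "content": [
        {"type": "audio", "audio_url": "example.wav"},  # placeholder audio file
        {"type": "text", "text": "Which sound is heard? (A) dog bark (B) car horn (C) siren"},
    ],
}]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Load the waveform at the sampling rate the processor expects.
audio, _ = librosa.load("example.wav", sr=processor.feature_extractor.sampling_rate)
inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True).to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```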
+Our main findings are as follows:
+
+- The GRPO algorithm can be directly and effectively applied to the audio modality, even to `Qwen2-Audio-7B-Instruct` with only 8.2B parameters (see the sketch after this list).
+- With only 38k post-training samples, reinforcement learning outperforms supervised fine-tuning, indicating that RL-based approaches can be effective without large datasets.
+- The explicit reasoning process has not shown significant benefits for AQA tasks, and how to efficiently leverage *deep thinking* or step-by-step reasoning remains an open question for further research.
+- Large audio language models (LALMs) still lag far behind humans in auditory-language reasoning, suggesting that RL-based approaches warrant further exploration.
+
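GRPO (group relative policy optimization) scores each sampled answer against the other answers drawn for the same question, so no separate value network is needed. The sketch below shows only that group-relative advantage computation on made-up reward values; the function and reward rule are illustrative, not taken from the R1-AQA code.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize per-rollout rewards within one prompt's group:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical example: 4 sampled answers to one AQA question,
# rewarded 1.0 when the chosen option matches the reference answer.
sampled_answers = ["(A) dog bark", "(B) car horn", "(A) dog bark", "(C) siren"]
reference = "(A) dog bark"
rewards = [1.0 if ans == reference else 0.0 for ans in sampled_answers]

advantages = group_relative_advantages(rewards)
print(advantages)  # answers matching the reference get positive advantage
```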
+Additional Notes:
+The AVQA training set originally consists of approximately 40k samples. However, we use only about 38k samples because some data sources have become invalid. Other datasets that rely on YouTube sources, such as AudioSet, face a similar issue. We believe that the missing 2k samples do not have a significant impact on the training results.
+
+
### Table: Accuracies (%) on MMAU Test-mini benchmark
| Model | Method | Sound | Music | Speech | Average |
|--------------------------------------------|-------------------------|--------|--------|--------|---------|