---
datasets:
- DBD-research-group/BirdSet
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio-classification
- audio
---
# Bird-MAE-Base: Can Masked Autoencoders Also Listen to Birds?
- **Paper**: [ArXiv](https://arxiv.org/abs/2504.12880)
- **Repo**: [GitHub](https://github.com/DBD-research-group/Bird-MAE)
## Abstract
Masked Autoencoders (MAEs) have shown competitive results in audio classification by learning rich semantic representations through an efficient self-supervised reconstruction task. However, general-purpose models fail to generalize well when applied directly to fine-grained audio domains. Specifically, bird-sound classification requires distinguishing subtle inter-species differences and managing high intra-species acoustic variability, thereby revealing the performance limitations of general-domain Audio-MAE models. This work demonstrates that bridging this domain gap requires more than domain-specific pretraining data; adapting the entire training pipeline is crucial. We systematically revisit and adapt the pretraining recipe, fine-tuning methods, and frozen feature utilization to bird sounds using BirdSet, a large-scale bioacoustic dataset comparable to AudioSet. Our resulting Bird-MAE achieves new state-of-the-art results on BirdSet's multi-label classification benchmark. Additionally, we introduce parameter-efficient prototypical probing, which enhances the utility of frozen MAE representations and closely approaches fine-tuning performance in low-resource settings. Bird-MAE's prototypical probes outperform linear probing by up to 37 percentage points (pp) in MAP and narrow the gap to fine-tuning to approximately 3.3 pp on average across BirdSet downstream tasks. Bird-MAE also demonstrates robust few-shot capabilities with prototypical probing in our newly established few-shot benchmark on BirdSet, highlighting the potential of tailored self-supervised learning pipelines for fine-grained audio domains.
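The core idea behind prototypical probing can be sketched in a few lines of PyTorch: one learnable prototype per class is scored against the frozen patch-token embeddings of the encoder, and the per-class logit is max-pooled over patches. This is an illustrative sketch only; the exact head design, initialization, and pooling used in the paper may differ, and all names and dimensions below are placeholders.

```python
import torch
import torch.nn as nn

class PrototypicalProbe(nn.Module):
    """Illustrative prototypical probing head; see the paper for the exact design."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        # one learnable prototype per class, matching the encoder width
        self.prototypes = nn.Parameter(torch.randn(num_classes, embed_dim) * 0.02)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim) from the frozen encoder
        similarity = torch.einsum("bpd,cd->bpc", patch_tokens, self.prototypes)
        # max-pool over patches: the best-matching patch drives each class logit
        return similarity.max(dim=1).values  # (batch, num_classes)

# usage on stand-in features for a ViT-B/16 encoder (embed_dim=768)
probe = PrototypicalProbe(embed_dim=768, num_classes=21)
logits = probe(torch.randn(2, 256, 768))
```

Because only the prototypes are trained while the encoder stays frozen, the probe adds very few parameters compared to fine-tuning the full model.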
### Evaluation Results
**Table 1:** Probing results on the multi-label classification benchmark BirdSet with full data (MAP in %).
Comparison of linear vs. prototypical probing on frozen encoder representations; all models follow the BirdSet evaluation protocol. **Best** results are highlighted in bold.
| Model | Arch. | Probing | HSN (val) | POW | PER | NES | UHH | NBP | SSW | SNE |
|-------------|-----------|---------|--------|-------|-------|-------|-------|-------|-------|-------|
| BirdAVES | HUBERT | linear | 14.91 | 12.60 | 5.41 | 6.36 | 11.76 | 33.68 | 4.55 | 7.86 |
| BirdAVES | HUBERT | proto | 32.52 | 19.98 | 5.14 | 11.87 | 15.41 | 39.85 | 7.71 | 9.59 |
| SimCLR | CvT-13 | linear | 17.29 | 17.89 | 6.66 | 10.64 | 7.43 | 26.35 | 6.99 | 8.92 |
| SimCLR | CvT-13 | proto | 18.00 | 17.02 | 3.37 | 7.91 | 7.08 | 26.60 | 5.36 | 8.83 |
| Audio-MAE | ViT-B/16 | linear | 8.77 | 10.36 | 3.72 | 4.48 | 10.78 | 24.70 | 2.50 | 5.60 |
| Audio-MAE | ViT-B/16 | proto | 19.42 | 19.58 | 9.34 | 15.53 | 16.84 | 35.32 | 8.81 | 12.34 |
| Bird-MAE | ViT-B/16 | linear | 13.06 | 14.28 | 5.63 | 8.16 | 14.75 | 34.57 | 5.59 | 8.16 |
| Bird-MAE | ViT-B/16 | proto | 43.84 | 37.67 | 20.72 | 28.11 | 26.46 | 62.68 | 22.69 | 22.16 |
| Bird-MAE | ViT-L/16 | linear | 12.44 | 16.20 | 6.63 | 8.31 | 15.41 | 41.91 | 5.75 | 7.94 |
| Bird-MAE | ViT-L/16 | proto | **49.97** | **51.73** | **31.38** | **37.80** | **29.97** | **69.50** | **37.74** | **29.96** |
| Bird-MAE | ViT-H/14 | linear | 13.25 | 14.82 | 7.29 | 7.93 | 12.99 | 38.71 | 5.60 | 7.84 |
| Bird-MAE | ViT-H/14 | proto | 47.52 | 49.65 | 30.43 | 35.85 | 28.91 | 69.13 | 35.83 | 28.31 |
For more details, please refer to the paper.
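The scores above are MAP values for multi-label classification. As a rough reference for how such a metric can be computed, here is a minimal `torchmetrics` sketch; the exact BirdSet evaluation protocol is defined in the paper, so treat the `macro` averaging and the label count below as assumptions for illustration.

```python
import torch
from torchmetrics.classification import MultilabelAveragePrecision

num_labels = 21  # dataset-dependent, e.g. the number of species in a subset
metric = MultilabelAveragePrecision(num_labels=num_labels, average="macro")

logits = torch.randn(8, num_labels)             # model outputs (pre-sigmoid)
targets = torch.randint(0, 2, (8, num_labels))  # multi-hot ground-truth labels
print(metric(logits, targets))
```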
## Example
This model can be easily loaded and used for inference with the `transformers` library.
> Note that this checkpoint contains the pretrained encoder only; for classification you need to fine-tune a probing head on top.
> We provide both a linear and a prototypical probing head.
```python
import torch
import librosa
from transformers import AutoFeatureExtractor, AutoModel

# Load the pretrained encoder and its feature extractor
model = AutoModel.from_pretrained("DBD-research-group/Bird-MAE-Base", trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained("DBD-research-group/Bird-MAE-Base", trust_remote_code=True)
model.eval()

# Load an example audio file
audio_path = librosa.ex('robin')
# The model is trained on audio sampled at 32,000 Hz
audio, sample_rate = librosa.load(audio_path, sr=32_000)

# Convert the waveform to a mel spectrogram and embed it
mel_spectrogram = feature_extractor(audio)
with torch.no_grad():
    # The embedding dimensionality depends on the model size (e.g., 768 for ViT-B)
    embedding = model(mel_spectrogram)
```
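Since the checkpoint ships without a classifier, one simple option is to train a probing head on the frozen embedding from the snippet above. Below is a hypothetical sketch of a linear probe; it assumes the model returns a pooled embedding of shape `(batch, embed_dim)`, and `num_classes` is dataset-dependent.

```python
import torch.nn as nn

# Hypothetical linear probe on the frozen embedding from the snippet above;
# assumes `embedding` has shape (batch, embed_dim).
num_classes = 21  # e.g. the number of species in your downstream dataset
linear_probe = nn.Linear(embedding.shape[-1], num_classes)
logits = linear_probe(embedding)

# For the multi-label BirdSet tasks, train only the probe against
# multi-hot label vectors with a binary cross-entropy objective:
loss_fn = nn.BCEWithLogitsLoss()
```

A prototypical probe (as sketched in the abstract section above) can be swapped in the same way, trading the single linear layer for per-class prototypes scored against patch tokens.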
## Citation
```bibtex
@misc{rauch2025audiomae,
      title={Can Masked Autoencoders Also Listen to Birds?},
      author={Lukas Rauch and René Heinrich and Ilyass Moummad and Alexis Joly and Bernhard Sick and Christoph Scholz},
      year={2025},
      eprint={2504.12880},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.12880},
}
```