---
datasets:
- DBD-research-group/BirdSet
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio-classification
- audio
---

# Bird-MAE-Base: Can Masked Autoencoders Also Listen to Birds?

- **Paper**: [ArXiv](https://arxiv.org/abs/2504.12880)
- **Repo**: [GitHub](https://github.com/DBD-research-group/Bird-MAE)

## Abstract

Masked Autoencoders (MAEs) have shown competitive results in audio classification by learning rich semantic representations through an efficient self-supervised reconstruction task. However, general-purpose models fail to generalize well when applied directly to fine-grained audio domains. Specifically, bird-sound classification requires distinguishing subtle inter-species differences and managing high intra-species acoustic variability, thereby revealing the performance limitations of general-domain Audio-MAE models. This work demonstrates that bridging this domain gap requires more than domain-specific pretraining data; adapting the entire training pipeline is crucial. We systematically revisit and adapt the pretraining recipe, fine-tuning methods, and frozen feature utilization to bird sounds using BirdSet, a large-scale bioacoustic dataset comparable to AudioSet. Our resulting Bird-MAE achieves new state-of-the-art results on BirdSet's multi-label classification benchmark. Additionally, we introduce parameter-efficient prototypical probing, which enhances the utility of frozen MAE representations and closely approaches fine-tuning performance in low-resource settings. Bird-MAE's prototypical probes outperform linear probing by up to 37 percentage points (pp) in MAP and narrow the gap to fine-tuning to approximately 3.3 pp on average across BirdSet downstream tasks. Bird-MAE also demonstrates robust few-shot capabilities with prototypical probing in our newly established few-shot benchmark on BirdSet, highlighting the potential of tailored self-supervised learning pipelines for fine-grained audio domains.
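To make prototypical probing concrete, the sketch below shows one way a prototype-based probe over frozen patch embeddings can be set up. This is an illustrative simplification, not the paper's exact head: the class count, the single prototype per class, and the max-pooling over patches are assumptions here; see the GitHub repo for the official probing implementations.

```python
import torch
import torch.nn as nn

class PrototypicalProbe(nn.Module):
    """Illustrative prototype-based probing head (hypothetical, simplified).

    One learnable prototype per class; a clip's logit for a class is the
    maximum cosine similarity between any patch token and that prototype.
    """

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim) from the frozen encoder
        tokens = nn.functional.normalize(patch_tokens, dim=-1)
        protos = nn.functional.normalize(self.prototypes, dim=-1)
        similarity = tokens @ protos.T       # (batch, num_patches, num_classes)
        return similarity.max(dim=1).values  # pool over patches per class

# Only the probe is trained; the MAE encoder stays frozen.
probe = PrototypicalProbe(embed_dim=768, num_classes=21)  # 768 = ViT-B/16 width
logits = probe(torch.randn(4, 256, 768))                  # dummy patch embeddings
print(logits.shape)                                       # torch.Size([4, 21])
```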
### Evaluation Results

**Table 1:** Probing results on the multi-label classification benchmark BirdSet with full data (MAP %). Linear probing is compared against prototypical probing on frozen encoder representations; models follow the BirdSet evaluation protocol. **Best** results are highlighted.

| Model     | Arch.    | Probing | HSN<sup>val</sup> | POW   | PER   | NES   | UHH   | NBP   | SSW   | SNE   |
|-----------|----------|---------|--------|-------|-------|-------|-------|-------|-------|-------|
| BirdAVES  | HUBERT   | linear  | 14.91  | 12.60 | 5.41  | 6.36  | 11.76 | 33.68 | 4.55  | 7.86  |
| BirdAVES  | HUBERT   | proto   | 32.52  | 19.98 | 5.14  | 11.87 | 15.41 | 39.85 | 7.71  | 9.59  |
| SimCLR    | CvT-13   | linear  | 17.29  | 17.89 | 6.66  | 10.64 | 7.43  | 26.35 | 6.99  | 8.92  |
| SimCLR    | CvT-13   | proto   | 18.00  | 17.02 | 3.37  | 7.91  | 7.08  | 26.60 | 5.36  | 8.83  |
| Audio-MAE | ViT-B/16 | linear  | 8.77   | 10.36 | 3.72  | 4.48  | 10.78 | 24.70 | 2.50  | 5.60  |
| Audio-MAE | ViT-B/16 | proto   | 19.42  | 19.58 | 9.34  | 15.53 | 16.84 | 35.32 | 8.81  | 12.34 |
| Bird-MAE  | ViT-B/16 | linear  | 13.06  | 14.28 | 5.63  | 8.16  | 14.75 | 34.57 | 5.59  | 8.16  |
| Bird-MAE  | ViT-B/16 | proto   | 43.84  | 37.67 | 20.72 | 28.11 | 26.46 | 62.68 | 22.69 | 22.16 |
| Bird-MAE  | ViT-L/16 | linear  | 12.44  | 16.20 | 6.63  | 8.31  | 15.41 | 41.91 | 5.75  | 7.94  |
| Bird-MAE  | ViT-L/16 | proto   | **49.97** | **51.73** | **31.38** | **37.80** | **29.97** | **69.50** | **37.74** | **29.96** |
| Bird-MAE  | ViT-H/14 | linear  | 13.25  | 14.82 | 7.29  | 7.93  | 12.99 | 38.71 | 5.60  | 7.84  |
| Bird-MAE  | ViT-H/14 | proto   | 47.52  | 49.65 | 30.43 | 35.85 | 28.91 | 69.13 | 35.83 | 28.31 |

For more details, refer to the paper.

## Example

This model can be loaded and used for inference with the `transformers` library.

> Note that this is the base model; you need to fine-tune a classification head for downstream tasks.
> We provide both linear and prototypical probing heads.

```python
from transformers import AutoFeatureExtractor, AutoModel
import librosa

# Load the model and feature extractor
model = AutoModel.from_pretrained("DBD-research-group/Bird-MAE-Base", trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained("DBD-research-group/Bird-MAE-Base", trust_remote_code=True)
model.eval()

# Load an example audio file
audio_path = librosa.ex('robin')
# The model is trained on audio sampled at 32,000 Hz
audio, sample_rate = librosa.load(audio_path, sr=32_000)

# Convert the waveform to a mel spectrogram
mel_spectrogram = feature_extractor(audio)

# Extract the frozen embedding; its dimensionality depends on the model size
embedding = model(mel_spectrogram)
```

## Citation

```
@misc{rauch2025audiomae,
  title={Can Masked Autoencoders Also Listen to Birds?},
  author={Lukas Rauch and René Heinrich and Ilyass Moummad and Alexis Joly and Bernhard Sick and Christoph Scholz},
  year={2025},
  eprint={2504.12880},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.12880},
}
```
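The MAP scores above are average precision values macro-averaged over classes. As a reference for reproducing such numbers, here is a minimal sketch using scikit-learn's `average_precision_score`; the official BirdSet evaluation may differ in details such as class filtering.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Dummy multi-label setup: 4 clips, 3 classes
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.6],   # e.g. sigmoid outputs of a probe
                    [0.1, 0.8, 0.3],
                    [0.7, 0.6, 0.2],
                    [0.2, 0.1, 0.9]])

# Average precision per class, macro-averaged over classes
map_score = average_precision_score(y_true, y_score, average="macro")
print(f"MAP: {map_score:.4f}")
```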
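As a rough guide to fine-tuning a classification head on top of the frozen encoder (see the note in the Example section), the following sketch trains a simple linear probe. The embedding dimensionality, class count, and the shape returned by `model(...)` are assumptions; prefer the probing heads provided in the GitHub repo.

```python
import torch
import torch.nn as nn

# Hypothetical linear probe on the frozen Bird-MAE embedding.
# Assumes `model`, `feature_extractor`, and `mel_spectrogram` from the
# example above, a 768-dim embedding (ViT-B/16), and 21 target classes.
num_classes = 21
probe = nn.Linear(768, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # BirdSet tasks are multi-label

# Keep the encoder frozen; only the probe is trained.
for param in model.parameters():
    param.requires_grad = False

# One illustrative training step with a dummy multi-hot target.
target = torch.zeros(1, num_classes)
target[0, 3] = 1.0
with torch.no_grad():
    embedding = model(mel_spectrogram)  # assumed shape: (1, 768)
loss = criterion(probe(embedding), target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```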