---
datasets:
- DBD-research-group/BirdSet
pipeline_tag: audio-classification
library_name: transformers
tags:
- audio-classification
- audio
---

# Bird-MAE-Base: Can Masked Autoencoders Also Listen to Birds?

- **Paper**: [ArXiv](https://arxiv.org/abs/2504.12880)
- **Repo**: [GitHub](https://github.com/DBD-research-group/Bird-MAE)

## Abstract

Masked Autoencoders (MAEs) have shown competitive results in audio classification by learning rich semantic representations through an efficient self-supervised reconstruction task. However, general-purpose models fail to generalize well when applied directly to fine-grained audio domains. Specifically, bird-sound classification requires distinguishing subtle inter-species differences and managing high intra-species acoustic variability, thereby revealing the performance limitations of general-domain Audio-MAE models. This work demonstrates that bridging this domain gap requires more than domain-specific pretraining data; adapting the entire training pipeline is crucial. We systematically revisit and adapt the pretraining recipe, fine-tuning methods, and frozen feature utilization to bird sounds using BirdSet, a large-scale bioacoustic dataset comparable to AudioSet. Our resulting Bird-MAE achieves new state-of-the-art results on BirdSet's multi-label classification benchmark. Additionally, we introduce parameter-efficient prototypical probing, which enhances the utility of frozen MAE representations and closely approaches fine-tuning performance in low-resource settings. Bird-MAE's prototypical probes outperform linear probing by up to 37 percentage points (pp) in MAP and narrow the gap to fine-tuning to approximately 3.3 pp on average across BirdSet downstream tasks. Bird-MAE also demonstrates robust few-shot capabilities with prototypical probing in our newly established few-shot benchmark on BirdSet, highlighting the potential of tailored self-supervised learning pipelines for fine-grained audio domains.
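To make prototypical probing concrete, the sketch below shows one way a prototype-based probe over frozen patch embeddings can be set up. This is an illustrative simplification, not the paper's exact head: the class count, the single prototype per class, and the max-pooling over patches are assumptions here; see the GitHub repo for the official probing implementations.

```python
import torch
import torch.nn as nn

class PrototypicalProbe(nn.Module):
    """Illustrative prototype-based probing head (hypothetical, simplified).

    One learnable prototype per class; a clip's logit for a class is the
    maximum cosine similarity between any patch token and that prototype.
    """

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, embed_dim) from the frozen encoder
        tokens = nn.functional.normalize(patch_tokens, dim=-1)
        protos = nn.functional.normalize(self.prototypes, dim=-1)
        similarity = tokens @ protos.T       # (batch, num_patches, num_classes)
        return similarity.max(dim=1).values  # pool over patches per class

# Only the probe is trained; the MAE encoder stays frozen.
probe = PrototypicalProbe(embed_dim=768, num_classes=21)  # 768 = ViT-B/16 width
logits = probe(torch.randn(4, 256, 768))                  # dummy patch embeddings
print(logits.shape)                                       # torch.Size([4, 21])
```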
### Evaluation Results

**Table 1:** Probing results on the multi-label classification benchmark BirdSet with full data (MAP %). Linear probing is compared against prototypical probing on frozen encoder representations; models follow the BirdSet evaluation protocol. **Best** results are highlighted.

| Model     | Arch.    | Probing | HSN<sup>val</sup> | POW   | PER   | NES   | UHH   | NBP   | SSW   | SNE   |
|-----------|----------|---------|--------|-------|-------|-------|-------|-------|-------|-------|
| BirdAVES  | HUBERT   | linear  | 14.91  | 12.60 | 5.41  | 6.36  | 11.76 | 33.68 | 4.55  | 7.86  |
| BirdAVES  | HUBERT   | proto   | 32.52  | 19.98 | 5.14  | 11.87 | 15.41 | 39.85 | 7.71  | 9.59  |
| SimCLR    | CvT-13   | linear  | 17.29  | 17.89 | 6.66  | 10.64 | 7.43  | 26.35 | 6.99  | 8.92  |
| SimCLR    | CvT-13   | proto   | 18.00  | 17.02 | 3.37  | 7.91  | 7.08  | 26.60 | 5.36  | 8.83  |
| Audio-MAE | ViT-B/16 | linear  | 8.77   | 10.36 | 3.72  | 4.48  | 10.78 | 24.70 | 2.50  | 5.60  |
| Audio-MAE | ViT-B/16 | proto   | 19.42  | 19.58 | 9.34  | 15.53 | 16.84 | 35.32 | 8.81  | 12.34 |
| Bird-MAE  | ViT-B/16 | linear  | 13.06  | 14.28 | 5.63  | 8.16  | 14.75 | 34.57 | 5.59  | 8.16  |
| Bird-MAE  | ViT-B/16 | proto   | 43.84  | 37.67 | 20.72 | 28.11 | 26.46 | 62.68 | 22.69 | 22.16 |
| Bird-MAE  | ViT-L/16 | linear  | 12.44  | 16.20 | 6.63  | 8.31  | 15.41 | 41.91 | 5.75  | 7.94  |
| Bird-MAE  | ViT-L/16 | proto   | **49.97** | **51.73** | **31.38** | **37.80** | **29.97** | **69.50** | **37.74** | **29.96** |
| Bird-MAE  | ViT-H/14 | linear  | 13.25  | 14.82 | 7.29  | 7.93  | 12.99 | 38.71 | 5.60  | 7.84  |
| Bird-MAE  | ViT-H/14 | proto   | 47.52  | 49.65 | 30.43 | 35.85 | 28.91 | 69.13 | 35.83 | 28.31 |

For more details, refer to the paper.

## Example

This model can be loaded and used for inference with the `transformers` library.

> Note that this is the base model; you need to fine-tune a classification head for downstream tasks.
> We provide both linear and prototypical probing heads.

```python
from transformers import AutoFeatureExtractor, AutoModel
import librosa

# Load the model and feature extractor
model = AutoModel.from_pretrained("DBD-research-group/Bird-MAE-Base", trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained("DBD-research-group/Bird-MAE-Base", trust_remote_code=True)
model.eval()

# Load an example audio file
audio_path = librosa.ex('robin')
# The model is trained on audio sampled at 32,000 Hz
audio, sample_rate = librosa.load(audio_path, sr=32_000)

# Convert the waveform to a mel spectrogram
mel_spectrogram = feature_extractor(audio)

# Extract the frozen embedding; its dimensionality depends on the model size
embedding = model(mel_spectrogram)
```

## Citation

```
@misc{rauch2025audiomae,
  title={Can Masked Autoencoders Also Listen to Birds?},
  author={Lukas Rauch and René Heinrich and Ilyass Moummad and Alexis Joly and Bernhard Sick and Christoph Scholz},
  year={2025},
  eprint={2504.12880},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2504.12880},
}
```
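The MAP scores above are average precision values macro-averaged over classes. As a reference for reproducing such numbers, here is a minimal sketch using scikit-learn's `average_precision_score`; the official BirdSet evaluation may differ in details such as class filtering.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Dummy multi-label setup: 4 clips, 3 classes
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0],
                   [0, 0, 1]])
y_score = np.array([[0.9, 0.2, 0.6],   # e.g. sigmoid outputs of a probe
                    [0.1, 0.8, 0.3],
                    [0.7, 0.6, 0.2],
                    [0.2, 0.1, 0.9]])

# Average precision per class, macro-averaged over classes
map_score = average_precision_score(y_true, y_score, average="macro")
print(f"MAP: {map_score:.4f}")
```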
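As a rough guide to fine-tuning a classification head on top of the frozen encoder (see the note in the Example section), the following sketch trains a simple linear probe. The embedding dimensionality, class count, and the shape returned by `model(...)` are assumptions; prefer the probing heads provided in the GitHub repo.

```python
import torch
import torch.nn as nn

# Hypothetical linear probe on the frozen Bird-MAE embedding.
# Assumes `model`, `feature_extractor`, and `mel_spectrogram` from the
# example above, a 768-dim embedding (ViT-B/16), and 21 target classes.
num_classes = 21
probe = nn.Linear(768, num_classes)
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()  # BirdSet tasks are multi-label

# Keep the encoder frozen; only the probe is trained.
for param in model.parameters():
    param.requires_grad = False

# One illustrative training step with a dummy multi-hot target.
target = torch.zeros(1, num_classes)
target[0, 3] = 1.0
with torch.no_grad():
    embedding = model(mel_spectrogram)  # assumed shape: (1, 768)
loss = criterion(probe(embedding), target)
loss.backward()
optimizer.step()
optimizer.zero_grad()
```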