Urban Sound Tagging with Audio Spectrogram Transformer

Model Description

Fine-tuned Audio Spectrogram Transformer (AST) for hierarchical urban sound classification on the SONYC-UST dataset.

Base Model: MIT/ast-finetuned-audioset-10-10-0.4593
Dataset: SONYC-UST (Sounds of New York City - Urban Sound Tagging)
Task: Multi-label hierarchical audio classification
Fine-grained Classes: 29
Coarse Classes: 8

Performance

Test Set Results

Metric	Coarse (Macro)	Fine (Macro)
AUPRC ⭐	0.3184	0.3983
AUC	0.7828	0.9611
F1	0.2070	0.1520

AUPRC (Average Precision) is the primary evaluation metric for DCASE urban sound tagging tasks.
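
For readers unfamiliar with the metric, the following is a minimal sketch of macro-averaged AUPRC computed as step-wise average precision per class. The function names are illustrative and not part of this repository, ties between scores are not grouped, and classes without positive examples are skipped (an assumption; evaluation toolkits may handle these cases differently).

```python
def average_precision(y_true, y_score):
    """Step-wise average precision for a single class.

    Clips are ranked by descending score; precision is sampled at each
    positive clip and averaged over all positives.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        return None  # undefined when the class never occurs
    ranked = sorted(zip(y_score, y_true), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_pos


def macro_auprc(y_true, y_score):
    """Unweighted mean of per-class AP. Rows are clips, columns are classes."""
    n_classes = len(y_true[0])
    per_class = []
    for c in range(n_classes):
        ap = average_precision([row[c] for row in y_true],
                               [row[c] for row in y_score])
        if ap is not None:
            per_class.append(ap)
    return sum(per_class) / len(per_class)


# Toy example: 4 clips, 2 classes (made-up labels and scores)
labels = [[1, 0], [0, 1], [1, 1], [0, 0]]
scores = [[0.9, 0.2], [0.1, 0.8], [0.4, 0.3], [0.35, 0.6]]
print(f"macro AUPRC: {macro_auprc(labels, scores):.4f}")
```

Because the macro average weights every class equally, rare classes influence the score as much as frequent ones.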

Training Configuration

Parameter	Value
Batch size	128
Learning rate	5e-5
Epochs	20
Backbone	Frozen (last 4 layers unfrozen)
Mixed precision	Enabled (FP16)
Attention	SDPA (Scaled Dot Product)
Data loading	In-memory
Dataset normalization	SONYC-UST stats
Augmentation	SpecAugment + Mixup
Mixup α	0.2
Num workers	8
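
To illustrate the Mixup setting listed above, here is a minimal sketch that blends two examples with a weight drawn from Beta(α, α) at α = 0.2. It works on plain Python lists for clarity; the actual training code is not shown in this card and presumably operates on batched spectrogram tensors, so `mixup_pair` and its arguments are hypothetical.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two (features, multi-hot labels) examples with Mixup.

    lam is sampled from Beta(alpha, alpha); with alpha = 0.2 most draws
    fall close to 0 or 1, so one example usually dominates the mix.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# Toy example: two 3-value "spectrograms" with 2-class multi-hot labels
rng = random.Random(0)
x, y, lam = mixup_pair([1.0, 1.0, 1.0], [1, 0], [0.0, 0.0, 0.0], [0, 1], rng=rng)
print(f"lambda={lam:.3f}, mixed labels={y}")
```

The mixed labels are soft targets, which works with a binary cross-entropy loss without further changes.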

Quick Start

import sys

from huggingface_hub import snapshot_download
from transformers import AutoFeatureExtractor

# Download model files
repo_path = snapshot_download(
    repo_id="xd-br0/ast-sonyc-ust-v2",
    allow_patterns=["*.py", "*.json", "*.safetensors"]
)

# Add to Python path
sys.path.insert(0, repo_path)

# Import custom classes
from configuration_ast_sonyc import ASTSONYCConfig
from modeling_ast_sonyc import ASTSONYCModel
from pipeline_sonyc import SONYCAudioPipeline

# Load model
config = ASTSONYCConfig.from_pretrained(repo_path)
model = ASTSONYCModel.from_pretrained(repo_path, config=config)
feature_extractor = AutoFeatureExtractor.from_pretrained(repo_path)

# Create classifier
classifier = SONYCAudioPipeline(
    model=model,
    feature_extractor=feature_extractor
)

# Classify audio
results = classifier("urban_sound.wav", top_k=5, threshold=0.3)

# Display results
print("Fine-grained predictions:")
for pred in results["fine"]:
    print(f"{pred['label']}: {pred['score']:.3f}")

print("\nCoarse-grained predictions:")
for pred in results["coarse"]:
    print(f"{pred['label']}: {pred['score']:.3f}")

Alternative: One-liner Helper Function

import sys

from huggingface_hub import snapshot_download
from transformers import AutoFeatureExtractor

def load_ast_sonyc_classifier(repo_id="xd-br0/ast-sonyc-ust-v2"):
    """Load AST-SONYC classifier from HuggingFace Hub"""
    # Download and setup
    repo_path = snapshot_download(repo_id=repo_id, allow_patterns=["*.py", "*.json", "*.safetensors"])
    sys.path.insert(0, repo_path)

    # Import custom classes
    from configuration_ast_sonyc import ASTSONYCConfig
    from modeling_ast_sonyc import ASTSONYCModel
    from pipeline_sonyc import SONYCAudioPipeline

    # Load model
    config = ASTSONYCConfig.from_pretrained(repo_path)
    model = ASTSONYCModel.from_pretrained(repo_path, config=config)
    feature_extractor = AutoFeatureExtractor.from_pretrained(repo_path)

    return SONYCAudioPipeline(model=model, feature_extractor=feature_extractor)

# Usage
classifier = load_ast_sonyc_classifier()
results = classifier("audio.wav")

Advanced: Get All Predictions

# Get all predictions regardless of confidence threshold
results = classifier("audio.wav", return_all=True)

# Or use a custom threshold
results = classifier("audio.wav", threshold=0.1, top_k=10)
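
The internals of `SONYCAudioPipeline` are not shown in this card. As one plausible reading of how `threshold`, `top_k`, and `return_all` interact, the sketch below filters a label-to-score mapping and formats it like the pipeline output. `select_predictions` is a hypothetical helper, and the real pipeline may order or combine these options differently.

```python
def select_predictions(scores, threshold, top_k, return_all=False):
    """Turn {label: score} into a ranked list of {"label", "score"} dicts.

    Assumed behavior: keep scores >= threshold, then keep the top_k best.
    With return_all=True both filters are skipped.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not return_all:
        ranked = [(label, s) for label, s in ranked if s >= threshold][:top_k]
    return [{"label": label, "score": s} for label, s in ranked]


# Toy scores for three fine-grained labels (made-up values)
fine_scores = {"car-horn": 0.89, "siren": 0.76, "dog-barking-whining": 0.05}
print(select_predictions(fine_scores, threshold=0.3, top_k=5))
print(select_predictions(fine_scores, threshold=0.3, top_k=5, return_all=True))
```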

Label Hierarchy

Coarse Classes (8)

alert-signal
dog
engine
human-voice
machinery-impact
music
non-machinery-impact
powered-saw

Fine-grained Classes (29)

amplified-speech
car-alarm
car-horn
chainsaw
dog-barking-whining
engine-of-uncertain-size
hoe-ram
ice-cream-truck
jackhammer
large-crowd
large-rotating-saw
large-sounding-engine
medium-sounding-engine
mobile-music
music-from-uncertain-source
non-machinery-impact
other-unknown-alert-signal
other-unknown-human-voice
other-unknown-impact-machinery
other-unknown-powered-saw
person-or-small-group-shouting
person-or-small-group-talking
pile-driver
reverse-beeper
rock-drill
siren
small-medium-rotating-saw
small-sounding-engine
stationary-music
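
The two lists above are related by a parent/child mapping. The sketch below groups the 29 fine labels under the 8 coarse labels following the SONYC-UST taxonomy as I understand it; please verify the grouping against the dataset's published taxonomy file before relying on it. `coarse_from_fine` is a hypothetical helper that only illustrates the hierarchy; the model itself predicts coarse labels with a separate head.

```python
COARSE_TO_FINE = {
    "engine": ["small-sounding-engine", "medium-sounding-engine",
               "large-sounding-engine", "engine-of-uncertain-size"],
    "machinery-impact": ["rock-drill", "jackhammer", "hoe-ram", "pile-driver",
                         "other-unknown-impact-machinery"],
    "non-machinery-impact": ["non-machinery-impact"],
    "powered-saw": ["chainsaw", "small-medium-rotating-saw",
                    "large-rotating-saw", "other-unknown-powered-saw"],
    "alert-signal": ["car-horn", "car-alarm", "siren", "reverse-beeper",
                     "other-unknown-alert-signal"],
    "music": ["stationary-music", "mobile-music", "ice-cream-truck",
              "music-from-uncertain-source"],
    "human-voice": ["person-or-small-group-talking",
                    "person-or-small-group-shouting", "large-crowd",
                    "amplified-speech", "other-unknown-human-voice"],
    "dog": ["dog-barking-whining"],
}

# Reverse lookup: fine label -> coarse parent
FINE_TO_COARSE = {fine: coarse
                  for coarse, fines in COARSE_TO_FINE.items()
                  for fine in fines}


def coarse_from_fine(fine_scores):
    """Roll fine scores up to their parents by taking the max per group."""
    coarse = {}
    for label, score in fine_scores.items():
        parent = FINE_TO_COARSE[label]
        coarse[parent] = max(coarse.get(parent, 0.0), score)
    return coarse


print(coarse_from_fine({"siren": 0.7, "car-horn": 0.2, "jackhammer": 0.4}))
```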

Model Architecture

Input Audio (16kHz, 10s)
    ↓
Audio Spectrogram Transformer (MIT/AST)
    ↓
Patch Embeddings (1212 patches at the base checkpoint's input settings)
    ↓
Transformer Encoder (12 layers)
    ↓
[CLS] Token Pooling
    ↓
Hierarchical Classification
    ├─→ Fine-grained Head → 29 classes
    └─→ Coarse Head → 8 classes
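
As a rough check on the patch-embedding stage, the sketch below computes the patch grid for an AST input. The default values (128 mel bins, 1024 frames, 16×16 patches, stride 10 on both axes) are assumptions taken from the base checkpoint's published configuration; a checkpoint with different input settings would give a different count. `ast_patch_grid` is an illustrative helper, not a library function.

```python
def ast_patch_grid(num_mel_bins=128, max_length=1024, patch_size=16,
                   frequency_stride=10, time_stride=10):
    """Return (frequency patches, time patches) for overlapping patches."""
    freq = (num_mel_bins - patch_size) // frequency_stride + 1
    time = (max_length - patch_size) // time_stride + 1
    return freq, time


freq, time = ast_patch_grid()
num_patches = freq * time
# AST prepends two special tokens (classification and distillation)
sequence_length = num_patches + 2
print(f"{freq} x {time} = {num_patches} patches, sequence length {sequence_length}")
```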

Training Details

Optimizer: AdamW with differential learning rates
- Backbone: 1e-5
- Classification heads: 1e-4
Loss Function: Hierarchical BCE with fine→coarse masking
Augmentation:
- SpecAugment (time + frequency masking)
- Mixup (α=0.2)
- Gaussian noise
Mixed Precision: FP16/BF16
Batch Size: 128
Scheduler: Cosine annealing with warmup
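
The optimizer and scheduler settings above can be sketched as a per-step learning-rate function: linear warmup followed by cosine decay, applied separately to the backbone and head base rates. The warmup length, total step count, and final rate of zero are assumptions chosen for illustration; the card does not state them.

```python
import math

# Base learning rates per parameter group, as listed above
BASE_LRS = {"backbone": 1e-5, "heads": 1e-4}


def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def group_lrs(step, total_steps, warmup_steps):
    """Learning rate of every parameter group at a given step."""
    return {name: lr_at_step(step, total_steps, warmup_steps, lr)
            for name, lr in BASE_LRS.items()}


# Hypothetical run: 1000 steps with 100 warmup steps
for step in (0, 99, 550, 1000):
    print(step, group_lrs(step, total_steps=1000, warmup_steps=100))
```

Using a lower rate for the backbone keeps the pre-trained layers close to their initial weights while the newly initialized heads adapt faster.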

Output Format

The model returns predictions in a hierarchical format:

{
    "fine": [
        {"label": "car-horn", "score": 0.89},
        {"label": "siren", "score": 0.76},
        ...
    ],
    "coarse": [
        {"label": "alert-signal", "score": 0.92},
        {"label": "engine", "score": 0.45},
        ...
    ]
}
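
A small sketch of how downstream code might consume this structure, assuming the output is a plain dict of lists as shown above. The helper names `top_label` and `to_score_map` are hypothetical and not part of the repository.

```python
def top_label(results, level):
    """Return the highest-scoring label at "fine" or "coarse", or None."""
    preds = results.get(level, [])
    if not preds:
        return None
    return max(preds, key=lambda p: p["score"])["label"]


def to_score_map(results):
    """Flatten each level into a {label: score} dict for easy lookup."""
    return {level: {p["label"]: p["score"] for p in preds}
            for level, preds in results.items()}


# Example output in the documented format (made-up scores)
results = {
    "fine": [{"label": "car-horn", "score": 0.89},
             {"label": "siren", "score": 0.76}],
    "coarse": [{"label": "alert-signal", "score": 0.92},
               {"label": "engine", "score": 0.45}],
}
print(top_label(results, "fine"))
print(to_score_map(results)["coarse"])
```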

Limitations

Geographic Bias: Trained exclusively on New York City urban sounds; may not generalize well to other cities or rural environments
Fixed Duration: Designed for 10-second audio clips; longer clips may need to be split into windows and shorter clips padded
Temporal Resolution: Cannot localize sounds within the 10-second window
Class Imbalance: Some classes (e.g., dog barking) have fewer training examples
Environmental Factors: Performance may degrade with:
- High background noise
- Multiple overlapping sound sources
- Weather conditions (wind, rain)
- Recording quality issues
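
For the fixed-duration limitation, one possible preprocessing step is to cut a recording into non-overlapping 10-second windows and zero-pad the last one. The sketch below does this on a plain list of samples at 16 kHz. `split_into_windows` is a hypothetical helper; the feature extractor may already pad or truncate its input, and how to combine per-window predictions (for example, taking the maximum score per label) is a separate design choice.

```python
def split_into_windows(samples, sample_rate=16000, window_seconds=10.0,
                       pad_value=0.0):
    """Split a mono waveform into fixed-length windows, padding the last."""
    window = int(sample_rate * window_seconds)
    chunks = []
    for start in range(0, len(samples), window):
        chunk = list(samples[start:start + window])
        # Pad the final (or a too-short) chunk up to the full window length
        chunk.extend([pad_value] * (window - len(chunk)))
        chunks.append(chunk)
    return chunks


# Toy example: a 25-second "recording" of constant samples
waveform = [0.1] * (25 * 16000)
windows = split_into_windows(waveform)
print(len(windows), [len(w) for w in windows])
```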

Ethical Considerations

Privacy: The model detects the presence of human voices but does not identify individual speakers
Bias: Training data reflects NYC's specific urban soundscape
Intended Use: Environmental monitoring, noise pollution analysis, urban planning
Misuse Prevention: Should not be used for surveillance without proper consent

Citation

If you use this model, please cite:

@inproceedings{bello2019sonyc,
  title={SONYC Urban Sound Tagging (SONYC-UST): a multilabel dataset from an urban acoustic sensor network},
  author={Bello, Juan Pablo and Silva, Claudio and Mydlarz, Charlie and Salamon, Justin and Doraiswamy, Harish and Arora, Aneesh and Cartwright, Mark},
  booktitle={Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE)},
  pages={35--39},
  year={2019}
}

@inproceedings{gong2021ast,
  title={AST: Audio Spectrogram Transformer},
  author={Gong, Yuan and Chung, Yu-An and Glass, James},
  booktitle={Interspeech},
  year={2021}
}

License

This model is released under the MIT License. See LICENSE for details.

Acknowledgments

MIT for the pre-trained AST model
NYU SONYC team for the SONYC-UST dataset
HuggingFace for the Transformers library

Contact

For questions, issues, or contributions:

Model Repository: GitLab
Issues: Please report bugs and feature requests on GitHub

Note: This model is intended for research and educational purposes. For production deployment, please evaluate performance on your specific use case and environment.

Model size: 86.2M params
Tensor type: F32 (Safetensors)

Evaluation dataset: CLAPv2/SONYC-UST (results self-reported)