Dataset: CLAPv2/SONYC-UST
How to use xd-br0/ast-sonyc-ust-v2 with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("audio-classification", model="xd-br0/ast-sonyc-ust-v2")
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("xd-br0/ast-sonyc-ust-v2", dtype="auto")

Fine-tuned Audio Spectrogram Transformer (AST) for hierarchical urban sound classification on the SONYC-UST dataset.
| Metric | Coarse (Macro) | Fine (Macro) |
|---|---|---|
| AUPRC ⭐ | 0.3184 | 0.3983 |
| AUC | 0.7828 | 0.9611 |
| F1 | 0.2070 | 0.1520 |
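AUPRC (area under the precision-recall curve, reported here as average precision) is the primary evaluation metric for DCASE urban sound tagging tasks, which is why it is marked with ⭐ in the table.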
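To make the "Macro" columns concrete, here is a minimal pure-Python sketch of macro-averaged average precision. It is not the evaluation script used for this model; the function names and the toy labels and scores are made up for illustration, and ties between scores are not handled the way a full evaluation library would handle them.

```python
def average_precision(y_true, y_score):
    """Average precision for one class: mean precision at the rank of each positive."""
    # Rank clips from highest to lowest score (ties keep their input order).
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if y_true[idx]:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("average precision is undefined without positive labels")
    return precision_sum / hits


def macro_auprc(labels_by_class, scores_by_class):
    """Unweighted mean of the per-class average precision values."""
    per_class = [
        average_precision(t, s) for t, s in zip(labels_by_class, scores_by_class)
    ]
    return sum(per_class) / len(per_class)


# Toy data: two classes, four clips each (illustrative values, not model outputs).
labels = [[1, 0, 1, 0], [0, 1, 0, 0]]
scores = [[0.9, 0.8, 0.7, 0.1], [0.2, 0.6, 0.7, 0.1]]
print(round(macro_auprc(labels, scores), 4))  # 0.6667
```

Because the macro average weights every class equally, rare classes with poor precision pull the score down as much as frequent ones.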
| Parameter | Value |
|---|---|
| Batch size | 128 |
| Learning rate | 5e-5 |
| Epochs | 20 |
| Backbone | Frozen (last 4 layers unfrozen) |
| Mixed precision | Enabled (FP16) |
| Attention | SDPA (Scaled Dot Product) |
| Data loading | In-memory |
| Dataset normalization | SONYC-UST stats |
| Augmentation | SpecAugment + Mixup |
| Mixup α | 0.2 |
| Num workers | 8 |
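The Mixup row can be illustrated with a short sketch. This is the standard Mixup recipe with α = 0.2, written in plain Python for readability; whether the training code mixes spectrograms or waveforms, and how it pairs examples, is not stated here, so `mixup_pair` and its arguments are hypothetical.

```python
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two examples and their multi-hot labels with a Beta(alpha, alpha) weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    x_mix = [lam * a + (1 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


# Toy inputs: flattened features and two-class multi-hot labels (made up).
rng = random.Random(0)
x_mix, y_mix, lam = mixup_pair(
    [1.0, 2.0, 3.0], [1.0, 0.0],
    [0.0, 0.0, 0.0], [0.0, 1.0],
    alpha=0.2, rng=rng,
)
print(lam, x_mix, y_mix)
```

With a small α such as 0.2, the Beta distribution concentrates near 0 and 1, so most mixed examples stay close to one of the two originals.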
from transformers import AutoConfig, AutoModel, AutoFeatureExtractor
from huggingface_hub import snapshot_download
import sys
# Download model files
repo_path = snapshot_download(
    repo_id="xd-br0/ast-sonyc-ust-v2",
    allow_patterns=["*.py", "*.json", "*.safetensors"],
)
# Add to Python path
sys.path.insert(0, repo_path)
# Import custom classes
from configuration_ast_sonyc import ASTSONYCConfig
from modeling_ast_sonyc import ASTSONYCModel
from pipeline_sonyc import SONYCAudioPipeline
# Load model
config = ASTSONYCConfig.from_pretrained(repo_path)
model = ASTSONYCModel.from_pretrained(repo_path, config=config)
feature_extractor = AutoFeatureExtractor.from_pretrained(repo_path)
# Create classifier
classifier = SONYCAudioPipeline(
    model=model,
    feature_extractor=feature_extractor,
)
# Classify audio
results = classifier("urban_sound.wav", top_k=5, threshold=0.3)
# Display results
print("Fine-grained predictions:")
for pred in results["fine"]:
    print(f"{pred['label']}: {pred['score']:.3f}")
print("\nCoarse-grained predictions:")
for pred in results["coarse"]:
    print(f"{pred['label']}: {pred['score']:.3f}")
from transformers import AutoConfig, AutoModel, AutoFeatureExtractor
from huggingface_hub import snapshot_download
import sys
def load_ast_sonyc_classifier(repo_id="xd-br0/ast-sonyc-ust-v2"):
    """Load AST-SONYC classifier from HuggingFace Hub"""
    # Download and setup
    repo_path = snapshot_download(repo_id=repo_id, allow_patterns=["*.py", "*.json", "*.safetensors"])
    sys.path.insert(0, repo_path)
    # Import custom classes
    from configuration_ast_sonyc import ASTSONYCConfig
    from modeling_ast_sonyc import ASTSONYCModel
    from pipeline_sonyc import SONYCAudioPipeline
    # Load model
    config = ASTSONYCConfig.from_pretrained(repo_path)
    model = ASTSONYCModel.from_pretrained(repo_path, config=config)
    feature_extractor = AutoFeatureExtractor.from_pretrained(repo_path)
    return SONYCAudioPipeline(model=model, feature_extractor=feature_extractor)
# Usage
classifier = load_ast_sonyc_classifier()
results = classifier("audio.wav")
# Get all predictions regardless of confidence threshold
results = classifier("audio.wav", return_all=True)
# Or use a custom threshold
results = classifier("audio.wav", threshold=0.1, top_k=10)
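The interaction between `threshold`, `top_k`, and `return_all` can be pictured with a small stand-alone sketch. This is an assumption about plausible semantics (filter by threshold, then keep the `top_k` highest scores), not the actual code in `pipeline_sonyc.py`; `select_predictions`, its defaults, and the placeholder labels are hypothetical.

```python
def select_predictions(scores, threshold=0.3, top_k=5, return_all=False):
    """Turn a {label: score} mapping into a ranked list of prediction dicts."""
    # Sort labels from most to least confident.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not return_all:
        # Assumed order of operations: drop low scores first, then truncate.
        ranked = [(label, s) for label, s in ranked if s >= threshold][:top_k]
    return [{"label": label, "score": s} for label, s in ranked]


# Made-up per-class scores for one clip.
clip_scores = {"car-horn": 0.89, "siren": 0.76, "label-c": 0.45, "label-d": 0.05}

print(select_predictions(clip_scores, threshold=0.3, top_k=2))
print(select_predictions(clip_scores, return_all=True))
```

Under these assumed semantics, lowering `threshold` only adds predictions at the bottom of the ranking, while `top_k` caps how many are returned.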
Input Audio (16kHz, 10s)
↓
Audio Spectrogram Transformer (MIT/AST)
↓
Patch Embeddings (197 patches)
↓
Transformer Encoder (12 layers)
↓
[CLS] Token Pooling
↓
Hierarchical Classification
├─→ Fine-grained Head → 29 classes
└─→ Coarse Head → 8 classes
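The last stage of the diagram, two classification heads sharing one pooled embedding, might look roughly like the PyTorch sketch below. It is not the code in `modeling_ast_sonyc.py`: the class name `TwoHeadClassifier`, the hidden size of 768, and the sigmoid activation (chosen because SONYC-UST is a multilabel dataset) are all assumptions.

```python
import torch
from torch import nn


class TwoHeadClassifier(nn.Module):
    """Two independent linear heads on top of a shared pooled embedding."""

    def __init__(self, hidden_size=768, num_fine=29, num_coarse=8):
        super().__init__()
        self.fine_head = nn.Linear(hidden_size, num_fine)
        self.coarse_head = nn.Linear(hidden_size, num_coarse)

    def forward(self, pooled):
        # Sigmoid gives an independent probability per class (multilabel setup).
        return {
            "fine": torch.sigmoid(self.fine_head(pooled)),
            "coarse": torch.sigmoid(self.coarse_head(pooled)),
        }


heads = TwoHeadClassifier()
dummy_pooled = torch.randn(2, 768)  # stand-in for [CLS] embeddings of two clips
probs = heads(dummy_pooled)
print(probs["fine"].shape, probs["coarse"].shape)
```

Sharing the encoder and splitting only at the heads means both levels of the taxonomy are predicted from a single forward pass.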
The model returns predictions in a hierarchical format:
{
"fine": [
{"label": "car-horn", "score": 0.89},
{"label": "siren", "score": 0.76},
...
],
"coarse": [
{"label": "alert-signal", "score": 0.92},
{"label": "engine", "score": 0.45},
...
]
}
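Given that format, downstream code can treat the result as a plain dictionary of lists. The helpers below are a small illustrative sketch, not part of the model repository; `labels_above` and `best_prediction` are hypothetical names, and the sample dictionary reuses only the entries shown above.

```python
# Sample result containing only the entries shown in the format above.
results = {
    "fine": [
        {"label": "car-horn", "score": 0.89},
        {"label": "siren", "score": 0.76},
    ],
    "coarse": [
        {"label": "alert-signal", "score": 0.92},
        {"label": "engine", "score": 0.45},
    ],
}


def labels_above(results, level, min_score):
    """Labels at one level ('fine' or 'coarse') scoring at least min_score."""
    return [p["label"] for p in results[level] if p["score"] >= min_score]


def best_prediction(results, level):
    """Highest-scoring prediction at one level, or None if the list is empty."""
    preds = results[level]
    return max(preds, key=lambda p: p["score"]) if preds else None


print(labels_above(results, "coarse", 0.5))
print(best_prediction(results, "fine"))
```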
If you use this model, please cite:
@inproceedings{bello2019sonyc,
title={SONYC Urban Sound Tagging (SONYC-UST): a multilabel dataset from an urban acoustic sensor network},
author={Bello, Juan Pablo and Silva, Claudio and Mydlarz, Charlie and Salamon, Justin and Doraiswamy, Harish and Arora, Aneesh and Cartwright, Mark},
booktitle={Proceedings of the Detection and Classification of Acoustic Scenes and Events Workshop (DCASE)},
pages={35--39},
year={2019}
}
@inproceedings{gong2021ast,
title={AST: Audio Spectrogram Transformer},
author={Gong, Yuan and Chung, Yu-An and Glass, James},
booktitle={Interspeech},
year={2021}
}
This model is released under the MIT License. See LICENSE for details.
Note: This model is intended for research and educational purposes. For production deployment, please evaluate performance on your specific use case and environment.
Base model: MIT/ast-finetuned-audioset-10-10-0.4593