Spaces:
Running
on
Zero
Running
on
Zero
Audio Dataset Analysis Report
Executive Summary
Analysis of 40 open-source audio datasets for integration into the Music Generation Studio LoRA training system, considering HuggingFace Space limitations (1 GB storage).
Current Issues
- OpenSinger: Dataset ID
Rongjiehuang/opensingerdoes not exist on HuggingFace Hub - M4Singer: Dataset ID
M4Singer/M4Singernot found - Lakh MIDI: Dataset ID
roszcz/lakh-midimay not exist - Need to find verified HuggingFace dataset IDs
Recommended Datasets for Music Generation Training
Priority 1: Music & Singing (Fits 1GB limit)
GTZAN Music Genre Collection
- Size: ~1.2 GB (may need selective download)
- Content: 1,000 audio tracks across 10 music genres
- Use Case: Music style understanding, genre classification
- HF ID:
marsyas/gtzanor available on Kaggle - Recommendation: β β β β β - Perfect for music genre training
LJSpeech
- Size: ~2.6 GB
- Content: 13,100 short audio clips from single speaker
- Use Case: Voice/vocal training, prosody learning
- HF ID:
lj_speech - Recommendation: β β β β β - Good for vocal characteristics
NSynth
- Size: ~30 GB full (subset available)
- Content: 305,979 musical notes with unique pitch/timbre
- Use Case: Musical synthesis, instrument understanding
- HF ID:
google/nsynth(subset:nsynth-valid~1GB) - Recommendation: β β β β β - Excellent for music synthesis
MAESTRO (subset)
- Size: Full ~100GB, but can download specific splits
- Content: Piano performances with MIDI + audio
- Use Case: Music generation, MIDI-to-audio learning
- HF ID:
roszcz/maestro-v3 - Recommendation: β β β β β - Best for classical music training
MedleyDB (samples)
- Size: Varies by track selection
- Content: Annotated multi-track recordings
- Use Case: Instrument separation, music understanding
- HF ID: Custom download required
- Recommendation: β β β ββ - Good but requires manual setup
Priority 2: Vocal & Speech (Under 1GB)
Mozilla Common Voice (single language subset)
- Size: ~5GB per language (can use smaller languages)
- Content: Diverse speakers reading text
- Use Case: Vocal diversity, pronunciation
- HF ID:
mozilla-foundation/common_voice_11_0(specify language) - Recommendation: β β β β β - Great for vocal variation
VCTK Corpus
- Size: ~10.9 GB
- Content: 109 speakers with different accents
- Use Case: Voice diversity, accent variation
- HF ID:
vctk - Recommendation: β β β ββ - Good for voice training
CMU ARCTIC
- Size: ~3.5 GB
- Content: Multiple speakers, phonetically balanced
- Use Case: Speech synthesis, vocal training
- HF ID: Available via direct download
- Recommendation: β β β β β - High-quality vocals
Priority 3: Sound Effects & Environment (Under 1GB)
ESC-50
- Size: ~600 MB
- Content: 2,000 environmental sounds, 50 classes
- Use Case: Sound effects understanding
- HF ID:
ashraq/esc50 - Recommendation: β β β ββ - Good for ambient sounds
UrbanSound8K
- Size: ~6 GB
- Content: 8,732 urban sound excerpts
- Use Case: Environmental sound classification
- HF ID:
danavery/urbansound8k - Recommendation: β β β ββ - Urban ambient training
Verified HuggingFace Datasets for Immediate Use
Music Datasets
# GTZAN - Music Genre Classification
"marsyas/gtzan" # 1000 tracks, 10 genres
# NSynth - Musical Notes
"google/nsynth" # Use "nsynth-valid" split for smaller size
# MAESTRO - Piano performances
"roszcz/maestro-v3" # Download specific splits
Vocal Datasets
# LJSpeech - Single speaker
"lj_speech" # 13,100 clips
# Common Voice - Multilingual
"mozilla-foundation/common_voice_11_0" # Specify language
# LibriSpeech - English audiobooks (smaller subsets)
"librispeech_asr" # Use "clean" subsets only
Sound Effects
# ESC-50 - Environmental sounds
"ashraq/esc50" # 2000 samples, 50 classes
# FSD50K - Freesound Dataset
"Fhrozen/FSD50k" # Larger but comprehensive
Storage-Optimized Recommendations
For 1GB HuggingFace Space:
Best Combination (fits in 1GB):
- GTZAN subset (~300 MB) - 300 songs across all genres
- ESC-50 (~600 MB) - Environmental sounds
- LJSpeech subset (~100 MB) - 1000 clips for vocals
Alternative Combination:
- NSynth-valid (~800 MB) - Musical notes and synthesis
- Speech Commands (~200 MB) - Short vocal clips
Implementation Strategy
Phase 1: Quick Wins (Immediate)
- Replace broken dataset IDs with verified ones
- Implement GTZAN (marsyas/gtzan)
- Implement ESC-50 (ashraq/esc50)
- Add download size estimation before download
Phase 2: Smart Downloads (Next)
- Add dataset size checking
- Implement partial download (specific splits)
- Add storage quota monitoring
- Cache management for 1GB limit
Phase 3: Advanced Features
- Dataset preview/sampling before full download
- Automatic cleanup of old datasets
- Compression support
- Streaming data loading (no full download)
Updated Dataset Configuration
DATASETS = {
# Music Datasets (Verified)
"gtzan": {
"name": "GTZAN Music Genre (1000 tracks)",
"hf_id": "marsyas/gtzan",
"type": "music",
"size_gb": 1.2,
"description": "1000 songs across 10 genres for style learning"
},
"nsynth_valid": {
"name": "NSynth Validation Set (Musical Notes)",
"hf_id": "google/nsynth",
"split": "valid",
"type": "music",
"size_gb": 0.8,
"description": "Musical notes with unique pitch and timbre"
},
"maestro_small": {
"name": "MAESTRO Piano (Small subset)",
"hf_id": "roszcz/maestro-v3",
"split": "validation",
"type": "music",
"size_gb": 2.0,
"description": "Classical piano performances"
},
# Vocal Datasets (Verified)
"ljspeech": {
"name": "LJSpeech (13k vocal clips)",
"hf_id": "lj_speech",
"type": "vocal",
"size_gb": 2.6,
"description": "Single speaker for vocal characteristics"
},
"common_voice_en": {
"name": "Common Voice English (subset)",
"hf_id": "mozilla-foundation/common_voice_11_0",
"language": "en",
"type": "vocal",
"size_gb": 5.0,
"description": "Diverse English speakers"
},
# Sound Effects (Verified)
"esc50": {
"name": "ESC-50 Environmental Sounds",
"hf_id": "ashraq/esc50",
"type": "sound_effects",
"size_gb": 0.6,
"description": "2000 environmental sounds, 50 classes"
},
# Speech Commands (Verified)
"speech_commands": {
"name": "Google Speech Commands",
"hf_id": "speech_commands",
"type": "vocal",
"size_gb": 2.0,
"description": "Short spoken words for vocal training"
}
}
Conclusion
Immediate Actions:
- β Remove non-existent dataset IDs
- β Add verified HuggingFace datasets
- β Implement size checking before download
- β Add storage quota warnings
- β Focus on datasets under 1GB
Best Datasets for 1GB Limit:
- GTZAN (music genres)
- ESC-50 (sound effects)
- NSynth-valid (musical synthesis)
Total Storage Strategy:
- Max 1GB limit enforced
- Download size preview
- Selective split downloads
- Auto-cleanup old data