# Audio Dataset Analysis Report

## Executive Summary
This report analyzes 40 open-source audio datasets for integration into the Music Generation Studio LoRA training system, given the 1 GB storage limit of a HuggingFace Space.

## Current Issues
- **OpenSinger**: Dataset ID `Rongjiehuang/opensinger` does not exist on the HuggingFace Hub
- **M4Singer**: Dataset ID `M4Singer/M4Singer` was not found on the Hub
- **Lakh MIDI**: Dataset ID `roszcz/lakh-midi` has not been confirmed to exist
- These entries need to be replaced with verified HuggingFace dataset IDs
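
The ID checks above could be automated before any download is attempted. The following is a minimal sketch, assuming the `huggingface_hub` package is installed; `hub_dataset_exists` and `find_missing` are hypothetical helper names, not part of the existing system. Note that private or gated repositories may also be reported as not found, so a miss is a prompt for manual review rather than proof that the ID is wrong.

```python
from typing import Callable, Iterable


def hub_dataset_exists(repo_id: str) -> bool:
    """Return True if the Hub reports a dataset repository with this ID."""
    # Imported lazily so find_missing() can be used with another checker
    # when huggingface_hub is not installed.
    from huggingface_hub import HfApi
    from huggingface_hub.utils import RepositoryNotFoundError

    try:
        HfApi().dataset_info(repo_id)
        return True
    except RepositoryNotFoundError:
        return False


def find_missing(
    ids: Iterable[str],
    exists: Callable[[str], bool] = hub_dataset_exists,
) -> list[str]:
    """Return the IDs for which the checker reports no dataset."""
    return [repo_id for repo_id in ids if not exists(repo_id)]


if __name__ == "__main__":
    # Requires network access; the IDs are the ones listed in this report.
    candidates = ["Rongjiehuang/opensinger", "M4Singer/M4Singer", "roszcz/lakh-midi"]
    print(find_missing(candidates))
```

Passing the checker as a parameter keeps the lookup logic testable without network access.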

## Recommended Datasets for Music Generation Training

### Priority 1: Music & Singing (subsets required to fit the 1 GB limit)

1. **GTZAN Music Genre Collection**
   - **Size**: ~1.2 GB (exceeds the 1 GB limit; selective download needed)
   - **Content**: 1,000 audio tracks across 10 music genres
   - **Use Case**: Music style understanding, genre classification
   - **HF ID**: `marsyas/gtzan` (also available on Kaggle)
   - **Recommendation**: ★★★★★ - Perfect for music genre training

2. **LJSpeech**
   - **Size**: ~2.6 GB
   - **Content**: 13,100 short audio clips from single speaker
   - **Use Case**: Voice/vocal training, prosody learning
   - **HF ID**: `lj_speech`
   - **Recommendation**: ★★★★☆ - Good for vocal characteristics

3. **NSynth**
   - **Size**: ~30 GB full (subset available)
   - **Content**: 305,979 musical notes with unique pitch/timbre
   - **Use Case**: Musical synthesis, instrument understanding
   - **HF ID**: `google/nsynth` (subset: `nsynth-valid`, roughly 0.8 to 1 GB)
   - **Recommendation**: ★★★★★ - Excellent for music synthesis

4. **MAESTRO (subset)**
   - **Size**: ~100 GB full; specific splits can be downloaded separately
   - **Content**: Piano performances with MIDI + audio
   - **Use Case**: Music generation, MIDI-to-audio learning
   - **HF ID**: `roszcz/maestro-v3`
   - **Recommendation**: ★★★★★ - Best for classical music training

5. **MedleyDB (samples)**
   - **Size**: Varies by track selection
   - **Content**: Annotated multi-track recordings
   - **Use Case**: Instrument separation, music understanding
   - **HF ID**: None (custom download required)
   - **Recommendation**: ★★★☆☆ - Good but requires manual setup

### Priority 2: Vocal & Speech (subsets required to fit the 1 GB limit)

6. **Mozilla Common Voice (single language subset)**
   - **Size**: ~5 GB per language (varies; languages with less recorded data are smaller)
   - **Content**: Diverse speakers reading text
   - **Use Case**: Vocal diversity, pronunciation
   - **HF ID**: `mozilla-foundation/common_voice_11_0` (specify language)
   - **Recommendation**: ★★★★☆ - Great for vocal variation

7. **VCTK Corpus**
   - **Size**: ~10.9 GB
   - **Content**: 109 speakers with different accents
   - **Use Case**: Voice diversity, accent variation
   - **HF ID**: `vctk`
   - **Recommendation**: ★★★☆☆ - Good for voice training

8. **CMU ARCTIC**
   - **Size**: ~3.5 GB
   - **Content**: Multiple speakers, phonetically balanced
   - **Use Case**: Speech synthesis, vocal training
   - **HF ID**: None (available via direct download)
   - **Recommendation**: ★★★★☆ - High-quality vocals

### Priority 3: Sound Effects & Environment (only ESC-50 fits the 1 GB limit in full)

9. **ESC-50**
   - **Size**: ~600 MB
   - **Content**: 2,000 environmental sounds, 50 classes
   - **Use Case**: Sound effects understanding
   - **HF ID**: `ashraq/esc50`
   - **Recommendation**: ★★★☆☆ - Good for ambient sounds

10. **UrbanSound8K**
    - **Size**: ~6 GB
    - **Content**: 8,732 urban sound excerpts
    - **Use Case**: Environmental sound classification
    - **HF ID**: `danavery/urbansound8k`
    - **Recommendation**: ★★★☆☆ - Urban ambient training

## Verified HuggingFace Datasets for Immediate Use

### Music Datasets
```python
# GTZAN - Music Genre Classification
"marsyas/gtzan"  # 1000 tracks, 10 genres

# NSynth - Musical Notes
"google/nsynth"  # Use "nsynth-valid" split for smaller size

# MAESTRO - Piano performances
"roszcz/maestro-v3"  # Download specific splits
```

### Vocal Datasets
```python
# LJSpeech - Single speaker
"lj_speech"  # 13,100 clips

# Common Voice - Multilingual
"mozilla-foundation/common_voice_11_0"  # Specify language

# LibriSpeech - English audiobooks (smaller subsets)
"librispeech_asr"  # Use "clean" subsets only
```

### Sound Effects
```python
# ESC-50 - Environmental sounds
"ashraq/esc50"  # 2000 samples, 50 classes

# FSD50K - Freesound Dataset
"Fhrozen/FSD50k"  # Larger but comprehensive
```

## Storage-Optimized Recommendations

### For a 1 GB HuggingFace Space

**Best Combination (~1,000 MB total, right at the 1 GB limit):**
1. **GTZAN subset** (~300 MB) - about 250 tracks spread across all genres (at ~1.2 MB per track)
2. **ESC-50** (~600 MB) - Environmental sounds
3. **LJSpeech subset** (~100 MB) - about 500 clips for vocals (at ~0.2 MB per clip)

**Alternative Combination (~1,000 MB total):**
1. **NSynth-valid** (~800 MB) - Musical notes and synthesis
2. **Speech Commands subset** (~200 MB of the ~2 GB full set) - Short vocal clips
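
The arithmetic behind these combinations is worth making explicit, because both add up to the full limit and leave nothing for anything else the Space stores. The sketch below sums the sizes listed above; treating 1 GB as 1000 MB and the helper name `headroom_mb` are assumptions of this example.

```python
SPACE_LIMIT_MB = 1000  # assumption: 1 GB counted as 1000 MB


def headroom_mb(sizes_mb: dict[str, int], limit_mb: int = SPACE_LIMIT_MB) -> int:
    """Return the space left after storing every listed dataset (negative if over)."""
    return limit_mb - sum(sizes_mb.values())


# Sizes in MB, copied from the two combinations in this section.
best = {"GTZAN subset": 300, "ESC-50": 600, "LJSpeech subset": 100}
alternative = {"NSynth-valid": 800, "Speech Commands": 200}

for label, combo in (("best", best), ("alternative", alternative)):
    print(f"{label}: {sum(combo.values())} MB used, {headroom_mb(combo)} MB free")
```

A practical deployment would probably want to shrink one of the subsets so that some headroom remains.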

## Implementation Strategy

### Phase 1: Quick Wins (Immediate)
- Replace broken dataset IDs with verified ones
- Implement GTZAN (`marsyas/gtzan`)
- Implement ESC-50 (`ashraq/esc50`)
- Add download size estimation before download
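
Download size estimation could be sketched as follows, assuming the `datasets` library is installed. `load_dataset_builder` reads dataset metadata without fetching the audio, but `download_size` is not populated for every dataset, so the sketch treats an unknown size as a reason to refuse. The helper names `estimated_download_gb` and `should_download` are hypothetical.

```python
from typing import Optional


def estimated_download_gb(hf_id: str, config_name: Optional[str] = None) -> Optional[float]:
    """Return the download size reported in the dataset metadata, in GB, if any."""
    # Imported lazily; this call needs network access and the datasets package.
    from datasets import load_dataset_builder

    builder = load_dataset_builder(hf_id, config_name)
    size_bytes = builder.info.download_size  # may be None when not recorded
    return None if size_bytes is None else size_bytes / 1e9


def should_download(size_gb: Optional[float], limit_gb: float = 1.0) -> bool:
    """Allow a download only when the size is known and within the limit."""
    return size_gb is not None and size_gb <= limit_gb


if __name__ == "__main__":
    size = estimated_download_gb("ashraq/esc50")
    print(size, should_download(size))
```

Refusing on an unknown size is a conservative choice; the alternative is to fall back to the `size_gb` value stored in the dataset configuration.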

### Phase 2: Smart Downloads (Next)
- Add dataset size checking
- Implement partial download (specific splits)
- Add storage quota monitoring
- Add cache management for the 1 GB limit
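
Quota monitoring and cache management can both be built on a directory size measurement. The sketch below is one possible approach, using only the standard library; it evicts files by oldest modification time, which is an assumption of this example rather than a decided policy, and the function names are hypothetical.

```python
from pathlib import Path


def dir_size_bytes(root: Path) -> int:
    """Return the total size of all regular files under root."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def evict_oldest(root: Path, limit_bytes: int) -> list[Path]:
    """Delete the least recently modified files until root fits in limit_bytes."""
    files = sorted(
        (p for p in root.rglob("*") if p.is_file()),
        key=lambda p: p.stat().st_mtime,
    )
    used = sum(p.stat().st_size for p in files)
    removed = []
    for path in files:
        if used <= limit_bytes:
            break
        used -= path.stat().st_size
        path.unlink()
        removed.append(path)
    return removed
```

Evicting whole dataset directories instead of individual files would avoid leaving a dataset half deleted; the per-file version is shown only because it is shorter.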

### Phase 3: Advanced Features
- Dataset preview/sampling before full download
- Automatic cleanup of old datasets
- Compression support
- Streaming data loading (no full download)
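
Preview and streaming can share one mechanism: with `streaming=True`, the `datasets` library returns an iterable that fetches examples on demand instead of downloading the whole dataset first. The following is a minimal sketch under that assumption; `take` and `stream_preview` are hypothetical helper names, and decoding the audio column may need additional audio packages installed.

```python
from itertools import islice
from typing import Any, Iterable


def take(examples: Iterable[Any], n: int) -> list:
    """Return the first n items without consuming the rest of the iterable."""
    return list(islice(examples, n))


def stream_preview(hf_id: str, split: str = "train", n: int = 5) -> list:
    """Fetch the first n examples of a split without a full download."""
    # Imported lazily; this call needs network access and the datasets package.
    from datasets import load_dataset

    stream = load_dataset(hf_id, split=split, streaming=True)
    return take(stream, n)


if __name__ == "__main__":
    for example in stream_preview("ashraq/esc50", n=3):
        print(sorted(example.keys()))
```

Because nothing is written to the dataset cache in bulk, streaming also sidesteps most of the 1 GB storage concern during training, at the cost of repeated network reads.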

## Updated Dataset Configuration

```python
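DATASETS = {
    # size_gb is the approximate download size of the listed dataset or split.
    # Entries above 1.0 exceed the 1 GB Space limit and need a subset or streaming.

    # Music Datasets (Verified)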
    "gtzan": {
        "name": "GTZAN Music Genre (1000 tracks)",
        "hf_id": "marsyas/gtzan",
        "type": "music",
        "size_gb": 1.2,
        "description": "1000 songs across 10 genres for style learning"
    },
    "nsynth_valid": {
        "name": "NSynth Validation Set (Musical Notes)",
        "hf_id": "google/nsynth",
        "split": "valid",
        "type": "music",
        "size_gb": 0.8,
        "description": "Musical notes with unique pitch and timbre"
    },
    "maestro_small": {
        "name": "MAESTRO Piano (Small subset)",
        "hf_id": "roszcz/maestro-v3",
        "split": "validation",
        "type": "music",
        "size_gb": 2.0,
        "description": "Classical piano performances"
    },
    
    # Vocal Datasets (Verified)
    "ljspeech": {
        "name": "LJSpeech (13k vocal clips)",
        "hf_id": "lj_speech",
        "type": "vocal",
        "size_gb": 2.6,
        "description": "Single speaker for vocal characteristics"
    },
    "common_voice_en": {
        "name": "Common Voice English (subset)",
        "hf_id": "mozilla-foundation/common_voice_11_0",
        "language": "en",
        "type": "vocal",
        "size_gb": 5.0,
        "description": "Diverse English speakers"
    },
    
    # Sound Effects (Verified)
    "esc50": {
        "name": "ESC-50 Environmental Sounds",
        "hf_id": "ashraq/esc50",
        "type": "sound_effects",
        "size_gb": 0.6,
        "description": "2000 environmental sounds, 50 classes"
    },
    
    # Speech Commands (Verified)
    "speech_commands": {
        "name": "Google Speech Commands",
        "hf_id": "speech_commands",
        "type": "vocal",
        "size_gb": 2.0,
        "description": "Short spoken words for vocal training"
    }
}
```
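
Going by the `size_gb` values in this configuration, only `esc50` and `nsynth_valid` fit under 1 GB as listed; every other entry needs a subset. A small filter makes that visible in code. This is a sketch: `within_limit` is a hypothetical helper, and `sample` reproduces only the `size_gb` field of four entries so the block stays self-contained.

```python
def within_limit(datasets: dict[str, dict], limit_gb: float = 1.0) -> list[str]:
    """Return the keys of datasets whose listed size fits the limit, sorted by name."""
    return sorted(
        key for key, cfg in datasets.items() if cfg["size_gb"] <= limit_gb
    )


# Trimmed copy of the configuration above, keeping only the size field.
sample = {
    "gtzan": {"size_gb": 1.2},
    "nsynth_valid": {"size_gb": 0.8},
    "ljspeech": {"size_gb": 2.6},
    "esc50": {"size_gb": 0.6},
}

print(within_limit(sample))
```

The same function can be applied to the full `DATASETS` dictionary, since it reads only the `size_gb` key.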

## Conclusion

**Immediate Actions:**
1. ✅ Remove non-existent dataset IDs
2. ✅ Add verified HuggingFace datasets
3. ✅ Implement size checking before download
4. ✅ Add storage quota warnings
5. ✅ Focus on datasets under 1GB

**Best Datasets for the 1 GB Limit:**
- **GTZAN subset** (music genres; the full set is ~1.2 GB)
- **ESC-50** (sound effects)
- **NSynth-valid** (musical synthesis)

**Total Storage Strategy:**
- Enforce a 1 GB maximum
- Preview download size before fetching
- Download selected splits only
- Automatically clean up old data