SSL-FT-PRON: Fine-tuned SSL Models for Automatic Pronunciation Assessment (APA)
A collection of fine-tuned Self-Supervised Learning (SSL) speech models (Wav2Vec2.0, HuBERT, WavLM) for Automatic Pronunciation Assessment (APA).
Three strategies are provided per backbone:
- CTC: ASR-style head trained with CTC
- Freeze: CNN feature extractor frozen; rest is fine-tuned
- General: no CTC head; the full encoder is fine-tuned with a regression head
Important: This Hub repository is a collection. Each model lives in a subdirectory.
Load with the full sub-path, e.g. haeylee/ssl_ft_pron/wav2vec2/general/02_wav2vec2-large-960h.
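If your installed transformers version does not accept the full sub-path as a model ID, one possible workaround is to pass the repository ID and the sub-directory separately through the `subfolder` argument of `from_pretrained`. The sketch below splits a full sub-path into those two pieces. The helper name `split_hub_path` is hypothetical and not part of this repository.

```python
from typing import Tuple


def split_hub_path(full_path: str) -> Tuple[str, str]:
    """Split '<user>/<repo>/<sub/dir>' into ('<user>/<repo>', '<sub/dir>').

    Hypothetical helper for illustration; not part of the ssl_finetuning code.
    """
    parts = full_path.strip("/").split("/")
    if len(parts) < 3:
        raise ValueError("expected '<user>/<repo>/<sub-path>'")
    return "/".join(parts[:2]), "/".join(parts[2:])


repo_id, subfolder = split_hub_path(
    "haeylee/ssl_ft_pron/wav2vec2/general/02_wav2vec2-large-960h"
)
print(repo_id)
print(subfolder)

# Assumed usage (downloads weights, so it is left commented out):
# from transformers import Wav2Vec2Model
# model = Wav2Vec2Model.from_pretrained(repo_id, subfolder=subfolder)
```

Whether this is needed depends on the library version; try the full sub-path first, as described above.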
Model Details
- Developed by: Haeyoung Lee (haeylee)
- Affiliation (paper): Seoul National University, SNU Spoken Language Processing Lab
- Model type: SSL speech encoders fine-tuned for APA (CTC / General / Freeze)
- Language(s): English (evaluated on Speechocean762)
- Finetuned from: See the base_model list in the model card metadata
Model Sources
- Code: https://github.com/hy310/ssl_finetuning
- Paper: Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment (APSIPA ASC 2024)
Uses
- Research/prototyping for pronunciation scoring and representation analysis (e.g., PCA on hidden states).
- Feature extraction for downstream APA tasks.
Bias, Risks, and Limitations
- Trained/evaluated on Speechocean762 (read English by L2 speakers). Generalization to other languages/speaking styles is not guaranteed.
- APA relies on subjective human scores; apply domain calibration and monitor subgroup performance.
- Recommendation: Validate on in-domain data; report uncertainty and subgroup metrics.
How to Get Started
Load a CTC model (with CTC head)
from transformers import AutoModelForCTC, AutoProcessor
ckpt = "haeylee/ssl_ft_pron/wav2vec2/ctc/01_wav2vec2-large"
model = AutoModelForCTC.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)
Load a General / Freeze model (no CTC head)
from transformers import AutoProcessor, Wav2Vec2Model, HubertModel, WavLMModel
# Wav2Vec2 (General)
ckpt = "haeylee/ssl_ft_pron/wav2vec2/general/01_wav2vec2-large"
model = Wav2Vec2Model.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)
# HuBERT (Freeze)
# ckpt = "haeylee/ssl_ft_pron/hubert/freeze/06_hubert-large-ll60k"
# model = HubertModel.from_pretrained(ckpt)
# processor = AutoProcessor.from_pretrained(ckpt)
# WavLM (General)
# ckpt = "haeylee/ssl_ft_pron/wavlm/general/10_wavlm-large"
# model = WavLMModel.from_pretrained(ckpt)
# processor = AutoProcessor.from_pretrained(ckpt)
Summary:
- CTC: AutoModelForCTC.from_pretrained(...)
- General/Freeze: Wav2Vec2Model/HubertModel/WavLMModel.from_pretrained(...)
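The summary above can be turned into a small lookup that picks the loader class name from a checkpoint sub-path. This is a sketch only: it assumes every checkpoint follows the `<backbone>/<strategy>/<name>` layout seen in the examples, and the function name `loader_for_subpath` is hypothetical.

```python
# Encoder class names per backbone, as listed in the summary above.
ENCODER_CLASSES = {
    "wav2vec2": "Wav2Vec2Model",
    "hubert": "HubertModel",
    "wavlm": "WavLMModel",
}


def loader_for_subpath(subpath: str) -> str:
    """Return the transformers class name to use for a checkpoint sub-path.

    Assumes the layout '<backbone>/<strategy>/<name>' (an assumption based on
    the example paths in this card).
    """
    backbone, strategy = subpath.strip("/").split("/")[:2]
    if strategy == "ctc":
        return "AutoModelForCTC"
    if strategy in ("general", "freeze"):
        return ENCODER_CLASSES[backbone]
    raise ValueError(f"unknown strategy: {strategy!r}")


print(loader_for_subpath("hubert/freeze/06_hubert-large-ll60k"))

# Assumed usage with transformers (left commented out, needs a download):
# import transformers
# cls = getattr(transformers, loader_for_subpath(subpath))
# model = cls.from_pretrained(ckpt)
```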
Training Details
Training Data
- Dataset: Speechocean762
- Preprocessing: We used preprocess_dataset.py (see the GitHub repo) to convert raw audio/labels into Hugging Face datasets format.
Expected processed layout:
/your/data/path/speechocean762/
└── preprocess/
    ├── speechocean_train_ds/
    └── speechocean_test_ds/
Training Procedure
Preprocessing
# Adjust paths inside the script or via CLI args
python preprocess_dataset.py \
  --data_root /your/data/path/speechocean762 \
  --out_dir  /your/data/path/speechocean762/preprocess
General (no CTC head)
Loads encoders with Wav2Vec2Model / HubertModel / WavLMModel via .from_pretrained(...) and trains a regression head to predict the four APA scores (Accuracy, Fluency, Prosody, Total).
python train/baseline.py \
  --model_name facebook/hubert-xlarge-ls960-ft \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
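To give a rough picture of what a regression head on top of the encoder does, here is a dependency-free sketch. It assumes mean pooling over time followed by a single linear layer with four outputs; the card does not state the pooling method or the head architecture, so both are assumptions. The numbers are made up, and the actual training script presumably uses PyTorch modules instead.

```python
from typing import List, Sequence

# Score order follows the metric list in this card.
SCORE_NAMES = ("accuracy", "fluency", "prosody", "total")


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average a (time, hidden) matrix over the time axis."""
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / n for d in range(dim)]


def linear_head(pooled: Sequence[float],
                weights: Sequence[Sequence[float]],
                bias: Sequence[float]) -> List[float]:
    """Apply y = W x + b, with W of shape (4, hidden) and b of shape (4,)."""
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weights, bias)]


# Toy input: 3 encoder frames with hidden size 2 (made-up values).
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
bias = [0.0, 0.0, 0.0, 1.0]

pooled = mean_pool(frames)
scores = dict(zip(SCORE_NAMES, linear_head(pooled, weights, bias)))
print(scores)
```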
Freeze (feature extractor frozen)
Same as General, but freezes the CNN feature extractor.
python train/freeze.py \
  --model_name facebook/hubert-xlarge-ls960-ft \
  --freeze_feature_extractor \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
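Freezing the CNN feature extractor amounts to excluding its parameters from gradient updates. The sketch below partitions parameter names into frozen and trainable groups. The `feature_extractor.` marker and the example names follow the usual naming of Hugging Face speech encoders, but they are illustrative assumptions here, not taken from train/freeze.py.

```python
from typing import Iterable, List, Tuple


def split_frozen(param_names: Iterable[str],
                 frozen_marker: str = "feature_extractor.") -> Tuple[List[str], List[str]]:
    """Partition parameter names into (frozen, trainable) by a name marker."""
    frozen: List[str] = []
    trainable: List[str] = []
    for name in param_names:
        (frozen if frozen_marker in name else trainable).append(name)
    return frozen, trainable


# Illustrative parameter names (assumed, not read from a real checkpoint).
names = [
    "feature_extractor.conv_layers.0.conv.weight",
    "feature_projection.projection.weight",
    "encoder.layers.0.attention.q_proj.weight",
]
frozen, trainable = split_frozen(names)
print("frozen:", frozen)
print("trainable:", trainable)

# With PyTorch, the same split could be applied as (commented out):
# for name, param in model.named_parameters():
#     param.requires_grad = name not in frozen
```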
CTC (ASR-style head)
Uses AutoModelForCTC.from_pretrained(...) for CTC training.
python train/ctc.py \
  --model_name facebook/wav2vec2-large \
  --batch_size 4 \
  --learning_rate 1e-5 \
  --num_train_epochs 30
Artifacts saved: model.safetensors, trainer_state.json, training_args.bin, logs, and checkpoints (per run: args.json, trainer_args.json).
Evaluation
Testing Data, Factors & Metrics
- Test set: Speechocean762 (held-out split prepared by preprocess_dataset.py)
- Factors: Backbone (Wav2Vec2 / HuBERT / WavLM) × strategy (CTC / General / Freeze)
- Metric: pearsonr (Pearson correlation coefficient, PCC) for Accuracy, Fluency, Prosody, and Total.
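For reference, PCC between predicted and human scores can be computed as below. The name pearsonr suggests scipy.stats.pearsonr, but that is an assumption; this sketch uses only the standard library so it runs anywhere. The score lists are made-up values, not results from the paper.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up human and predicted scores for one aspect (e.g. Fluency).
human = [8.0, 6.0, 9.0, 4.0]
predicted = [7.5, 6.5, 8.0, 5.0]
print(round(pearson_r(human, predicted), 4))
```

In an evaluation loop, the same function would be called once per score (Accuracy, Fluency, Prosody, Total) over the whole test set.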
Citation
@inproceedings{lee2024analysis,
  title={Analysis of Various Self-Supervised Learning Models for Automatic Pronunciation Assessment},
  author={Lee, Haeyoung and Kim, Sunhee and Chung, Minhwa},
  booktitle={2024 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
  pages={1--6},
  year={2024},
  organization={IEEE}
}
Authors & Contact
- Author: Haeyoung Lee (haeylee)
- Email: [email protected]
- Issues/Requests: https://github.com/hy310/ssl_finetuning