Model Card for ASR (CTC-based ASR on English)
This repository contains an end‑to‑end Automatic Speech Recognition (ASR) pipeline built around Hugging Face Transformers. The default configuration fine‑tunes facebook/wav2vec2-base-960h with a CTC head on a 50k‑sample subset of Common Voice 17.0 (English) and provides scripts to train, evaluate, export to ONNX, and deploy on AWS SageMaker. It also includes a robust audio loading stack (FFmpeg preferred, with fallbacks) and utilities for text normalization and evaluation (WER/CER).
Model Details
Model Description
- Developed by: Amirhossein Yousefi (GitHub: @amirhossein-yousefi)
- Funded by: Not specified
- Shared by: Amirhossein Yousefi
- Model type: CTC-based ASR using Transformers (Wav2Vec2ForCTC)
- Language(s) (NLP): English (en)
- License: Base model is Apache-2.0; the repository/fine-tuned weights license is not explicitly stated here (treat as "other" until clarified)
- Finetuned from model: facebook/wav2vec2-base-960h
The training/evaluation pipeline uses Hugging Face transformers, datasets, and jiwer, and includes scripts for inference and SageMaker deployment.
Model Sources
- Repository: https://github.com/amirhossein-yousefi/ASR
 - Paper : Baevski et al., “wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations” (arXiv:2006.11477)
 - Demo : N/A (local CLI and SageMaker examples included)
 
Uses
Direct Use
- General‑purpose English speech transcription for short to moderate audio segments (default duration filter: ~1–18 seconds).
- Local batch transcription via CLI or Python, or real‑time deployment via AWS SageMaker (JSON base64 or raw WAV content types); see the endpoint-invocation sketch after this list.
 
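For the SageMaker path, the sketch below is one way to invoke an already-deployed endpoint. The endpoint name asr-endpoint and the JSON key audio_b64 are hypothetical; the actual payload schema is defined by the repository's inference handler and may differ.

```python
# Minimal invocation sketch, assuming a deployed endpoint named "asr-endpoint"
# and a JSON payload with base64-encoded WAV bytes under "audio_b64" (both are
# assumptions; check the repository's inference handler for the real schema).
import base64
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

with open("path/to/file.wav", "rb") as f:
    payload = {"audio_b64": base64.b64encode(f.read()).decode("utf-8")}

response = runtime.invoke_endpoint(
    EndpointName="asr-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))  # transcription returned by the handler
```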
Downstream Use
- Domain adaptation / further fine‑tuning on task‑ or accent‑specific datasets.
- Export to ONNX for CPU‑friendly inference and integration in production applications; a minimal onnxruntime sketch follows this list.
 
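As a rough illustration of the ONNX path, the sketch below runs CPU inference with onnxruntime. The model path ./outputs/asr/model.onnx and the single-input layout are assumptions; the export produced by src/export_onnx.py may differ.

```python
# CPU inference sketch against an exported ONNX model (path and input/output
# layout are assumptions, not the exact output of src/export_onnx.py).
import numpy as np
import onnxruntime as ort
import torchaudio
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("./outputs/asr")
session = ort.InferenceSession("./outputs/asr/model.onnx", providers=["CPUExecutionProvider"])

wav, sr = torchaudio.load("path/to/file.wav")
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)

inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="np")
input_name = session.get_inputs()[0].name  # avoid hard-coding the input name
logits = session.run(None, {input_name: inputs["input_values"].astype(np.float32)})[0]
pred_ids = np.argmax(logits, axis=-1)
print(processor.batch_decode(pred_ids)[0])
```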
Out-of-Scope Use
- Speaker diarization, punctuation restoration, and true streaming ASR are not included.
 - Multilingual or code‑switched speech without additional fine‑tuning.
- Very long files without chunking (a naive chunking sketch follows this list); heavy background noise without augmentation/tuning.
 
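For long recordings, a naive workaround is fixed-window chunking, sketched below. This is not part of the repository; model, processor, and device are as in the "How to Get Started" snippet, and overlapping windows with smarter merging would reduce boundary errors.

```python
# Naive chunked transcription sketch (not part of the repository).
# `wav_1d` is a mono torch tensor already resampled to 16 kHz;
# `model`, `processor`, and `device` are set up as in the inference example.
import torch

def transcribe_long(wav_1d, sr, model, processor, device, window_s=15.0):
    window = int(window_s * sr)
    texts = []
    for start in range(0, wav_1d.shape[0], window):
        chunk = wav_1d[start:start + window]
        inputs = processor(chunk.numpy(), sampling_rate=sr, return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values.to(device)).logits
        pred_ids = torch.argmax(logits, dim=-1)
        texts.append(processor.batch_decode(pred_ids.cpu().numpy())[0])
    return " ".join(texts)
```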
Bias, Risks, and Limitations
- The default fine‑tuning dataset (Common Voice 17.0, English) can reflect collection biases (microphone quality, accents, demographics). Accuracy may degrade on out‑of‑domain audio (e.g., telephony, medical terms).
 - Transcriptions may contain mistakes and can include sensitive/PII if present in audio; handle outputs responsibly.
 
Recommendations
- Always evaluate WER/CER on your own held‑out data. Consider adding punctuation/casing restoration models and domain vocabularies as needed.
 - For regulated contexts, incorporate a human‑in‑the‑loop review and data governance.
 
How to Get Started with the Model
Python (local inference):
import torch, torchaudio
from transformers import AutoModelForCTC, AutoProcessor
model_dir = "./outputs/asr"  # or a Hugging Face hub id
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_dir)
model = AutoModelForCTC.from_pretrained(model_dir).to(device).eval()
wav, sr = torchaudio.load("path/to/file.wav")
target_sr = processor.feature_extractor.sampling_rate
if sr != target_sr:
    wav = torchaudio.functional.resample(wav, sr, target_sr)
inputs = processor(wav.squeeze(0).numpy(), sampling_rate=target_sr, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**{k: v.to(device) for k, v in inputs.items()}).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids.cpu().numpy())[0])
CLI (example):
python src/infer.py --model_dir ./outputs/asr --audio path/to/file.wav
Training Details
Training Data
- Dataset: Common Voice 17.0 (English); text column: sentence
- Duration filter: min ~1.0 s, max ~18.0 s
- Notes: case‑aware normalization and whitelist filtering to match the tokenizer vocabulary (see the sketch after this list); optional waveform augmentations.
 
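The whitelist filtering mentioned above can be pictured roughly as below: keep only examples whose characters appear in the CTC tokenizer's character vocabulary. This is a sketch under that assumption, not the repository's exact normalization logic.

```python
# Sketch of vocabulary-whitelist filtering, assuming a character-level CTC
# tokenizer such as facebook/wav2vec2-base-960h (uppercase vocabulary);
# the repository's actual normalization rules may differ.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("facebook/wav2vec2-base-960h")
vocab = processor.tokenizer.get_vocab()
# Keep single-character tokens; "|" is the word delimiter, so allow plain spaces too.
allowed = {tok for tok in vocab if len(tok) == 1} | {" "}

def keep_example(sentence: str) -> bool:
    """Drop examples containing characters outside the tokenizer vocabulary."""
    return all(ch in allowed for ch in sentence.upper())

print(keep_example("hello world"))  # True
print(keep_example("café crème"))   # False: accented characters are not in the vocab
```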
Training Procedure
Preprocessing [optional]
- Robust audio decoding (FFmpeg preferred on Windows; fallback to torchaudio/soundfile/librosa), with resampling to 16 kHz as required by Wav2Vec2; a fallback-loader sketch follows this list.
- Tokenization via the model's processor; dynamic padding with a CTC data collator.
 
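A rough sketch of such a decode-with-fallbacks loader is shown below (the FFmpeg path is omitted for brevity); the repository's implementation may differ in backend order and details.

```python
# Decode-with-fallbacks sketch: try torchaudio, then soundfile, then librosa.
import numpy as np

def load_audio(path: str, target_sr: int = 16_000) -> np.ndarray:
    """Return a mono float32 waveform at target_sr, trying several backends."""
    try:
        import torchaudio
        wav, sr = torchaudio.load(path)
        wav = torchaudio.functional.resample(wav, sr, target_sr)
        return wav.mean(dim=0).numpy()  # downmix to mono
    except Exception:
        pass
    try:
        import librosa
        import soundfile as sf
        wav, sr = sf.read(path, dtype="float32", always_2d=True)
        return librosa.resample(wav.mean(axis=1), orig_sr=sr, target_sr=target_sr)
    except Exception:
        import librosa
        wav, _ = librosa.load(path, sr=target_sr, mono=True)
        return wav
```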
Training Hyperparameters
- Epochs: 3
 - Per‑device batch size: 8 (× 8 grad accumulation → effective 64)
 - Learning rate: 3e‑5
 - Warmup ratio: 0.05
- Optimizer: adamw_torch_fused
- Weight decay: 0.0
 - Precision: FP16
 - Max grad norm: 1.0
 - Logging: every 50 steps; Eval/Save: every 500 steps; keep last 2 checkpoints; early stopping patience = 3
 - Seed: 42
 
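For orientation, the hyperparameters above map roughly onto the TrainingArguments sketch below; the repository's training script may name or group them differently, and some argument names vary across transformers versions (e.g., eval_strategy vs. evaluation_strategy).

```python
# TrainingArguments mirroring the listed hyperparameters (a sketch, not the
# repository's exact configuration; "wer" as the best-model metric assumes a
# compute_metrics function that reports it).
from transformers import EarlyStoppingCallback, TrainingArguments

training_args = TrainingArguments(
    output_dir="./outputs/asr",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,   # effective batch size 64
    learning_rate=3e-5,
    warmup_ratio=0.05,
    optim="adamw_torch_fused",
    weight_decay=0.0,
    fp16=True,
    max_grad_norm=1.0,
    logging_steps=50,
    eval_strategy="steps",           # "evaluation_strategy" on older transformers
    eval_steps=500,
    save_steps=500,
    save_total_limit=2,              # keep last 2 checkpoints
    load_best_model_at_end=True,     # required for early stopping
    metric_for_best_model="wer",
    greater_is_better=False,
    seed=42,
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)  # pass via Trainer(callbacks=[...])
```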
Speeds, Sizes, Times [optional]
- Total FLOPs (training): 10,814,747,992,293,114,000
 - Training runtime: ~11,168 s for 2,346 steps
- Logs: TensorBoard at src/output/logs (or a similar path, as configured)
Evaluation
Testing Data, Factors & Metrics
- Metrics: WER (primary) and CER (auxiliary), computed with jiwer utilities (see the snippet after this list).
- Factors: English speech across Common Voice 17.0 splits; performance varies by accent, recording conditions, and utterance length.
 
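A minimal jiwer check on your own references and hypotheses looks like the snippet below; normalize both sides the same way the training pipeline does to get comparable numbers.

```python
# Quick WER/CER computation with jiwer on toy data.
import jiwer

references = ["the cat sat on the mat", "hello world"]
hypotheses = ["the cat sat on mat", "hello word"]

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```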
Results
- Training logs include loss, eval WER, and eval CER curves; see the assets/ directory for plots.
Summary
- Baseline WER/CER are logged per‑eval; users should report domain‑specific results on their own datasets.
 
Model Examination
- Greedy decoding by default; beam search/LM fusion is not included in this repository. For error analysis, inspect the per‑frame logits and alignments (a small sketch follows).
 
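One coarse way to do that is shown below: per-frame argmax tokens with approximate timestamps, assuming logits and processor from the "How to Get Started" snippet and the typical ~20 ms frame stride of wav2vec2-base at 16 kHz.

```python
# Frame-level inspection sketch: print non-blank argmax tokens with rough
# timestamps (`logits` and `processor` come from the inference example above).
import torch

frame_ids = torch.argmax(logits, dim=-1)[0]     # shape: (num_frames,)
frame_stride_s = 0.02                           # ~20 ms per frame (approximate)
for i, token_id in enumerate(frame_ids.tolist()):
    token = processor.tokenizer.convert_ids_to_tokens(token_id)
    if token != processor.tokenizer.pad_token:  # the pad token acts as the CTC blank
        print(f"{i * frame_stride_s:6.2f}s  {token}")
```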
Environmental Impact
- Hardware Type: Laptop (Windows)
 - GPU: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM), Driver 576.52
 - CUDA / PyTorch: CUDA 12.9, PyTorch 2.8.0+cu129
 - Hours used: ~3.1 h (approx.)
 - Cloud Provider: N/A for local; AWS SageMaker utilities available for cloud training/deployment
 - Compute Region: N/A (local)
 - Carbon Emitted: Not calculated; estimate with the MLCO2 calculator
 
Technical Specifications
Model Architecture and Objective
- Architecture: Wav2Vec2 encoder with CTC output layer
 - Objective: Character‑level CTC loss for ASR
 
Compute Infrastructure
Hardware
- Local GPU as above, or AWS instance types via the SageMaker scripts (e.g., ml.g4dn.xlarge).
Software
- Python 3.10+
- Key dependencies: transformers, datasets, torch, torchaudio, soundfile, librosa, jiwer, onnxruntime (for ONNX testing), and boto3/sagemaker for deployment.
Citation
BibTeX:
@article{baevski2020wav2vec,
  title={wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations},
  author={Baevski, Alexei and Zhou, Henry and Mohamed, Abdelrahman and Auli, Michael},
  journal={arXiv preprint arXiv:2006.11477},
  year={2020}
}
APA: Baevski, A., Zhou, H., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self‑supervised learning of speech representations. arXiv:2006.11477.
Glossary
- WER: Word Error Rate; lower is better.
 - CER: Character Error Rate; lower is better.
 - CTC: Connectionist Temporal Classification, an alignment‑free loss for sequence labeling.
 
More Information
- ONNX export: src/export_onnx.py
- AWS SageMaker: scripts in sagemaker/ for training, deployment, and autoscaling.
- Training/metrics plots: see assets/ (e.g., train_loss.svg, eval_wer.svg, eval_cer.svg).
Model Card Authors
- Amirhossein Yousefi (repo author)
 
Model Card Contact
- Open an issue on the GitHub repository: https://github.com/amirhossein-yousefi/ASR
 