# wav2vec2-large-xls-r-300m-Urdu
This model is a fine-tuned version of facebook/wav2vec2-xls-r-300m on the common_voice dataset.
It achieves the following results on the evaluation set:
- Loss: 0.9889
- Wer: 0.5607
- Cer: 0.2370
## Evaluation Commands
To evaluate on `mozilla-foundation/common_voice_8_0` with split `test`:

```bash
python eval.py --model_id kingabzpro/wav2vec2-large-xls-r-300m-Urdu --dataset mozilla-foundation/common_voice_8_0 --config ur --split test
```
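For orientation, the sketch below shows roughly what such an evaluation computes: greedy CTC decoding plus corpus-level WER/CER via the `evaluate` library. This is not the repo's `eval.py` (which may also normalize text and punctuation before scoring); it is a minimal stand-alone approximation.

```python
import torch
from datasets import load_dataset, Audio
from evaluate import load as load_metric
from transformers import AutoProcessor, AutoModelForCTC

model_id = "kingabzpro/wav2vec2-large-xls-r-300m-Urdu"
proc = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCTC.from_pretrained(model_id).eval()

ds = load_dataset("mozilla-foundation/common_voice_8_0", "ur", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

wer, cer = load_metric("wer"), load_metric("cer")
preds, refs = [], []
for ex in ds:
    inputs = proc(ex["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        ids = model(inputs.input_values).logits.argmax(dim=-1)
    preds.append(proc.batch_decode(ids)[0])  # greedy (no-LM) transcription
    refs.append(ex["sentence"])              # Common Voice reference text

print("WER:", wer.compute(predictions=preds, references=refs))
print("CER:", cer.compute(predictions=preds, references=refs))
```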
## Inference With LM
```python
import json

import torch
from datasets import load_dataset, Audio
from huggingface_hub import hf_hub_download
from pyctcdecode import build_ctcdecoder
from transformers import AutoProcessor, AutoModelForCTC

mid = "kingabzpro/wav2vec2-large-xls-r-300m-Urdu"
proc = AutoProcessor.from_pretrained(mid)
model = AutoModelForCTC.from_pretrained(mid).eval().to(
    "cuda" if torch.cuda.is_available() else "cpu"
)

# Fetch the KenLM 5-gram, its unigram list, and (optionally) stored decoder
# weights from the model repo.
kenlm = hf_hub_download(mid, "language_model/5gram.bin")
uni = hf_hub_download(mid, "language_model/unigrams.txt")
try:
    with open(hf_hub_download(mid, "language_model/attrs.json"), encoding="utf-8") as f:
        attrs = json.load(f)
except Exception:
    attrs = {}

# Build the label list in vocabulary-index order. pyctcdecode expects the CTC
# blank as "" and the word delimiter as " ", so keep only those two plus
# single-character tokens, and remember which logit columns they map to.
v = proc.tokenizer.get_vocab()
id2tok = [t for t, i in sorted(v.items(), key=lambda x: x[1])]
blank = proc.tokenizer.pad_token_id
wdt = proc.tokenizer.word_delimiter_token
keep, labels = zip(*[
    (i, "" if i == blank else " " if t == wdt else t)
    for i, t in enumerate(id2tok)
    if i == blank or t == wdt or len(t) == 1
])

dec = build_ctcdecoder(
    list(labels),
    kenlm_model_path=kenlm,
    unigrams=open(uni, encoding="utf-8").read().splitlines(),
    alpha=attrs.get("alpha", 0.5),  # LM weight
    beta=attrs.get("beta", 1.0),    # word-insertion bonus
)

# Decode one test example from Common Voice 8.
ds = load_dataset("mozilla-foundation/common_voice_8_0", "ur", split="test", streaming=True)
ex = next(iter(ds.cast_column("audio", Audio(sampling_rate=16_000))))
x = proc(ex["audio"]["array"], sampling_rate=16_000, return_tensors="pt").input_values.to(model.device)
with torch.no_grad():
    logits = model(x).logits[0].cpu().numpy()[:, list(keep)]
print(dec.decode(logits))
```
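For a quick no-LM comparison (the "Without LM" column in the table at the end of this card), a plain greedy argmax decode needs only the processor. A minimal sketch, reusing `model`, `proc`, and `x` from the snippet above:

```python
# Greedy (argmax) CTC decoding without the language model, reusing
# `model`, `proc`, and `x` defined in the snippet above.
with torch.no_grad():
    pred_ids = model(x).logits.argmax(dim=-1)
print(proc.batch_decode(pred_ids)[0])
```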
## Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0001
- train_batch_size: 32
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 64
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 1000
- num_epochs: 200
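For orientation, these settings map onto a `transformers.TrainingArguments` roughly as follows. This is a sketch, not the original training script; `output_dir` and the save/eval cadence are illustrative assumptions.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xls-r-300m-Urdu",  # hypothetical path
    learning_rate=1e-4,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=8,
    seed=42,
    gradient_accumulation_steps=2,  # 32 * 2 = 64 effective train batch size
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="linear",
    warmup_steps=1000,
    num_train_epochs=200,
)
```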
## Training results
| Training Loss | Epoch  | Step | Validation Loss | Wer    | Cer    |
|:-------------:|:------:|:----:|:---------------:|:------:|:------:|
| 3.6398        | 30.77  | 400  | 3.3517          | 1.0    | 1.0    |
| 2.9225        | 61.54  | 800  | 2.5123          | 1.0    | 0.8310 |
| 1.2568        | 92.31  | 1200 | 0.9699          | 0.6273 | 0.2575 |
| 0.8974        | 123.08 | 1600 | 0.9715          | 0.5888 | 0.2457 |
| 0.7151        | 153.85 | 2000 | 0.9984          | 0.5588 | 0.2353 |
| 0.6416        | 184.62 | 2400 | 0.9889          | 0.5607 | 0.2370 |
## Framework versions
- Transformers 4.17.0.dev0
- Pytorch 1.10.2+cu102
- Datasets 1.18.2.dev0
- Tokenizers 0.11.0
## Eval results on Common Voice 8 "test" (WER)

| Without LM | With LM (run `./eval.py`) |
|:----------:|:-------------------------:|
| 52.03      | 39.89                     |