akera/whisper-large-v3-kin-200h-v2

This model is a fine-tuned version of Whisper large-v3 for Automatic Speech Recognition (ASR) in Kinyarwanda, trained on 200 hours of labeled speech data. It was developed as part of the research presented in the paper:

How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu

The paper investigates the minimum data volumes required for viable ASR performance in low-resource African languages, using Kinyarwanda and Kikuyu as case studies. It demonstrates that practical ASR performance (WER < 13%) can be achieved with as little as 50 hours of training data, with significant improvements up to 200 hours (WER < 10%). The research also highlights the critical role of data quality in achieving robust system performance.

For more details on the experiments, data preparation, and full code, please refer to the accompanying GitHub repository.
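The model can be loaded through the Hugging Face `transformers` ASR pipeline. A minimal usage sketch (the audio file name and runtime settings below are illustrative, not part of the release):

```python
from transformers import pipeline

# Model ID from this card.
MODEL_ID = "akera/whisper-large-v3-kin-200h-v2"

def transcribe(audio_path: str) -> str:
    """Transcribe a Kinyarwanda audio file with the fine-tuned model."""
    asr = pipeline(
        "automatic-speech-recognition",
        model=MODEL_ID,
        chunk_length_s=30,  # Whisper operates on 30-second windows
    )
    return asr(audio_path)["text"]
```

For example, `transcribe("sample.wav")` returns the transcription of `sample.wav` as a string; pass `device=0` (or a `torch_dtype`) to the pipeline to run on GPU.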

Training Configurations and Results

This model (akera/whisper-large-v3-kin-200h-v2) is one of several models trained and evaluated in the linked research. Below is a summary of the training configurations and their performance (WER, CER) on a dev_test[:300] subset, as reported in the GitHub repository.

| Config | Hours | Model ID on Hugging Face | WER (%) | CER (%) | Score |
|---|---|---|---|---|---|
| baseline.yaml | 0 | openai/whisper-large-v3 | 33.10 | 9.80 | 0.861 |
| train_1h.yaml | 1 | akera/whisper-large-v3-kin-1h-v2 | 47.63 | 16.97 | 0.754 |
| train_50h.yaml | 50 | akera/whisper-large-v3-kin-50h-v2 | 12.51 | 3.31 | 0.932 |
| train_100h.yaml | 100 | akera/whisper-large-v3-kin-100h-v2 | 10.90 | 2.84 | 0.943 |
| train_150h.yaml | 150 | akera/whisper-large-v3-kin-150h-v2 | 10.21 | 2.64 | 0.948 |
| train_200h.yaml | 200 | akera/whisper-large-v3-kin-200h-v2 | 9.82 | 2.56 | 0.951 |
| train_500h.yaml | 500 | akera/whisper-large-v3-kin-500h-v2 | 8.24 | 2.15 | 0.963 |
| train_1000h.yaml | 1000 | akera/whisper-large-v3-kin-1000h-v2 | 7.65 | 1.98 | 0.967 |
| train_full.yaml | ~1400 | akera/whisper-large-v3-kin-full | 7.14 | 1.88 | 0.970 |

Score = 1 - (0.6 × CER + 0.4 × WER)
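The composite score above can be computed directly; a small sketch, assuming WER and CER are supplied as fractions rather than percentages:

```python
def asr_score(wer: float, cer: float) -> float:
    """Composite quality score: 1 - (0.6 * CER + 0.4 * WER).

    `wer` and `cer` are fractions, e.g. 0.0982 for a WER of 9.82%.
    Higher is better; a perfect system (WER = CER = 0) scores 1.0.
    """
    return 1.0 - (0.6 * cer + 0.4 * wer)
```

The weighting favors CER, so the score rewards systems that get most characters right even when whole-word errors remain.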

Citation

@article{akera2025how,
    title={How much speech data is necessary for ASR in African languages? An evaluation of data scaling in Kinyarwanda and Kikuyu},
    author={Benjamin Akera and Evelyn Nafula and Patrick Walukagga and Gilbert Yiga and John Quinn and Ernest Mwebaze},
    journal={arXiv preprint arXiv:2510.07221},
    year={2025}
}
Model size: 2B parameters (F32, Safetensors)