## Abstract
Most speech technology targets high-resource languages, leaving low-resource languages such as Idoma digitally underrepresented. This project bridges that gap by developing a fully functional bidirectional English-Idoma speech system.
The system follows an applied research design to address the data scarcity and tonal complexity inherent in the Idoma language. It integrates three fine-tuned transformer models to enable natural conversation flow:
- ASR (Automatic Speech Recognition): Fine-tuned Wav2Vec 2.0 XLS-R.
- NMT (Neural Machine Translation): Fine-tuned NLLB-200 (base checkpoint: facebook/nllb-200-1.3B).
- TTS (Text-to-Speech): Fine-tuned VITS-based MMS-TTS.
### Core Components
- Speech-to-Text (STT): Converts input audio to text using Wav2Vec2 (Idoma) and Whisper Large v3 (English).
- Machine Translation (NMT): Translates text between English and Idoma using NLLB-200.
- Text-to-Speech (TTS): Synthesizes the translated text into speech using VITS (Idoma) and SpeechT5 (English).
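At inference time these components run back to back. The following minimal sketch shows the English-to-Idoma direction with Hugging Face `transformers` pipelines; note two assumptions not guaranteed by this README: the Idoma model IDs use the `mrheartng` namespace seen in the dataset list below, and `idu_Latn` is assumed to be the Idoma language code added to the fine-tuned NLLB tokenizer.

```python
# Minimal sketch of the English -> Idoma direction using Hugging Face
# `transformers` pipelines. ASSUMPTIONS: the Idoma repos live under the
# `mrheartng` namespace, and "idu_Latn" is the Idoma code added to the
# fine-tuned NLLB tokenizer; neither is guaranteed by this README.
from transformers import pipeline

# 1) STT: transcribe English speech (Whisper Large v3, per the list above).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
english_text = asr("english_input.wav")["text"]

# 2) NMT: translate English -> Idoma with the fine-tuned NLLB-200 model.
nmt = pipeline(
    "translation",
    model="mrheartng/idu-eng-translator",
    src_lang="eng_Latn",
    tgt_lang="idu_Latn",  # assumed language code for Idoma
)
idoma_text = nmt(english_text)[0]["translation_text"]

# 3) TTS: synthesize Idoma speech with the fine-tuned VITS MMS-TTS model.
tts = pipeline("text-to-speech", model="mrheartng/idoma-mms-tts-eng")
speech = tts(idoma_text)  # dict with "audio" waveform and "sampling_rate"
```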
## Performance Results
The system was evaluated on a held-out test set. The table below lists the quantitative metrics reported in the thesis:
| Component | Model Architecture | Metric | Score |
|---|---|---|---|
| Idoma STT | Wav2Vec 2.0 XLS-R | Word Error Rate (WER) | 11.43% |
| Idoma STT | Wav2Vec 2.0 XLS-R | Character Error Rate (CER) | 3.5% |
| NMT Engine | NLLB-200 | BLEU | 31.42 |
| NMT Engine | NLLB-200 | spBLEU | 33.25 |
| NMT Engine | NLLB-200 | ChrF++ | 50.51 |
| Idoma TTS | VITS MMS-TTS | MOS (Intelligibility) | 4.36 / 5.0 |
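To reproduce these metrics on your own splits, the common `jiwer` (WER/CER) and `sacrebleu` (BLEU/spBLEU/ChrF++) packages can be used; this is a hedged sketch of the standard computations, not necessarily the exact tooling used in the thesis:

```python
# Hedged sketch of the evaluation metrics above, using `jiwer` and
# `sacrebleu`. Scores will only match the thesis numbers on the same
# test split and text normalization.
import jiwer
import sacrebleu

references = ["ref sentence one", "ref sentence two"]   # gold text
hypotheses = ["hyp sentence one", "hyp sentence two"]   # model output

wer = jiwer.wer(references, hypotheses)                 # Word Error Rate
cer = jiwer.cer(references, hypotheses)                 # Character Error Rate

bleu = sacrebleu.corpus_bleu(hypotheses, [references])  # BLEU
spbleu = sacrebleu.corpus_bleu(                         # spBLEU uses the
    hypotheses, [references], tokenize="flores200"      # FLORES-200 tokenizer
)
chrf = sacrebleu.corpus_chrf(                           # word_order=2 gives
    hypotheses, [references], word_order=2              # ChrF++ (not plain ChrF)
)

print(f"WER={wer:.2%} CER={cer:.2%} BLEU={bleu.score:.2f} "
      f"spBLEU={spbleu.score:.2f} ChrF++={chrf.score:.2f}")
```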
## Software Artefacts & Models
All models and datasets developed during this research are hosted on Hugging Face and are automatically downloaded by this application upon first run.
| Function | Model Name | Hugging Face Link |
|---|---|---|
| STT | wav2vec2-xls-r-1b-finetuned-idoma | View Model |
| TTS | idoma-mms-tts-eng | View Model |
| NMT | idu-eng-translator | View Model |
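The artefacts can also be loaded directly with explicit model classes rather than through this application. A minimal sketch, assuming all three repos are published under the same `mrheartng` namespace as the datasets below (only `idu-eng-translator` is confirmed by its model page):

```python
# Minimal sketch of loading each artefact with explicit model classes.
# ASSUMPTION: the repos live under the `mrheartng` namespace; only
# `mrheartng/idu-eng-translator` is confirmed by its model page.
from transformers import (
    AutoModelForCTC,        # Wav2Vec 2.0 CTC head for STT
    AutoModelForSeq2SeqLM,  # NLLB-200 encoder-decoder for NMT
    AutoProcessor,
    AutoTokenizer,
    VitsModel,              # VITS-based MMS-TTS for speech synthesis
)

stt_processor = AutoProcessor.from_pretrained("mrheartng/wav2vec2-xls-r-1b-finetuned-idoma")
stt_model = AutoModelForCTC.from_pretrained("mrheartng/wav2vec2-xls-r-1b-finetuned-idoma")

nmt_tokenizer = AutoTokenizer.from_pretrained("mrheartng/idu-eng-translator")
nmt_model = AutoModelForSeq2SeqLM.from_pretrained("mrheartng/idu-eng-translator")

tts_tokenizer = AutoTokenizer.from_pretrained("mrheartng/idoma-mms-tts-eng")
tts_model = VitsModel.from_pretrained("mrheartng/idoma-mms-tts-eng")
```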
### Curated Datasets
- Adah-Idoma Dataset: mrheartng/adah-idoma
- Idoma TTS (Speaker 1): mrheartng/idoma-tts-speaker1
- Idoma STT (Multi-speaker): mrheartng/idoma-tts-multiple-speakers
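The datasets can be pulled programmatically with the `datasets` library; split and column names are best inspected after loading:

```python
# Pull the curated datasets from the Hugging Face Hub; inspect the
# returned DatasetDict for the actual split and column names.
from datasets import load_dataset

adah = load_dataset("mrheartng/adah-idoma")
tts_speaker1 = load_dataset("mrheartng/idoma-tts-speaker1")
stt_multi = load_dataset("mrheartng/idoma-tts-multiple-speakers")

print(adah)  # shows splits, features, and row counts
```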
## Installation & Usage
### Prerequisites
- Python 3.8+
- CUDA-enabled GPU (Recommended for faster inference)
- FFmpeg (Required for audio processing)
### Setup
1. Clone the repository:

```bash
git clone https://github.com/mrheart/idoma-english-bidirectional-tts-stt.git
cd idoma-english-bidirectional-tts-stt/idoma_translator
```

2. Install dependencies:

```bash
pip install -r requirements.txt
```

3. Run the application:

```bash
python app.py
```

4. Access the UI: open your browser at the local Gradio URL printed in the terminal (usually http://127.0.0.1:7860).
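For orientation, the sketch below shows the general shape of a Gradio entry point like `app.py`; it is an illustrative skeleton with a placeholder function, not the repository's actual code:

```python
# Illustrative Gradio skeleton (NOT the repository's actual app.py):
# wires a speech-in/speech-out function into a web UI that Gradio
# serves at http://127.0.0.1:7860 by default.
import gradio as gr

def translate_speech(audio_path: str, direction: str) -> str:
    # Placeholder: the real app chains STT -> NMT -> TTS here and
    # returns the path of the synthesized audio file.
    return audio_path

demo = gr.Interface(
    fn=translate_speech,
    inputs=[gr.Audio(type="filepath"),
            gr.Radio(["English to Idoma", "Idoma to English"])],
    outputs=gr.Audio(),
    title="Bidirectional English-Idoma Speech System",
)

if __name__ == "__main__":
    demo.launch()
```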
## Project Structure
```
├── config/              # Configuration settings and model IDs
├── src/
│   ├── services/        # Core inference logic (ASR, NMT, TTS)
│   ├── model_loader.py  # Singleton pattern for memory-efficient model loading
│   └── utils.py         # Audio processing utilities
├── tests/               # Unit tests
├── app.py               # Main entry point (Gradio Application)
└── requirements.txt     # Project dependencies
```
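The `model_loader.py` entry above refers to a singleton pattern for memory-efficient model loading. A minimal sketch of that idea follows; the function name and structure are hypothetical, not the repository's actual implementation:

```python
# Hypothetical sketch of the singleton idea behind model_loader.py:
# each checkpoint is loaded at most once and cached for reuse, so
# repeated requests do not reload multi-gigabyte weights.
from functools import lru_cache
from transformers import pipeline

@lru_cache(maxsize=None)
def get_pipeline(task: str, model_id: str):
    """Return a cached transformers pipeline, loading it on first use."""
    return pipeline(task, model=model_id)

asr_a = get_pipeline("automatic-speech-recognition", "openai/whisper-large-v3")
asr_b = get_pipeline("automatic-speech-recognition", "openai/whisper-large-v3")
assert asr_a is asr_b  # second call reuses the cached object
```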
## Citation
If you use this code, models, or dataset in your research, please cite the thesis:
```bibtex
@mastersthesis{Ojabo2025Idoma,
  author = {Ojabo, John Heart},
  title  = {Developing a Bidirectional English-Idoma Speech-to-Text (STT) and Text-to-Speech (TTS) System for Enhanced Communication and Language Preservation},
  school = {University of Essex},
  year   = {2025}
}
```
## Contact
For questions regarding the dataset or model fine-tuning parameters, please open an issue in this repository or contact the author via Hugging Face.