HiTZ/Aholab's Basque Speech-to-Text model Conformer-Transducer v2
Model Description
This model transcribes speech into the lowercase Spanish+Basque alphabet, including spaces. It was trained on a composite dataset comprising 1366.18 hours of Basque, Spanish, and bilingual (with code-switching events) speech. The model was fine-tuned from the pre-trained bilingual Basque-Spanish BBS-S2TC_conformer_transducer_large model using the NVIDIA NeMo toolkit. It is an autoregressive "large" variant of Conformer, with around 119 million parameters. See the Model Architecture section and the NeMo documentation for complete architecture details.
Usage
To train, fine-tune, or play with the model you will need to install NVIDIA NeMo. We recommend you install it after you have installed the latest PyTorch version.
pip install nemo_toolkit['all']
Transcribing using Python
Clone the repository to download the model:
git clone https://huggingface.co/HiTZ/stt_eseu_conformer_transducer_large
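Alternatively, the checkpoint can be fetched programmatically. A minimal sketch using the huggingface_hub library (assuming it is installed; this is not part of the original instructions):

from huggingface_hub import hf_hub_download

# Download the .nemo checkpoint from the Hugging Face Hub; the returned
# local path can be used directly as NEMO_MODEL_FILEPATH below.
NEMO_MODEL_FILEPATH = hf_hub_download(
    repo_id="HiTZ/stt_eseu_conformer_transducer_large",
    filename="stt_eseu_conformer_transducer_large.nemo",
)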
Let NEMO_MODEL_FILEPATH be the path to the downloaded stt_eseu_conformer_transducer_large.nemo file.
import nemo.collections.asr as nemo_asr
# Load the model
asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(NEMO_MODEL_FILEPATH)
# Create a list pointing to the audio files
audio = ["audio_1.wav","audio_2.wav", ..., "audio_n.wav"]
# Set the batch_size to whatever number suits your purpose
batch_size = 8
# Transcribe the audio files
transcriptions = asr_model.transcribe(audio=audio, batch_size=batch_size)
# Visualize the transcriptions
print(transcriptions)
Input
This model accepts 16000 Hz (16 kHz) mono-channel audio (WAV files) as input.
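If your recordings are not already 16 kHz mono, they can be converted beforehand. A minimal sketch assuming torchaudio is available (this preprocessing step is not part of the original instructions):

import torchaudio

# Load the original recording (any sample rate, any channel count)
waveform, sample_rate = torchaudio.load("original_audio.wav")

# Downmix to mono by averaging channels, then resample to 16 kHz
waveform = waveform.mean(dim=0, keepdim=True)
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

# Save the 16 kHz mono WAV file the model expects
torchaudio.save("audio_1.wav", waveform, 16000)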
Output
This model provides transcribed speech as a string for a given audio sample.
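Depending on the NeMo version, transcribe() may return plain strings or Hypothesis objects; a hedged sketch for pairing each input file with its text:

# Pair each input file with its transcription. Some NeMo versions return
# Hypothesis objects (text in the .text attribute), others plain strings.
for path, hyp in zip(audio, transcriptions):
    text = hyp.text if hasattr(hyp, "text") else hyp
    print(f"{path}: {text}")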
Model Architecture
The Conformer-Transducer model is an autoregressive variant of the Conformer model [1] for Automatic Speech Recognition that uses Transducer loss/decoding instead of CTC loss. You may find more details about this model here: Conformer-Transducer Model.
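As a quick sanity check (a sketch, not part of the original card), the parameter count of the restored model can be verified with plain PyTorch, assuming asr_model was loaded as shown above:

# Count the parameters of the restored model (~119 million expected)
num_params = sum(p.numel() for p in asr_model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")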
Training
Data preparation
This model has been trained on a composite dataset comprising 1366.18 hours of publicly available Basque, Spanish, and bilingual speech. More information about the data distribution of this corpus is available at Composite Corpus ESEU v1.0.
Training procedure
This model was trained starting from the pre-trained bilingual Basque-Spanish model BBS-S2TC_conformer_transducer_large for several hundred epochs on multiple GPU devices, using the NeMo toolkit [3]. The tokenizer for this model was built from the text transcripts of the composite train dataset with this script, yielding a SentencePiece [2] vocabulary of 256 tokens.
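For illustration only, a comparable SentencePiece Unigram tokenizer can be trained with the sentencepiece library [2]; the file train_transcripts.txt is a hypothetical plain-text file with one transcript per line, not the actual training artifact:

import sentencepiece as spm

# Train a 256-token Unigram tokenizer on the training transcripts
# (train_transcripts.txt is hypothetical; the released model's tokenizer
# was built with a NeMo script on the composite train set).
spm.SentencePieceTrainer.train(
    input="train_transcripts.txt",
    model_prefix="tokenizer",
    model_type="unigram",
    vocab_size=256,
)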
Performance
The performance of the ASR model is reported in terms of Word Error Rate (WER%) with greedy decoding in the following table. ES → Spanish, EU → Basque, BI → Bilingual (Basque-Spanish).
| Evaluation set | Test WER (%) | Dev WER (%) |
|---|---|---|
| MCV 18.0 ES | 8.01 | 7.67 |
| OpenSLR ES | 10.67 | 10.09 |
| Basque Parliament ES | 2.62 | 3.17 |
| MLS ES | 5.80 | 5.10 |
| VoxPopuli ES | 7.95 | 5.85 |
| MCV 18.0 EU | 3.52 | 2.84 |
| OpenSLR EU | 12.48 | 12.96 |
| Basque Parliament EU | 4.10 | 4.49 |
| Basque Parliament BI | 5.05 | 5.08 |

Tokenizer: SentencePiece Unigram [2], vocabulary size 256. Train dataset: Composite Dataset (1366.18 h).
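To compute a comparable WER figure on your own data, a minimal sketch using the jiwer package (an assumption; not necessarily the evaluation code behind the table above):

import jiwer

# Reference transcripts and the model's outputs for the same audio files
references = ["kaixo mundua", "hola mundo"]
hypotheses = ["kaixo mundua", "hola mundos"]

# jiwer returns WER as a fraction; multiply by 100 for WER%
wer = jiwer.wer(references, hypotheses)
print(f"WER: {100 * wer:.2f}%")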
Limitations
Since this model was trained on publicly available speech datasets, its performance might degrade on speech that includes technical terms or vernacular the model has not been trained on. The model might also perform worse on accented speech.
Additional Information
Author
HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.
Licensing Information
Copyright (c) 2025 HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Funding
This project, with reference 2022/TL22/00215335, has been partially funded by the Ministerio de Transformación Digital and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU, within the ILENIA project, and by the IkerGaitu project funded by the Basque Government. This model was trained at Hyperion, one of the high-performance computing (HPC) systems hosted by the DIPC Supercomputing Center.
References
- [1] Conformer: Convolution-augmented Transformer for Speech Recognition
- [2] Google Sentencepiece Tokenizer
- [3] NVIDIA NeMo Toolkit
Disclaimer
The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or other undesirable distortions. When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.
In no event shall the owner and creator of the models (HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU) be liable for any results arising from the use made by third parties of these models.