# whisper-small-hindi
## Model Description
This is a fine-tuned version of the OpenAI [`whisper-small`](https://huggingface.co/openai/whisper-small) model, specifically adapted for Hindi speech recognition. The model was trained on the Hindi subset of the Mozilla Common Voice 17.0 dataset to improve transcription accuracy for Hindi audio.
---
## Intended Uses & Limitations
**Intended Uses:**
- Automatic speech recognition (ASR) for Hindi-language audio.
- Speech-to-text transcription services targeting Hindi speakers.
- Integration into voice-enabled applications and platforms requiring Hindi transcription.
**Limitations:**
- The model’s performance depends on the diversity and quality of the training data.
- May not generalize well to Hindi dialects or accents not represented in Common Voice 17.0.
- Performance can degrade in noisy, overlapping, or highly conversational speech scenarios.
---
## Training and Evaluation Data
- **Training Dataset:** Mozilla Common Voice 17.0 (Hindi subset).
- **Evaluation Dataset:** Standard Common Voice 17.0 Hindi test split.
- Both splits come from the same publicly available dataset and include transcripts for speech recognition tasks.
---
## Training Procedure
| Hyperparameter | Value |
|-----------------------------|----------------------------------------|
| Learning Rate | 1e-5 |
| Training Batch Size | 8 |
| Evaluation Batch Size | 8 |
| Gradient Accumulation Steps | 2 |
| Total Effective Batch Size | 16 |
| Optimizer | Adam (betas=(0.9, 0.999), epsilon=1e-8) |
| Learning Rate Scheduler | Linear |
| Warmup Steps | 250 |
| Total Training Steps | 1000 |
| Mixed Precision Training | Native AMP (Automatic Mixed Precision) |
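
The table implies two derived quantities: the effective batch size (8 × 2 = 16) and a learning rate that ramps up for 250 steps and then decays linearly to zero at step 1000. The sketch below works through both in plain Python. It assumes a single training device and the usual warmup-then-linear-decay shape; the `training_kwargs` dictionary is a hypothetical mapping onto `Seq2SeqTrainingArguments` keyword names, and `fp16=True` is an assumption about what "Native AMP" corresponds to, not the author's confirmed configuration.

```python
# Values copied from the hyperparameter table.
LEARNING_RATE = 1e-5
WARMUP_STEPS = 250
TOTAL_STEPS = 1000
TRAIN_BATCH_SIZE = 8
GRAD_ACCUM_STEPS = 2


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Samples contributing to one optimizer update (num_devices=1 is assumed)."""
    return per_device_batch * grad_accum_steps * num_devices


def linear_lr(step, base_lr=LEARNING_RATE,
              warmup_steps=WARMUP_STEPS, total_steps=TOTAL_STEPS):
    """Learning rate at `step` for linear warmup followed by linear decay."""
    if step < warmup_steps:
        # Ramp from 0 up to base_lr over the warmup phase.
        return base_lr * step / warmup_steps
    # Decay from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


# Hypothetical mapping of the table onto Seq2SeqTrainingArguments keywords.
training_kwargs = {
    "learning_rate": LEARNING_RATE,
    "per_device_train_batch_size": TRAIN_BATCH_SIZE,
    "per_device_eval_batch_size": 8,
    "gradient_accumulation_steps": GRAD_ACCUM_STEPS,
    "lr_scheduler_type": "linear",
    "warmup_steps": WARMUP_STEPS,
    "max_steps": TOTAL_STEPS,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "fp16": True,  # assumption: "Native AMP" meant fp16 mixed precision
}

for step in (0, 125, 250, 625, 1000):
    print(f"step {step:4d}: lr = {linear_lr(step):.2e}")
```

With these settings the peak learning rate is reached exactly at step 250, so a quarter of the 1000-step run is spent in warmup.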
---
## Training Results
- Framework versions: Hugging Face Transformers 4.39.3, PyTorch 2.6.0+cu124, Tokenizers 0.15.2.
- The author reports a competitive Word Error Rate (WER) on the Common Voice 17.0 Hindi test split, but no numeric scores are listed here.
- For detailed evaluation metrics, contact the author.
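
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch in plain Python, assuming whitespace tokenization and no text normalization; this card does not state which normalization, if any, was used for the reported evaluation, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
print(word_error_rate("मैं घर जा रहा हूँ", "मैं घर जा रही हूँ"))  # 0.2
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.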
---
## Usage
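
The usage snippet in this section expects `audio_input` to be a 16 kHz waveform. As one way to obtain it, the sketch below reads a 16-bit PCM WAV file with the Python standard library, averages the channels to mono, and scales samples to floats in [-1, 1). The helper names `load_wav_16k_mono` and `write_sine_wav` are hypothetical, and the loader deliberately rejects other sample rates instead of resampling; for compressed formats or resampling, an audio library would be needed.

```python
import math
import os
import sys
import tempfile
import wave
from array import array

TARGET_RATE = 16000  # sampling rate passed to the processor in this card


def load_wav_16k_mono(path):
    """Read a 16-bit PCM WAV at 16 kHz and return mono floats in [-1, 1)."""
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != TARGET_RATE:
            raise ValueError(f"expected {TARGET_RATE} Hz, got {wav.getframerate()} Hz")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())

    samples = array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian

    # Average interleaved channels, then scale int16 values to floats.
    return [
        sum(samples[i:i + channels]) / (channels * 32768.0)
        for i in range(0, len(samples), channels)
    ]


def write_sine_wav(path, num_samples=1600, freq=440.0):
    """Write a short mono test tone (demo helper only)."""
    pcm = array("h", (
        int(0.5 * 32767 * math.sin(2 * math.pi * freq * n / TARGET_RATE))
        for n in range(num_samples)
    ))
    if sys.byteorder == "big":
        pcm.byteswap()
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(TARGET_RATE)
        wav.writeframes(pcm.tobytes())


# Demo: write a 0.1 s tone to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "tone.wav")
    write_sine_wav(demo_path)
    audio_input = load_wav_16k_mono(demo_path)

print(len(audio_input), "samples loaded")
```

The returned list of floats can be passed to the processor in place of the `...` placeholder.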
```python
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# Load the fine-tuned processor (feature extractor + tokenizer) and model.
processor = WhisperProcessor.from_pretrained("bohraanuj23/whisper-small-hindi")
model = WhisperForConditionalGeneration.from_pretrained("bohraanuj23/whisper-small-hindi")

audio_input = ...  # load your 16 kHz mono audio array here

# Convert the raw waveform into log-Mel input features.
inputs = processor(audio_input, sampling_rate=16000, return_tensors="pt")

# Generate token IDs without tracking gradients, then decode them to text.
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print("Transcription:", transcription)
```
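---
## Citation
If you use this model, please cite the Whisper paper: *Robust Speech Recognition via Large-Scale Weak Supervision* (arXiv:2212.04356).
---
## License
Apache License 2.0
---
## Contact
For questions or issues, please open a GitHub issue or contact the author directly.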