---
language:
- en
tags:
- text2text-generation
license: mit
datasets:
- PeacefulData/HyPoradise-v0
library_name: transformers
pipeline_tag: text2text-generation
widget:
- text: "Generate the correct transcription for the following n-best list of ASR hypotheses: \n\n1. nebode also typically is symphons and an ankle surf leash \n2. neboda is also typically is symphons and an ankle surf leash \n3. nebode also typically is swim fins and an ankle surf leash \n4. neboda also typically is symphons and an ankle surf leash \n5. neboda is also typically is swim fins and an ankle surf leash"
base_model:
- google/flan-t5-base
---

# FLANEC: Exploring FLAN-T5 for Post-ASR Error Correction

## Model Overview

FLANEC is an encoder-decoder model based on FLAN-T5, specifically fine-tuned for post-Automatic Speech Recognition (ASR) error correction, also known as Generative Speech Error Correction (GenSEC). The model uses n-best hypotheses from ASR systems to enhance the accuracy and grammaticality of final transcriptions by generating a single corrected output. FLANEC models are trained on diverse subsets of the [HyPoradise dataset](https://huggingface.co/datasets/PeacefulData/HyPoradise-v0), leveraging multiple ASR domains to provide robust, scalable error correction across different types of audio data.

FLANEC was developed for the **GenSEC Task 1 challenge at SLT 2024** - [Challenge website](https://sites.google.com/view/gensec-challenge/home).

> **⚠️ IMPORTANT**: This repository contains the Single-Dataset (SD) versions of FLANEC models. Each model is trained on a single specific dataset from the HyPoradise collection, allowing for domain-specialized ASR error correction. For models trained on the cumulative dataset (CD), please see the related models section below.

## Repository Structure

This repository contains multiple model variants trained individually on each dataset from the HyPoradise collection:

```
flanec-sd-models/
├── flanec-base-sd-ft/       # Base models (250M params) with full fine-tuning
│   ├── atis/                # ATIS dataset model
│   ├── chime4/              # CHiME-4 dataset model
│   └── ...                  # Other dataset models
├── flanec-base-sd-lora/     # Base models with LoRA fine-tuning
├── flanec-large-sd-ft/      # Large models (800M params) with full fine-tuning
├── flanec-large-sd-lora/    # Large models with LoRA fine-tuning
├── flanec-xl-sd-ft/         # XL models (3B params) with full fine-tuning
└── flanec-xl-sd-lora/       # XL models with LoRA fine-tuning
```

Each dataset directory contains the best model checkpoint along with its tokenizer.

## Getting Started

### Cloning the Repository

**Warning**: This repository is very large because it contains multiple model variants across different sizes and datasets.

```bash
git clone https://huggingface.co/morenolq/flanec-sd-models
```

To avoid downloading the full repository, you can use the Hugging Face CLI to fetch only a specific model:

```bash
# Install the Hugging Face Hub CLI if you haven't already
pip install -U "huggingface_hub[cli]"

# Download only a specific model variant and dataset
huggingface-cli download morenolq/flanec-sd-models --include "flanec-base-sd-ft/atis/**" --local-dir flanec-sd-models
```
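
The same selective download can also be done from Python with `huggingface_hub` (a minimal sketch; adjust `allow_patterns` to the variant and dataset you need):

```python
from huggingface_hub import snapshot_download

# Download only the base, fully fine-tuned ATIS model into a local folder
snapshot_download(
    repo_id="morenolq/flanec-sd-models",
    allow_patterns=["flanec-base-sd-ft/atis/**"],
    local_dir="flanec-sd-models",
)
```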

### Using a Model

To use a specific model:

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Choose a specific model path based on:
# 1. Model size (base, large, xl)
# 2. Training method (ft, lora)
# 3. Dataset (atis, wsj, chime4, etc.)
model_path = "path/to/flanec-sd-models/flanec-base-sd-ft/atis"
tokenizer = T5Tokenizer.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)

# Example input with n-best ASR hypotheses
input_text = """Generate the correct transcription for the following n-best list of ASR hypotheses:

1. i need to fly from dallas to chicago next monday
2. i need to fly from dallas to chicago next thursday
3. i need to fly from dallas to chicago on monday
4. i need to fly dallas to chicago next monday
5. i need to fly from dallas chicago next monday"""

input_ids = tokenizer(input_text, return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=128)
corrected_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(corrected_text)
```
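
If your hypotheses arrive as a plain Python list from an ASR system, a small helper can build the prompt in the same format (an illustrative sketch; `build_prompt` is not part of the released code, and the snippet reuses the `model` and `tokenizer` loaded above):

```python
def build_prompt(hypotheses):
    """Format an n-best hypothesis list into the prompt used by FLANEC."""
    numbered = "\n".join(f"{i}. {hyp}" for i, hyp in enumerate(hypotheses, start=1))
    return (
        "Generate the correct transcription for the following "
        "n-best list of ASR hypotheses:\n\n" + numbered
    )

nbest = [
    "i need to fly from dallas to chicago next monday",
    "i need to fly from dallas to chicago next thursday",
    "i need to fly from dallas to chicago on monday",
]
input_ids = tokenizer(build_prompt(nbest), return_tensors="pt").input_ids
outputs = model.generate(input_ids, max_length=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```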

## Model Variants

### Available Model Sizes

- **Base**: ~250 million parameters
- **Large**: ~800 million parameters
- **XL**: ~3 billion parameters
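
For the XL variants, loading in half precision keeps memory usage manageable (a sketch assuming PyTorch and, for `device_map="auto"`, the `accelerate` package are installed):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

xl_path = "path/to/flanec-sd-models/flanec-xl-sd-ft/atis"  # any XL directory
tokenizer = T5Tokenizer.from_pretrained(xl_path)
model = T5ForConditionalGeneration.from_pretrained(
    xl_path,
    torch_dtype=torch.bfloat16,  # half precision to reduce memory
    device_map="auto",           # requires the `accelerate` package
)
```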

### Training Methods

- **Full Fine-tuning (ft)**: All model parameters are updated during training
- **LoRA (lora)**: Low-Rank Adaptation for parameter-efficient fine-tuning
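
The LoRA variants may be stored as adapter weights rather than full checkpoints. If that is the case, they can be loaded on top of the corresponding FLAN-T5 base model with the `peft` library (a minimal sketch under that assumption; check the files in each LoRA directory before relying on it):

```python
from peft import PeftModel
from transformers import T5ForConditionalGeneration, T5Tokenizer

adapter_path = "path/to/flanec-sd-models/flanec-base-sd-lora/atis"
base = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")
model = PeftModel.from_pretrained(base, adapter_path)  # attach the LoRA adapter
tokenizer = T5Tokenizer.from_pretrained(adapter_path)

# Optionally merge the adapter into the base weights for faster inference
model = model.merge_and_unload()
```

If a LoRA directory instead contains a full merged checkpoint, the standard `from_pretrained` loading shown earlier works as-is.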

### Datasets

All models are trained on specific subsets of the HyPoradise dataset:

1. **WSJ**: Business and financial news.
2. **ATIS**: Airline travel queries.
3. **CHiME-4**: Noisy speech.
4. **Tedlium-3**: TED talks.
5. **CV-accent**: Accented speech.
6. **SwitchBoard**: Conversational speech.
7. **LRS2**: BBC program audio.
8. **CORAAL**: Accented speech from African American English speakers.

For more details on each dataset, see the [HyPoradise paper](https://proceedings.neurips.cc/paper_files/paper/2023/hash/6492267465a7ac507be1f9fd1174e78d-Abstract-Datasets_and_Benchmarks.html).

## Related Models

If you're looking for models trained on the combined datasets (Cumulative Dataset models), please check:

**Full Fine-tuning (FT) Cumulative Dataset Models:**
- [FLANEC Base CD](https://huggingface.co/morenolq/flanec-base-cd): Base model fine-tuned on all domains.
- [FLANEC Large CD](https://huggingface.co/morenolq/flanec-large-cd): Large model fine-tuned on all domains.
- [FLANEC XL CD](https://huggingface.co/morenolq/flanec-xl-cd): Extra-large model fine-tuned on all domains.

**LoRA Cumulative Dataset Models:**
- [FLANEC Base LoRA CD](https://huggingface.co/morenolq/flanec-base-cd-lora): Base model with LoRA fine-tuning.
- [FLANEC Large LoRA CD](https://huggingface.co/morenolq/flanec-large-cd-lora): Large model with LoRA fine-tuning.
- [FLANEC XL LoRA CD](https://huggingface.co/morenolq/flanec-xl-cd-lora): XL model with LoRA fine-tuning.

## Performance Overview

Our research demonstrated that:

- Single-dataset models excel at their specific domains but may not generalize well to others
- Larger models generally deliver better performance within their domain
- Full fine-tuning typically outperforms LoRA, especially for larger models
- The CORAAL dataset presents unique challenges across all model configurations

For detailed performance metrics and analysis, please see the [FlanEC paper](https://arxiv.org/abs/2501.12979).

## Intended Use

FLANEC is designed for Generative Speech Error Correction (GenSEC): post-processing ASR outputs to correct grammatical and linguistic errors. The models support the **English** language.

## Citation

- [arXiv version](https://arxiv.org/abs/2501.12979)
- [IEEE Xplore version](https://ieeexplore.ieee.org/document/10832257)

Please use the following citation to reference this work in your research:

```bibtex
@inproceedings{laquatra2024flanec,
  author    = {Moreno La Quatra and Valerio Mario Salerno and Yu Tsao and Sabato Marco Siniscalchi},
  title     = {FlanEC: Exploring Flan-T5 for Post-ASR Error Correction},
  booktitle = {2024 IEEE Spoken Language Technology Workshop (SLT)},
  year      = {2024},
  doi       = {10.1109/slt61566.2024.10832257},
  url       = {https://doi.org/10.1109/slt61566.2024.10832257}
}
```