Questions about streaming with Parakeet and TDT merging methods
I’m currently trying to work with Parakeet in streaming mode, receiving microphone chunks and generating live transcriptions.
As a reference, I’m using the following code for streaming: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_buffered_infer_rnnt.py
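For context, here is roughly the kind of naive per-chunk loop I am starting from. The sounddevice/soundfile glue, the chunk length, and the re-transcription of the trailing window are my own choices rather than anything from the NeMo script; each window is decoded independently, which is exactly where the merging question comes up:

```python
# Naive chunked microphone loop (a sketch of my setup, not the NeMo buffered pipeline).
# Assumptions: sounddevice/soundfile for capture and IO, 16 kHz mono input,
# and independent transcription of each trailing window.
import tempfile

import numpy as np
import sounddevice as sd
import soundfile as sf
import nemo.collections.asr as nemo_asr

SAMPLE_RATE = 16000      # Parakeet expects 16 kHz mono audio
CHUNK_SECS = 2.0         # arbitrary chunk length chosen for this sketch
WINDOW_SECS = 4.0        # trailing window that gets re-transcribed each step

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
model.eval()

def transcribe_window(audio: np.ndarray) -> str:
    """Write the window to a temporary WAV file and transcribe it."""
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        sf.write(tmp.name, audio, SAMPLE_RATE)
        result = model.transcribe([tmp.name])
        if isinstance(result, tuple):   # older NeMo returns (best, all) for RNNT/TDT
            result = result[0]
        hyp = result[0]
        return hyp.text if hasattr(hyp, "text") else hyp

buffer = np.zeros(0, dtype=np.float32)
with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, dtype="float32") as stream:
    while True:
        chunk, _ = stream.read(int(SAMPLE_RATE * CHUNK_SECS))
        buffer = np.concatenate([buffer, chunk[:, 0]])
        # Each trailing window is decoded independently; the partial texts still
        # have to be merged, which is the part that behaves differently for TDT.
        print(transcribe_window(buffer[-int(SAMPLE_RATE * WINDOW_SECS):]))
```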
However, I’ve run into some questions:
Why don’t the more conventional merging methods work well for TDT? I tested them, and the transcription quality dropped significantly.
Is there already an implementation available for this use case (streaming with Parakeet using microphone chunks)?
I responded in the related thread: https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2/discussions/63#68cc58004fdfe65cc5d61be5
In brief:
- Please use the new streaming pipeline: https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/asr_chunked_inference/rnnt/speech_to_text_streaming_infer_rnnt.py (a rough invocation sketch follows after this list)
- You can use https://github.com/NVIDIA-NeMo/NeMo/pull/14759 as a reference for chunked streaming with a microphone
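Running the script looks roughly like this. The Hydra-style override names below are carried over from the older buffered RNNT script and are an assumption on my side; please check the config dataclass at the top of the new streaming script for the exact fields:

```bash
python examples/asr/asr_chunked_inference/rnnt/speech_to_text_streaming_infer_rnnt.py \
    pretrained_name="nvidia/parakeet-tdt-0.6b-v2" \
    dataset_manifest=/path/to/manifest.json \
    output_filename=/path/to/output.json \
    chunk_len_in_secs=2.0 \
    batch_size=16
```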
Guys, maybe this will help.
I finally got streaming from the microphone working in a Gradio Space. The microphone errors are gone now; I also fought with that problem for a long time.
The Space itself isn’t polished, but the streaming concept and the Gradio integration finally work.
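A minimal sketch of the kind of Gradio wiring I mean (this is not the Space’s exact code; the streaming-state pattern follows Gradio’s real-time ASR guide, and writing each buffer to a temporary WAV before calling NeMo’s transcribe is my own simplification):

```python
# Minimal Gradio streaming sketch (not the Space's exact code). A streaming
# microphone Audio input plus a state that accumulates samples; the whole
# buffer is re-transcribed on every incoming chunk.
import tempfile

import gradio as gr
import numpy as np
import soundfile as sf
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
model.eval()

def transcribe(stream, new_chunk):
    sr, y = new_chunk                     # Gradio delivers (sample_rate, samples)
    y = y.astype(np.float32) / 32768.0    # assuming int16 PCM from the browser
    if y.ndim > 1:                        # downmix stereo microphones to mono
        y = y.mean(axis=1)
    stream = y if stream is None else np.concatenate([stream, y])

    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        sf.write(tmp.name, stream, sr)
        result = model.transcribe([tmp.name])
        if isinstance(result, tuple):     # older NeMo returns (best, all) for RNNT/TDT
            result = result[0]
        hyp = result[0]
        text = hyp.text if hasattr(hyp, "text") else hyp
    return stream, text

demo = gr.Interface(
    transcribe,
    inputs=["state", gr.Audio(sources=["microphone"], streaming=True)],
    outputs=["state", "text"],
    live=True,
)
demo.launch()
```

Re-transcribing the whole buffer on every chunk obviously does not scale for long sessions, so a real setup would keep only a sliding window and merge the partial texts, but it is enough to prove the microphone-to-Gradio path.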