CHUNK_SIZE in streaming diarization.
You guys built a great model, I read through your paper and I’m glad to see you developed it.
Regarding the offline diarization model: it uses a lot of VRAM when processing long audio files (over an hour or so).
About the new model's chunk size: can I increase it? The default max chunk size is 300, but I'd like to try 600 or 1200. Is there anything I should keep in mind about the other parameters when increasing the chunk size?
Thanks a lot.
Hi @anhalu,
Thank you for your interest in our work and your positive feedback!
To answer your question: yes, you can absolutely experiment with a larger chunk size without changing the other parameters.
Two things to keep in mind:
- Accuracy: Increasing the chunk size generally improves diarization accuracy, but only up to a point. Increasing it further can reduce accuracy due to the mismatch with the training setup, which used a chunk size of 188 frames.
- Speed & VRAM: A very large chunk size can reduce speed and increase VRAM consumption, because of the quadratic complexity of self-attention.
Given that, we've found a chunk size of 340 frames to be the sweet spot among VRAM usage, speed, and accuracy.
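To make the quadratic-scaling point concrete, here's a minimal back-of-the-envelope sketch (illustrative only; `attn_score_elems`, the head count, and the frame counts are assumptions for the sake of the example, not the model's actual internals):

```python
def attn_score_elems(chunk_frames: int, num_heads: int = 8) -> int:
    """Elements in the self-attention score matrix for one chunk:
    num_heads x chunk_frames x chunk_frames, i.e. quadratic in chunk length."""
    return num_heads * chunk_frames * chunk_frames

# Doubling the chunk from 300 to 600 frames quadruples the score matrix;
# going to 1200 frames makes it 16x larger.
print(attn_score_elems(600) / attn_score_elems(300))   # -> 4.0
print(attn_score_elems(1200) / attn_score_elems(300))  # -> 16.0
```

So a chunk size of 1200 would cost roughly 16x the attention-score memory of the default 300, which is why the gains from longer chunks get eaten by VRAM and speed fairly quickly.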
Hope this helps, and we'd be very interested to hear how your experiments turn out!