A few questions about the codec
Thanks for the excellent codec, I just have a few questions.
- Was training the backbone required? It seems only training the head and upsampler is truly required.
- What losses were used?
- Is there any reason for specifically choosing hop length of 98 or 147 in v1?
Thanks once again, your work is great.
Thanks for the thoughtful questions! Quick answers below.
1) Was training the backbone required?
In theory, fine‑tuning only the ISTFT head + upsampler can work. I didn’t have the compute to run full ablations, so I can’t claim it’s sufficient in all cases.
In practice I did a short warm‑up (~10% of steps) training only head+upsampler, then trained the whole decoder (the encoder was kept frozen to preserve the codebook). Training more of the decoder generally gave slightly better quality at the cost of compute.
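In case it helps, here is a rough sketch of how that freezing schedule could be wired up in PyTorch. The submodule names (`encoder`, `decoder`, `decoder.head`, `decoder.upsampler`) and the optimizer settings are placeholders I'm assuming for illustration, not the actual training script:

```python
import torch
from torch import nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Toggle requires_grad for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag

def configure_phase(codec: nn.Module, phase: int) -> torch.optim.Optimizer:
    """Freeze/unfreeze parts of a codec for the two training phases.

    Assumes (hypothetically) the model exposes `encoder`, `decoder`,
    `decoder.head`, and `decoder.upsampler` submodules.
    """
    set_trainable(codec.encoder, False)            # encoder stays frozen to preserve the codebook
    if phase == 1:                                 # warm-up (~10% of steps): head + upsampler only
        set_trainable(codec.decoder, False)
        set_trainable(codec.decoder.head, True)
        set_trainable(codec.decoder.upsampler, True)
    else:                                          # remaining steps: full decoder
        set_trainable(codec.decoder, True)
    # Rebuild the optimizer over whatever is currently trainable at the phase switch.
    return torch.optim.AdamW(
        (p for p in codec.parameters() if p.requires_grad), lr=1e-4
    )
```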
2) What losses were used?
Same as XCodec2, plus the RMS loss from Inworld‑TTS. (No STFT loss.)
- Phase 1 (head + upsampler only):
  lambda_adv=0, lambda_disc=0, lambda_feat_match=0, lambda_mel=15, lambda_perceptual=0, lambda_rms=1, lambda_semantic=5
- Phase 2 (full decoder):
  lambda_adv=1, lambda_disc=1, lambda_feat_match=1, lambda_mel=15, lambda_perceptual=0.3 (0 until 75% of training, then linearly ramped to 0.3 by 85%), lambda_rms=1, lambda_semantic=5
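If it's useful, here's roughly how those weights and the perceptual ramp could be written down in Python. The names mirror the lambdas above but are only illustrative; they don't come from the actual training code:

```python
# Illustrative loss-weight configs for the two phases (names are assumptions).
PHASE1_WEIGHTS = dict(adv=0.0, disc=0.0, feat_match=0.0, mel=15.0,
                      perceptual=0.0, rms=1.0, semantic=5.0)
PHASE2_WEIGHTS = dict(adv=1.0, disc=1.0, feat_match=1.0, mel=15.0,
                      perceptual=0.3, rms=1.0, semantic=5.0)

def perceptual_weight(progress: float, target: float = 0.3) -> float:
    """Phase-2 perceptual weight: 0 until 75% of training,
    then a linear ramp that reaches `target` at 85%."""
    if progress < 0.75:
        return 0.0
    if progress >= 0.85:
        return target
    return target * (progress - 0.75) / 0.10
```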
3) Why hop length 98 or 147?
- They satisfy the codec‑rate constraint
  sample_rate = 50 * hop_length * (∏ upsample_strides).
  For 44.1 kHz, both 98×[3,3] (U=9) and 147×[3,2] (U=6) give 44,100 Hz (a quick sanity check is sketched after this list).
- I didn't run a thorough ablation. In v1 I sometimes heard faint electronic‑like artifacts (especially on sustained vowels), so I increased the frame density (e.g., 98×[3,3] at 44.1 kHz).
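For anyone wanting to try other combinations, a quick check of that constraint (assuming the 50 Hz token rate from above) might look like:

```python
from math import prod

def codec_sample_rate(hop_length: int, upsample_strides: list[int],
                      token_rate: int = 50) -> int:
    """sample_rate = token_rate * hop_length * prod(upsample_strides)"""
    return token_rate * hop_length * prod(upsample_strides)

assert codec_sample_rate(98, [3, 3]) == 44_100   # U = 9
assert codec_sample_rate(147, [3, 2]) == 44_100  # U = 6
```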
Hope this helps and thanks again for the kind words!
Found your finetune a while ago and also wanted to express my gratitude! It works amazingly well, and is a surprisingly capable upsampler as well!
I'm using it for a passion project where I create voices for a game that was originally unvoiced. The results are already great, but I wondered if additional finetuning on the specific voices I'm using might improve the quality even more. The hyperparameters you posted are already a very good starting point, but before I assemble suitable training code myself, I thought why not ask you if you would be so kind as to publish/share your finetuning code somewhere.
Totally understand if you don't want to do that though! So in any case, thx again, great work!
@STK393
Thanks so much! I’m really glad it’s useful!
I do plan to open‑source my fine‑tuning code, but it still needs a bit of cleanup (hard‑coded paths, logging, etc.), so I haven’t posted it yet😢. No ETA, but it’s on my list.
@OmniAICreator
Oh wow! That's great to hear! :D
Take your time. And if you ever need some help when it comes to testing or even cleanup etc, feel free to reach out!