Anime-XCodec2-44.1kHz-v2: A 44.1kHz Upsampling Variant of Anime-XCodec2 (v2)
TL;DR: Anime-XCodec2-44.1kHz-v2 is a fine-tuned variant of NandemoGHS/Anime-XCodec2. It incorporates upsampling layers and RMS loss (inspired by Inworld TTS-1) to produce 44.1kHz output, trained on ~22k hours of Japanese speech. This v2 updates upsampler parameters, loss configurations, and fixes a RoPE bug from the original XCodec2.
Only the decoder was updated; the encoder and codebook remain frozen, so speech tokens are identical to the original XCodec2. This makes the model a drop-in decoder for downstream systems that already work with XCodec2 tokens (e.g., Llasa).
Quick Links
- Demo (Gradio / Hugging Face Spaces): https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-44.1kHz-v2-Demo
- This repository (v2 44.1kHz fine-tune): NandemoGHS/Anime-XCodec2-44.1kHz-v2
- Baseline 16kHz model: NandemoGHS/Anime-XCodec2
- Original XCodec2: HKUSTAudio/xcodec2
- Reference Paper (Inworld TTS-1): https://arxiv.org/abs/2507.21138
- Reference Implementation (Inworld TTS): https://github.com/inworld-ai/tts
1) Model Summary
- What it is: A neural speech codec based on Anime-XCodec2 (which is based on XCodec2), fine-tuned to output 44.1kHz high-fidelity Japanese speech (anime/game-style). (Version 2)
- Key Change: Integrates an `UpSamplerBlock` into the decoder architecture and adds an RMS loss to training (both inspired by Inworld TTS-1).
- Training scope: Decoder-only fine-tuning on ~22,000 hours of Japanese data. Encoder and codebook are frozen.
- Compatibility: Speech tokens are identical to `HKUSTAudio/xcodec2` and `NandemoGHS/Anime-XCodec2`.
- Input sampling rate: 16 kHz (for encoding, same as XCodec2).
- Output sampling rate: 44.1 kHz (decoded audio).
2) Intended Use
- Decode XCodec2 speech tokens (e.g., from Llasa or other AR generators) into high-fidelity 44.1kHz Japanese speech (anime/game-style).
- Upgrade existing `Anime-XCodec2` (16kHz) pipelines to 44.1kHz output.
- Audio Super-Resolution: As the model accepts 16kHz input and outputs 44.1kHz reconstructed audio, it can also be used as a form of audio super-resolution. However, its performance for this specific purpose is untested/unevaluated.
3) How to Use (Important)
This model modifies the original XCodec2 architecture (upsampler blocks) and requires a custom library version that includes a fix for the RoPE bug (Issue #36).
You MUST use the provided custom xcodec2 library fork (v0.1.7 or later) for inference. The standard library or older custom libraries (like 0.1.6) will not work.
Installation:
```bash
# Install the custom xcodec2 library (v0.1.7)
pip install https://huggingface.co/NandemoGHS/Anime-XCodec2-44.1kHz-v2/resolve/main/xcodec2-0.1.7.tar.gz
```

Usage: Once the custom library is installed, you can load and use this model just as you would the original XCodec2 or Anime-XCodec2 models. The core inference logic remains the same.
For a complete, working code example, please refer to my Hugging Face Spaces Demo: https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-44.1kHz-v2-Demo
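As a rough sketch of what inference looks like (method names here follow the original `xcodec2` package's `XCodec2Model.encode_code` / `decode_code` usage; the `decode_to_44k` and `demo` helpers are hypothetical names, and the Spaces demo linked above remains the authoritative example):

```python
# Hedged sketch: encode 16kHz audio to XCodec2 tokens, decode to 44.1kHz.
# Assumes the custom xcodec2 v0.1.7 fork is installed and exposes the same
# XCodec2Model API as the original package.
import torch

def decode_to_44k(model, vq_codes):
    """Decode XCodec2 token ids to a waveform (44.1kHz with this model)."""
    with torch.no_grad():
        return model.decode_code(vq_codes)

def demo():
    # Not executed here: downloads the full model from the Hub.
    from xcodec2.modeling_xcodec2 import XCodec2Model  # custom v0.1.7 fork
    import soundfile as sf

    model = XCodec2Model.from_pretrained("NandemoGHS/Anime-XCodec2-44.1kHz-v2").eval()
    wav_16k = torch.randn(1, 16000)  # placeholder for 1s of real 16kHz speech
    codes = model.encode_code(input_waveform=wav_16k)  # tokens match XCodec2
    recon = decode_to_44k(model, codes)
    sf.write("output.wav", recon[0, 0].cpu().numpy(), 44100)  # 44.1kHz output
```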
4) Limitations & Trade-offs
- Language scope: Optimized for Japanese. Performance on other languages may degrade.
- Content domain: Tuned toward anime/game-style voices.
- Library Dependency: Requires the specific custom `xcodec2` library (v0.1.7) linked above. It is not compatible with the original `xcodec2` library or previous custom forks (e.g., v0.1.6).
5) Data (High-Level)
- ~22,000 hours of Japanese speech, with a focus on anime/game-style voices.
- Data was prepared for 44.1kHz target output during training.
6) Training Procedure (High-Level)
- Base Model: `NandemoGHS/Anime-XCodec2` (16kHz)
- Architecture Modification: Integrated the `UpSamplerBlock` from the Inworld TTS-1 implementation into the decoder.
- Loss Function: Adopted the RMS (root mean square) loss from Inworld TTS-1, in addition to the original losses.
- Frozen: Encoder and Codebook (token compatibility preserved).
- Updated (fine-tuned): `generator.backbone`, `generator.head`, `generator.upsampler`, `fc_post_a`
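The freeze/update split above follows a standard PyTorch pattern: disable gradients everywhere, then re-enable them only for the listed modules. The `ToyCodec` class below is a stand-in for illustration, not the real XCodec2 model; only the module names come from this card.

```python
# Hedged sketch of decoder-only fine-tuning: freeze all parameters, then
# re-enable gradients only for the modules this card lists as updated.
import torch.nn as nn

class ToyCodec(nn.Module):
    """Stand-in with the same top-level module names as the card lists."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)       # frozen
        self.quantizer = nn.Linear(4, 4)     # frozen (codebook)
        self.generator = nn.ModuleDict({
            "backbone": nn.Linear(4, 4),
            "head": nn.Linear(4, 4),
            "upsampler": nn.Linear(4, 4),
        })
        self.fc_post_a = nn.Linear(4, 4)

model = ToyCodec()
for p in model.parameters():
    p.requires_grad = False                  # freeze everything first

trainable_prefixes = ("generator.backbone", "generator.head",
                      "generator.upsampler", "fc_post_a")
for name, p in model.named_parameters():
    if name.startswith(trainable_prefixes):
        p.requires_grad = True               # unfreeze only the updated parts

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
```

Because the encoder and quantizer never receive gradients, token compatibility with the base model is preserved by construction.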
Key Updates in v2
Compared to the first version, this v2 model includes the following key updates to the training configuration:
- RoPE Bug Fix: Corrected a RoPE (Rotary Position Embedding) bug present in the original XCodec2 implementation (See Issue #36).
- Upsampler Parameters: The upsampler settings were changed to `hop_length=98`, `upsample_factors=[3, 3]`, and `kernel_sizes=[9, 9]`.
- Perceptual Loss Model: The model used for calculating perceptual loss was switched from facebook/wav2vec2-large-xlsr-53 to imprt/kushinada-hubert-large.
- Spectral Discriminator Tuning: The STFT (Short-Time Fourier Transform) settings for the spectral discriminator were adjusted to be more suitable for 44.1kHz high-sampling-rate audio.
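Assuming XCodec2's 50 Hz token rate (one token per 320 samples of 16kHz input, as in the original model), the v2 upsampler settings land exactly on the 44.1kHz target; a quick consistency check:

```python
# Consistency check for the v2 upsampler settings (assumes XCodec2 emits
# 50 tokens per second, i.e. one token per 320 samples at 16 kHz).
token_rate_hz = 16000 // 320      # 50 tokens per second (assumed)
hop_length = 98                   # v2 head hop length
upsample_factors = [3, 3]         # v2 UpSamplerBlock factors

samples_per_token = hop_length
for factor in upsample_factors:
    samples_per_token *= factor   # 98 * 3 * 3 = 882 output samples per token

print(samples_per_token * token_rate_hz)  # 44100 -> exactly 44.1 kHz
```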
7) License
- CC-BY-NC 4.0 (inherited from XCodec2 and Anime-XCodec2).
- See: https://creativecommons.org/licenses/by-nc/4.0/
8) Acknowledgements
- HKUSTAudio/xcodec2 (Original model)
- Inworld AI for their work on Inworld TTS-1 (Upsampler architecture and RMS Loss).
- imprt for the `kushinada-hubert-large` model used in perceptual loss.
- Thanks to contributors and the community around Japanese speech resources.