Anime-XCodec2-44.1kHz-v2: A 44.1kHz Upsampling Variant of Anime-XCodec2 (v2)

License: CC BY-NC 4.0

TL;DR: Anime-XCodec2-44.1kHz-v2 is a fine-tuned variant of NandemoGHS/Anime-XCodec2. It incorporates upsampling layers and an RMS loss (inspired by Inworld TTS-1) to produce 44.1kHz output, and was trained on ~22k hours of Japanese speech. This v2 updates the upsampler parameters and loss configuration, and fixes a RoPE bug inherited from the original XCodec2.

Only the decoder was updated; the encoder and codebook remain frozen, so speech tokens are identical to the original XCodec2. This makes the model a drop‑in decoder for downstream systems that already work with XCodec2 tokens (e.g., Llasa).


🔗 Quick Links


1) Model Summary

  • What it is: Version 2 of a neural speech codec based on Anime-XCodec2 (itself based on XCodec2), fine-tuned to output high-fidelity 44.1kHz Japanese speech (anime/game-style).
  • Key Change: Integrates an UpSamplerBlock into the decoder architecture and adds an RMS loss (inspired by Inworld TTS-1).
  • Training scope: Decoder-only fine-tuning on ~22,000 hours of Japanese data. Encoder and codebook are frozen.
  • Compatibility: Speech tokens are identical to HKUSTAudio/xcodec2 and NandemoGHS/Anime-XCodec2.
  • Input Sampling rate: 16 kHz (for encoding, same as XCodec2).
  • Output Sampling rate: 44.1 kHz (decoded audio).

2) Intended Use

  • Decode XCodec2 speech tokens (e.g., from Llasa or other AR generators) into high-fidelity 44.1kHz Japanese speech (anime/game-style).
  • Upgrade existing Anime-XCodec2 (16kHz) pipelines to 44.1kHz output.
  • Audio Super-Resolution: As the model accepts 16kHz input and outputs 44.1kHz reconstructed audio, it can also be used as a form of audio super-resolution. However, its performance for this specific purpose is untested/unevaluated.
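
For the super-resolution use case, the only extra step is resampling the source audio to 16 kHz before encoding. Below is a minimal sketch that uses torchaudio as an illustrative resampler (any 16 kHz resampler works; the file name is hypothetical); the encode/decode calls themselves are the same as in Section 3.

    import torchaudio

    # Load the source audio at its native rate, convert to mono, and resample
    # to the 16 kHz expected by the frozen XCodec2 encoder.
    wav, sr = torchaudio.load("source_48k.wav")              # hypothetical input file
    wav = wav.mean(dim=0, keepdim=True)                      # (1, T) mono
    wav_16k = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
    # wav_16k can then be encoded and decoded back at 44.1 kHz (see Section 3).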

3) How to Use (Important)

This model modifies the original XCodec2 architecture (upsampler blocks) and requires a custom library version that includes a fix for the RoPE bug (Issue #36).

You MUST use the provided custom xcodec2 library fork (v0.1.7 or later) for inference. The standard library or older custom libraries (like 0.1.6) will not work.

  • Installation:

    # Install the custom xcodec2 library (v0.1.7)
    pip install https://huggingface.co/NandemoGHS/Anime-XCodec2-44.1kHz-v2/resolve/main/xcodec2-0.1.7.tar.gz
    
  • Usage: Once the custom library is installed, you can load and use this model just as you would the original XCodec2 or Anime-XCodec2 models. The core inference logic remains the same.
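
For quick reference, here is a minimal end-to-end sketch. It assumes the standard XCodec2 Python API (XCodec2Model.from_pretrained, encode_code, decode_code), which the custom fork keeps unchanged; the file names and output tensor indexing are illustrative.

    import torch
    import soundfile as sf
    from xcodec2.modeling_xcodec2 import XCodec2Model

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = XCodec2Model.from_pretrained("NandemoGHS/Anime-XCodec2-44.1kHz-v2").to(device).eval()

    # Load a 16 kHz mono waveform (the frozen encoder expects 16 kHz input).
    wav, sr = sf.read("input_16k.wav")                           # illustrative file name
    wav = torch.from_numpy(wav).float().unsqueeze(0).to(device)  # shape (1, T)

    with torch.no_grad():
        # Tokens are identical to XCodec2 / Anime-XCodec2, so codes produced by a
        # downstream LM (e.g., Llasa) can be passed to decode_code directly.
        vq_codes = model.encode_code(input_waveform=wav)
        recon = model.decode_code(vq_codes).cpu()                # decoded at 44.1 kHz

    # decode_code is assumed to return (batch, channels, samples), as in the original XCodec2 example.
    sf.write("output_44k.wav", recon[0, 0, :].numpy(), 44100)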

For a complete, working code example, please refer to my Hugging Face Spaces Demo: https://huggingface.co/spaces/OmniAICreator/Anime-XCodec2-44.1kHz-v2-Demo


4) Limitations & Trade-offs

  • Language scope: Optimized for Japanese. Performance on other languages may degrade.
  • Content domain: Tuned toward anime/game-style voices.
  • Library Dependency: Requires the specific custom xcodec2 library (v0.1.7) linked above. It is not compatible with the original xcodec2 library or previous custom forks (e.g., v0.1.6).

5) Data (High-Level)

  • ~22,000 hours of Japanese speech, with a focus on anime/game-style voices.
  • Data was prepared for 44.1kHz target output during training.

6) Training Procedure (High-Level)

  • Base Model: NandemoGHS/Anime-XCodec2 (16kHz)
  • Architecture Modification:
    • Added an UpSamplerBlock to the decoder so it outputs 44.1kHz audio from the 16kHz-derived tokens (see Key Updates in v2 for the upsampler settings).
  • Loss Function:
    • Adopted an RMS (root-mean-square) loss from Inworld TTS-1, in addition to the original losses.
  • Frozen: Encoder and Codebook (token compatibility preserved).
  • Updated (fine-tuned): generator.backbone, generator.head, generator.upsampler, fc_post_a
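
As an illustration of this decoder-only setup, here is a minimal sketch of how the frozen/updated split above could be expressed. The module names are taken from the list in this card; exact parameter paths and the surrounding training loop may differ in the actual training code.

    from xcodec2.modeling_xcodec2 import XCodec2Model

    model = XCodec2Model.from_pretrained("NandemoGHS/Anime-XCodec2-44.1kHz-v2")

    # Freeze everything (encoder, codebook, ...), then re-enable gradients only
    # for the modules listed above.
    TRAINABLE_PREFIXES = ("generator.backbone", "generator.head",
                          "generator.upsampler", "fc_post_a")

    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(TRAINABLE_PREFIXES)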

Key Updates in v2

Compared to the first version, this v2 model includes the following key updates to the training configuration:

  1. RoPE Bug Fix: Corrected a RoPE (Rotary Position Embedding) bug present in the original XCodec2 implementation (See Issue #36).
  2. Upsampler Parameters: The upsampler settings were changed to hop_length=98, upsample_factors=[3, 3], and kernel_sizes=[9, 9].
  3. Perceptual Loss Model: The model used for calculating perceptual loss was switched from facebook/wav2vec2-large-xlsr-53 to imprt/kushinada-hubert-large.
  4. Spectral Discriminator Tuning: The STFT (Short-Time Fourier Transform) settings for the spectral discriminator were adjusted to be more suitable for 44.1kHz high-sampling-rate audio.
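
As a sanity check on these upsampler settings (assuming XCodec2's 50 tokens-per-second frame rate, i.e. a 320-sample hop at 16 kHz), the values multiply out exactly to the 44.1 kHz target:

    # 98-sample hop combined with two 3x upsampling stages:
    samples_per_token = 98 * 3 * 3                 # = 882
    tokens_per_second = 50                         # assumed XCodec2 frame rate
    print(samples_per_token * tokens_per_second)   # 44100 -> 44.1 kHz output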

7) License

This model is released under the CC BY-NC 4.0 license (Attribution-NonCommercial 4.0 International); commercial use is not permitted.

8) Acknowledgements

  • HKUSTAudio/xcodec2 (Original model)
  • Inworld AI for their work on Inworld TTS-1 (Upsampler architecture and RMS Loss).
  • imprt for the kushinada-hubert-large model used in perceptual loss.
  • Thanks to contributors and the community around Japanese speech resources.