---
title: Voice Clone TTS
emoji: π
colorFrom: green
colorTo: purple
sdk: gradio
sdk_version: 5.41.1
app_file: app.py
pinned: true
short_description: mcp_server
---
This Space is a text-to-speech (TTS) application built on the Zonos model.
## English Explanation
### Overview
This is a Gradio-based web application for the **Zonos Text-to-Speech (TTS) Generator**. Zonos is an advanced TTS model from Zyphra that can generate natural-sounding speech with customizable voice characteristics.
### Key Features
1. **Model Selection**
   - Two model variants: Transformer and Hybrid
   - Different models have different conditioning capabilities
2. **Text Input & Language Support**
   - Supports multiple languages through eSpeak phoneme conversion
   - Text length limit of 500 characters
   - Language selection from supported language codes
3. **Voice Customization**
   - **Speaker Cloning**: Upload audio to clone a specific voice
   - **Voice Quality Settings**:
     - DNS-MOS (Voice Quality): 1.0-5.0 scale
     - Frequency Max: Control the highest frequency in Hz
     - Voice Clarity: Adjust voice intelligibility
     - Pitch Variation: Control how much the pitch varies
     - Speaking Rate: Adjust speech speed
4. **Emotion Control**
   - 8 emotion sliders: Happiness, Sadness, Disgust, Fear, Surprise, Anger, Other, Neutral
   - Fine-tune emotional expression in the generated speech
5. **Advanced Generation Parameters**
   - **Guidance Scale**: Controls how closely the model follows the conditioning
   - **Min P**: Controls randomness/creativity in generation
   - **Seed**: For reproducible results
   - **Prefix Audio**: Continue generation from existing audio
6. **Unconditional Generation**
   - Toggle specific conditions to let the model generate them automatically
   - Useful for more creative/varied outputs
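To make the **Min P** and **Seed** parameters more concrete, here is a minimal sketch of generic min-p sampling on a toy score vector. It is not taken from the app's code: the helper names (`softmax`, `min_p_filter`, `sample_token`) and the toy values are illustrative assumptions, and the real sampler in Zonos may differ in detail.

```python
import math
import random


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens below min_p * max(probs), then renormalize the rest."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def sample_token(logits, min_p, rng):
    """Draw one token index from the min-p-filtered distribution."""
    probs = min_p_filter(softmax(logits), min_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy scores for four candidate tokens (hypothetical values).
logits = [2.0, 1.0, 0.1, -3.0]

# Two generators with the same seed produce the same sequence of draws.
rng_a = random.Random(1234)
rng_b = random.Random(1234)
draws_a = [sample_token(logits, 0.1, rng_a) for _ in range(10)]
draws_b = [sample_token(logits, 0.1, rng_b) for _ in range(10)]
print(draws_a == draws_b)  # True
```

A higher `min_p` removes more low-probability candidates and makes the output more conservative; a value near zero leaves the distribution almost untouched.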
### Technical Details
- Uses GPU acceleration via CUDA
- Implements classifier-free guidance for better control
- Supports audio continuation from prefix
- Real-time progress tracking during generation
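Classifier-free guidance is commonly written as a blend of a conditional and an unconditional prediction, weighted by the guidance scale. The sketch below shows that standard formulation on plain Python lists. The function name `apply_cfg` and the sample values are assumptions for illustration, and the source does not confirm that Zonos applies guidance in exactly this form.

```python
def apply_cfg(cond_logits, uncond_logits, guidance_scale):
    """Blend the two predictions: uncond + scale * (cond - uncond)."""
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]


# Hypothetical scores for three candidate tokens.
cond = [2.0, 0.5, -1.0]    # prediction with conditioning applied
uncond = [1.0, 0.5, 0.0]   # prediction with conditioning dropped

print(apply_cfg(cond, uncond, 1.0))  # [2.0, 0.5, -1.0]
print(apply_cfg(cond, uncond, 2.0))  # [3.0, 0.5, -2.0]
```

A scale of 1.0 reproduces the conditional prediction; larger values push the result further away from the unconditional one, which is why a higher Guidance Scale follows the conditioning more closely.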
### How to Use
1. Select a model variant
2. Enter your text and choose language
3. (Optional) Upload speaker audio for voice cloning
4. Adjust voice characteristics and emotions
5. Click "Generate Audio" to create speech
6. Download or play the generated audio
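Before clicking "Generate Audio", it can help to check inputs against the limits listed under Key Features. The sketch below does that with a hypothetical `TTSRequest` container; the class, its field names, the default DNS-MOS value, and the `en-us` language code are assumptions for illustration and are not part of the app's actual interface. How the app itself handles text over the limit is not stated in the source.

```python
from dataclasses import dataclass, field

MAX_TEXT_CHARS = 500        # text length limit listed under Key Features
DNSMOS_RANGE = (1.0, 5.0)   # DNS-MOS scale listed under Key Features
EMOTIONS = ("happiness", "sadness", "disgust", "fear",
            "surprise", "anger", "other", "neutral")


@dataclass
class TTSRequest:
    """Hypothetical container for one generation request."""
    text: str
    language: str
    dnsmos: float = 4.0                          # assumed default
    emotions: dict = field(default_factory=dict)
    seed: int = 0

    def validate(self):
        """Return a list of problems; an empty list means the request looks fine."""
        problems = []
        if not self.text.strip():
            problems.append("text is empty")
        if len(self.text) > MAX_TEXT_CHARS:
            problems.append(f"text exceeds {MAX_TEXT_CHARS} characters")
        low, high = DNSMOS_RANGE
        if not low <= self.dnsmos <= high:
            problems.append(f"dnsmos must be between {low} and {high}")
        unknown = sorted(set(self.emotions) - set(EMOTIONS))
        if unknown:
            problems.append(f"unknown emotions: {', '.join(unknown)}")
        return problems


ok = TTSRequest(text="Hello there.", language="en-us",
                emotions={"happiness": 0.8})
bad = TTSRequest(text="x" * 501, language="en-us", dnsmos=6.0,
                 emotions={"joy": 1.0})
print(ok.validate())   # []
print(bad.validate())
```

Returning a list of problems instead of raising on the first one lets a caller report every issue at once.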
### Special Features
- **Emotion Settings**: Fine-grained control over the emotional tone of the generated speech
- **Voice Quality**: Adjust voice quality through the DNS-MOS score
- **Speaker Noise Removal**: Option to denoise the uploaded speaker audio
- **Unconditional Keys**: Set specific features to be generated automatically

This application is a powerful, flexible tool for high-quality TTS generation and can be used to produce voice content for a variety of purposes.