---
title: VoiceKit MCP
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: "6.0.0"
app_file: app.py
pinned: false
tags:
  - building-mcp-track-creative
  - mcp-server
---

# 🎤 VoiceKit MCP

> **Professional voice analysis as MCP tools: extract embeddings, compare voices, transcribe speech, and more.**

6 powerful MCP tools for voice processing, all accepting base64-encoded audio.

📢 **Social Post:** [View on X](https://x.com/dahee_pk/status/1994389505898582442)
🎬 **Demo Video:** [Watch on YouTube](https://www.youtube.com/watch?v=1VIqvpwfyWU)
👥 **Team:** [@EricYoun](https://huggingface.co/EricYoun), [@NickEo](https://huggingface.co/NickEo), [@HYENA-WON](https://huggingface.co/HYENA-WON), [@jjin6573](https://huggingface.co/jjin6573), [@cocoajoa](https://huggingface.co/cocoajoa)

---

## 📋 Submission Info

| | |
|---|---|
| **Track** | Building MCP (Creative) |
| **MCP Endpoint** | `https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse` |
| **Framework** | Gradio 6.0 |

---

## ✅ Track 1 Requirements

| Requirement | How We Fulfill It |
|-------------|-------------------|
| **Functioning MCP Server** | 6 MCP tools exposed via Gradio's `mcp_server=True` |
| **MCP Client Demo** | Video shows integration with Claude Desktop / MCP client |
| **Documented Tools** | Full API documentation with inputs/outputs below |
| **Gradio App** | Interactive demo UI + hidden MCP tool interfaces |

---

## 🛠️ MCP Tools (6 Tools)

All tools accept **base64-encoded audio** as input.

### 1. `extract_embedding`

Extract voice embeddings using a Wav2Vec2 model.

| | |
|---|---|
| **Input** | `audio_base64` (base64-encoded audio) |
| **Output** | `embedding_preview` (first 5 values), `embedding_length` (768) |
| **Use Case** | Speaker identification, voice fingerprinting |

### 2. `match_voice`

Compare similarity between two voices.

| | |
|---|---|
| **Inputs** | `audio1_base64`, `audio2_base64` |
| **Output** | `similarity` (0-1), `tone_score` (0-100) |
| **Use Case** | Voice cloning verification, speaker matching |

### 3. `analyze_acoustics`

Extract detailed acoustic characteristics.

| | |
|---|---|
| **Input** | `audio_base64` |
| **Output** | Pitch, energy, rhythm, tempo, spectral info |
| **Use Case** | Emotional tone detection, voice profiling |

### 4. `transcribe_audio`

Convert speech to text (multilingual).

| | |
|---|---|
| **Inputs** | `audio_base64`, `language` (default: "en") |
| **Output** | Transcribed text, detected language |
| **Model** | ElevenLabs Scribe v1 |
| **Languages** | English, Korean, Japanese, and 15+ more |

### 5. `isolate_voice`

Remove background music/noise and extract the clean voice.

| | |
|---|---|
| **Input** | `audio_base64` (audio with background sounds) |
| **Output** | Isolated audio (base64), BGM detection status |
| **Use Case** | Audio cleanup for memes, songs, movies |

### 6. `grade_voice`

Comprehensive voice comparison with multi-metric scoring.

| | |
|---|---|
| **Inputs** | `user_audio_base64`, `reference_audio_base64`, `reference_text` (optional), `category` (meme\|song\|movie) |
| **Output** | Pitch, rhythm, energy, pronunciation scores (0-100), overall score, user transcription |
| **Use Case** | Voice mimicry evaluation, pronunciation games |

---

## 🏗️ Architecture

```
┌──────────────────────────────────────────────────────────────────┐
│                           VoiceKit MCP                           │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                    MCP Client (Claude)                     │  │
│  │                base64 audio → SSE endpoint                 │  │
│  └─────────────────────────────┬──────────────────────────────┘  │
│                                ↓                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                 Gradio MCP Server (app.py)                 │  │
│  │            mcp_server=True • 6 tool interfaces             │  │
│  └─────────────────────────────┬──────────────────────────────┘  │
│                                ↓                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                  Modal GPU Container (T4)                  │  │
│  │         Wav2Vec2 • librosa • ElevenLabs APIs • DTW         │  │
│  └─────────────────────────────┬──────────────────────────────┘  │
│                                ↓                                 │
│  ┌────────────────────────────────────────────────────────────┐  │
│  │                       JSON Response                        │  │
│  │         embeddings • scores • transcripts • audio          │  │
│  └────────────────────────────────────────────────────────────┘  │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```

---

## 🔌 How to Connect

### Claude Desktop / MCP Client

Add to your MCP configuration:

```json
{
  "mcpServers": {
    "voicekit": {
      "url": "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"
    }
  }
}
```

### Example Usage

```python
import base64

# 1. Encode audio to base64
with open("audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

# 2. Call the MCP tool (`mcp_client` is a placeholder for your connected MCP client)
result = mcp_client.call("extract_embedding", {"audio_base64": audio_base64})

# 3. Read the output fields documented for `extract_embedding`
embedding_preview = result["embedding_preview"]  # first 5 values
embedding_length = result["embedding_length"]    # 768
```

---

## 🛠️ Tech Stack

| Component | Technology |
|-----------|------------|
| MCP Server | Gradio 6.0 (`mcp_server=True`) |
| GPU Compute | Modal (T4 GPU) |
| Embeddings | Wav2Vec2 (facebook/wav2vec2-base-960h) |
| Speech-to-Text | ElevenLabs Scribe v1 |
| Voice Isolation | ElevenLabs Voice Isolator |
| Acoustic Analysis | librosa + scipy |

---

## ⚡ Performance

| Metric | Value |
|--------|-------|
| Response Time (warm) | <200ms |
| Cold Start | 1-3s (memory snapshot optimized) |
| Embedding Dimensions | 768 |
| Supported Audio | Any format (auto-converts to WAV) |
| Max Duration | Tested up to 10 minutes |

---

## 🎯 Why VoiceKit MCP?

| Criteria | Our Approach |
|----------|--------------|
| **Functionality** | 6 production-ready tools covering the full voice analysis pipeline |
| **Innovation** | First MCP server for comprehensive voice analysis |
| **Documentation** | Complete API docs with inputs/outputs/use cases |
| **Real-world Impact** | Powers the Voice Sementle game; applicable to voice cloning, accessibility, language learning |

---

## 🎮 Interactive Demo

👆 **Click the interface above to try each tool!**

1. Upload or record audio
2. Select a tool to test
3. View JSON results with scores and analysis
4. Copy embeddings or transcripts for your app

---

## 🔗 Related Projects

- **[Voice Sementle](https://huggingface.co/spaces/MCP-1st-Birthday/Voice-Sementle)**: Daily voice puzzle game powered by VoiceKit MCP

---

**Built for [MCP's 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)** 🎂

*Celebrating one year of Model Context Protocol!*
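
As a supplementary sketch for the `grade_voice` tool documented above, the snippet below shows one way a client might assemble its arguments before making the MCP call. The helper names (`encode_audio`, `build_grade_voice_args`) are hypothetical and not part of VoiceKit; the argument keys and category values come from the tool table, while leaving out `reference_text` when it is not provided is an assumption about how the server treats optional inputs.

```python
import base64
from typing import Optional

# Categories listed for `grade_voice` in the tool table.
ALLOWED_CATEGORIES = ("meme", "song", "movie")


def encode_audio(raw: bytes) -> str:
    """Return raw audio bytes as a base64-encoded ASCII string."""
    return base64.b64encode(raw).decode("ascii")


def build_grade_voice_args(
    user_audio: bytes,
    reference_audio: bytes,
    category: str,
    reference_text: Optional[str] = None,
) -> dict:
    """Assemble the argument dict for a `grade_voice` call (hypothetical helper)."""
    if category not in ALLOWED_CATEGORIES:
        raise ValueError(f"category must be one of {ALLOWED_CATEGORIES}, got {category!r}")

    args = {
        "user_audio_base64": encode_audio(user_audio),
        "reference_audio_base64": encode_audio(reference_audio),
        "category": category,
    }
    # Assumption: the optional field is simply omitted when not supplied.
    if reference_text is not None:
        args["reference_text"] = reference_text
    return args
```

The resulting dict can be passed as the arguments of a `grade_voice` tool call in whichever MCP client you use. Validating `category` on the client side avoids a round trip for an input the server would presumably reject.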