---
title: VoiceKit MCP
emoji: 🎤
colorFrom: purple
colorTo: indigo
sdk: gradio
sdk_version: "6.0.0"
app_file: app.py
pinned: false
tags:
- building-mcp-track-creative
- mcp-server
---
# 🎤 VoiceKit MCP
> **Professional voice analysis as MCP tools — extract embeddings, compare voices, transcribe speech, and more.**

Six MCP tools for voice processing, each accepting base64-encoded audio.

📢 **Social Post:** [View on X](https://x.com/dahee_pk/status/1994389505898582442)

🎬 **Demo Video:** [Watch on YouTube](https://www.youtube.com/watch?v=1VIqvpwfyWU)

👥 **Team:** [@EricYoun](https://huggingface.co/EricYoun), [@NickEo](https://huggingface.co/NickEo), [@HYENA-WON](https://huggingface.co/HYENA-WON), [@jjin6573](https://huggingface.co/jjin6573), [@cocoajoa](https://huggingface.co/cocoajoa)

---
## 📋 Submission Info
| | |
|---|---|
| **Track** | Building MCP — Creative |
| **MCP Endpoint** | `https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse` |
| **Framework** | Gradio 6.0 |
---
## ✅ Track 1 Requirements
| Requirement | How We Fulfill It |
|-------------|-------------------|
| **Functioning MCP Server** | 6 MCP tools exposed via Gradio's `mcp_server=True` |
| **MCP Client Demo** | Video shows integration with Claude Desktop / MCP client |
| **Documented Tools** | Full API documentation with inputs/outputs below |
| **Gradio App** | Interactive demo UI + hidden MCP tool interfaces |
---
## 🛠️ MCP Tools (6 Tools)
All tools accept **base64-encoded audio** as input.
### 1. `extract_embedding`
Extract a voice embedding using the Wav2Vec2 model (`facebook/wav2vec2-base-960h`).
| | |
|---|---|
| **Input** | `audio_base64` (base64-encoded audio) |
| **Output** | `embedding_preview` (first 5 values), `embedding_length` (768) |
| **Use Case** | Speaker identification, voice fingerprinting |
### 2. `match_voice`
Compare similarity between two voices.
| | |
|---|---|
| **Inputs** | `audio1_base64`, `audio2_base64` |
| **Output** | `similarity` (0-1), `tone_score` (0-100) |
| **Use Case** | Voice cloning verification, speaker matching |
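
The table above does not say how `similarity` is computed. A common way to compare two embedding vectors is cosine similarity, so the sketch below shows that technique as an assumption, not as VoiceKit's actual scoring. The function names and the rescaling from [-1, 1] to [0, 1] are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


def to_unit_interval(cos_sim):
    """Hypothetical rescaling of a cosine similarity from [-1, 1] to [0, 1]."""
    return (cos_sim + 1.0) / 2.0
```

Before rescaling, identical vectors score 1.0 and orthogonal vectors score 0.0.
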
### 3. `analyze_acoustics`
Extract detailed acoustic characteristics.
| | |
|---|---|
| **Input** | `audio_base64` |
| **Output** | Pitch, energy, rhythm, tempo, spectral info |
| **Use Case** | Emotional tone detection, voice profiling |
### 4. `transcribe_audio`
Convert speech to text (multilingual).
| | |
|---|---|
| **Inputs** | `audio_base64`, `language` (default: "en") |
| **Output** | Transcribed text, detected language |
| **Model** | ElevenLabs Scribe v1 |
| **Languages** | English, Korean, Japanese, and 15+ more |
### 5. `isolate_voice`
Remove background music and noise to extract a clean voice track.
| | |
|---|---|
| **Input** | `audio_base64` (audio with background sounds) |
| **Output** | Isolated audio (base64), BGM detection status |
| **Use Case** | Audio cleanup for memes, songs, movies |
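
Because the isolated audio comes back as base64, a client has to decode it before saving or playing it. A minimal sketch, assuming you already hold the base64 string from the tool response; the helper name is hypothetical, and the README does not specify the response key that carries the audio, so check the actual output.

```python
import base64
from pathlib import Path


def save_base64_audio(audio_b64: str, out_path: str) -> int:
    """Decode a base64 string and write the raw bytes to out_path.

    Returns the number of bytes written.
    """
    # Tolerate a data-URI prefix such as "data:audio/wav;base64,...".
    if audio_b64.startswith("data:") and "," in audio_b64:
        audio_b64 = audio_b64.split(",", 1)[1]
    # validate=True rejects characters outside the base64 alphabet.
    raw = base64.b64decode(audio_b64, validate=True)
    Path(out_path).write_bytes(raw)
    return len(raw)
```

The decoded bytes are written unchanged, so the file extension you choose should match whatever container format the tool returns.
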
### 6. `grade_voice`
Comprehensive voice comparison with multi-metric scoring.
| | |
|---|---|
| **Inputs** | `user_audio_base64`, `reference_audio_base64`, `reference_text` (optional), `category` (meme\|song\|movie) |
| **Output** | Pitch, rhythm, energy, pronunciation scores (0-100), overall score, user transcription |
| **Use Case** | Voice mimicry evaluation, pronunciation games |
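
`grade_voice` takes the most parameters of the six tools, so it helps to assemble and validate the arguments in one place. The sketch below uses the parameter names from the table; the helper itself is hypothetical, and leaving out `reference_text` when it is not supplied is an assumption (the server may expect an empty string instead).

```python
import base64
from typing import Optional

VALID_CATEGORIES = {"meme", "song", "movie"}


def build_grade_voice_args(
    user_audio: bytes,
    reference_audio: bytes,
    category: str,
    reference_text: Optional[str] = None,
) -> dict:
    """Assemble the argument dict for the `grade_voice` tool."""
    if category not in VALID_CATEGORIES:
        raise ValueError(f"category must be one of {sorted(VALID_CATEGORIES)}")
    args = {
        "user_audio_base64": base64.b64encode(user_audio).decode("ascii"),
        "reference_audio_base64": base64.b64encode(reference_audio).decode("ascii"),
        "category": category,
    }
    # reference_text is documented as optional, so include it only when given.
    if reference_text is not None:
        args["reference_text"] = reference_text
    return args
```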
---
## 🏗️ Architecture
```
┌────────────────────────────────────────────────────────────────┐
│                          VoiceKit MCP                          │
├────────────────────────────────────────────────────────────────┤
│                                                                │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │                    MCP Client (Claude)                     │ │
│ │                base64 audio → SSE endpoint                 │ │
│ └──────────────────────────┬─────────────────────────────────┘ │
│                            ↓                                   │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │                 Gradio MCP Server (app.py)                 │ │
│ │            mcp_server=True • 6 tool interfaces             │ │
│ └──────────────────────────┬─────────────────────────────────┘ │
│                            ↓                                   │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │                  Modal GPU Container (T4)                  │ │
│ │         Wav2Vec2 • librosa • ElevenLabs APIs • DTW         │ │
│ └──────────────────────────┬─────────────────────────────────┘ │
│                            ↓                                   │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │                       JSON Response                        │ │
│ │         embeddings • scores • transcripts • audio          │ │
│ └────────────────────────────────────────────────────────────┘ │
│                                                                │
└────────────────────────────────────────────────────────────────┘
```
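
The diagram lists DTW (dynamic time warping) among the backend components without saying where it is applied. As background, here is a textbook DTW distance between two numeric sequences, for example pitch contours of different lengths. It is a generic sketch, not VoiceKit's implementation; the function name and the absolute-difference cost are assumptions. With this definition, a contour and a time-stretched copy of it (such as `[1, 2, 3]` and `[1, 2, 2, 3]`) have distance 0.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    Uses absolute difference as the local cost and allows the usual
    three step types (advance a, advance b, advance both).
    """
    if not a or not b:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b[j-1] is reused
                cost[i][j - 1],      # b advances, a[i-1] is reused
                cost[i - 1][j - 1],  # both advance
            )
    return cost[len(a)][len(b)]
```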
---
## 🔌 How to Connect
### Claude Desktop / MCP Client
Add to your MCP configuration:
```json
{
  "mcpServers": {
    "voicekit": {
      "url": "https://mcp-1st-birthday-voicekit.hf.space/gradio_api/mcp/sse"
    }
  }
}
```
### Example Usage
```python
# 1. Encode audio to base64
import base64
with open("audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()
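# 2. Call the MCP tool (`mcp_client` stands in for your MCP client object)
result = mcp_client.call("extract_embedding", {"audio_base64": audio_base64})
# 3. Read the documented output fields
embedding_preview = result["embedding_preview"]  # first 5 values
embedding_length = result["embedding_length"]  # 768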
```
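
`mcp_client` in the example above is a placeholder for whichever MCP client you use. On the wire, an MCP tool call is a JSON-RPC 2.0 request with the method `tools/call`. The sketch below builds such a request body to show its shape. The helper name is illustrative, and the tool name that the Gradio server actually exposes may differ from the names in this README (for instance by carrying a prefix), so list the server's tools first.

```python
import json
from itertools import count

_request_ids = count(1)  # each JSON-RPC request in a session needs its own id


def build_tool_call(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    message = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }
    return json.dumps(message)
```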
---
## 🛠️ Tech Stack
| Component | Technology |
|-----------|------------|
| MCP Server | Gradio 6.0 (`mcp_server=True`) |
| GPU Compute | Modal (T4 GPU) |
| Embeddings | Wav2Vec2 (facebook/wav2vec2-base-960h) |
| Speech-to-Text | ElevenLabs Scribe v1 |
| Voice Isolation | ElevenLabs Voice Isolator |
| Acoustic Analysis | librosa + scipy |
---
## ⚡ Performance
| Metric | Value |
|--------|-------|
| Response Time (warm) | < 200 ms |
| Cold Start | 1-3 s (memory snapshot optimized) |
| Embedding Dimensions | 768 |
| Supported Audio | Any format (auto-converts to WAV) |
| Max Duration | Tested up to 10 minutes |
---
## 🎯 Why VoiceKit MCP?
| Criteria | Our Approach |
|----------|--------------|
| **Functionality** | 6 production-ready tools covering the full voice analysis pipeline |
| **Innovation** | First MCP server for comprehensive voice analysis |
| **Documentation** | Complete API docs with inputs/outputs/use cases |
| **Real-world Impact** | Powers Voice Sementle game; applicable to voice cloning, accessibility, language learning |
---
## 🎮 Interactive Demo
👆 **Click the interface above to try each tool!**
1. Upload or record audio
2. Select a tool to test
3. View JSON results with scores and analysis
4. Copy embeddings or transcripts for your app
---
## 🔗 Related Projects
- **[Voice Sementle](https://huggingface.co/spaces/MCP-1st-Birthday/Voice-Sementle)** — Daily voice puzzle game powered by VoiceKit MCP
---
**Built for [MCP's 1st Birthday Hackathon](https://huggingface.co/MCP-1st-Birthday)** 🎂
*Celebrating one year of Model Context Protocol!*