---
title: Bizom_Voice_Assistant
app_file: app.py
sdk: gradio
sdk_version: 5.50.0
---

# FastRTC Audio Streaming with Transcription & TTS

A real-time audio streaming application built with FastRTC that provides:

- **Speech-to-Text (STT)**: Transcribes incoming audio in real time
- **Text-to-Speech (TTS)**: Converts transcribed text back to audio
- **API Streaming Support**: Connect from external clients (Android/KMM apps) via WebRTC
- **Bidirectional Communication**: Send and receive audio with transcription feedback

## Features

- 🎤 **Real-time Audio Streaming**: Low-latency audio streaming using WebRTC
- 📝 **Automatic Transcription**: Speech-to-text using the Moonshine STT model
- 🔊 **Voice Response**: Text-to-speech using the Kokoro TTS model
- 📑 **API Support**: Connect from external applications via the WebRTC API
- 🌐 **Network Access**: Accepts connections from the network (not just localhost)
- ⏸️ **Pause Detection**: Uses the `ReplyOnPause` handler to process complete utterances

## Prerequisites

- Python 3.8 or higher
- pip (Python package manager)
- Hugging Face account with API token (for Cloudflare TURN credentials in production)
- Google Gemini API key (optional, for AI responses)

## Installation

1. **Clone or navigate to the project directory:**

   ```bash
   cd fastrtc
   ```

2. **Create a virtual environment (recommended):**

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   ```

3. **Set up environment variables:**

   Create a `.env` file in the project root:

   ```bash
   touch .env
   ```

   Add your API keys to `.env`:

   ```env
   HF_TOKEN=hf_...       # Required for Cloudflare TURN credentials (production deployment)
   GEMINI_API_KEY=...    # Optional: for AI responses via Google Gemini
   ```

   **Important:** Never commit your `.env` file to git! It should be in `.gitignore`.

4.
   **Install dependencies:**

   ```bash
   pip install -r requirements.txt
   ```

   This will install:

   - `fastrtc[vad, stt, tts]` - FastRTC with VAD, STT, and TTS support
   - `google-genai` - Google Generative AI client
   - `python-dotenv` - Environment variable management
   - All required dependencies (numpy, gradio, etc.)

## Usage

### Running the Server

1. **Activate your virtual environment** (if not already activated):

   ```bash
   source venv/bin/activate
   ```

2. **Run the application:**

   ```bash
   python app.py
   ```

3. **Access the web interface:**

   - Open your browser and navigate to `http://localhost:7860`
   - The Gradio interface will be available for testing

### Server Configuration

The server is configured to:

- Listen on `0.0.0.0:7860` (accepts connections from the network)
- Use the `ReplyOnPause` handler (processes audio when the user pauses speaking)
- Support bidirectional audio (`send-receive` mode)

To modify settings, edit `app.py`:

```python
stream.ui.launch(
    server_name="0.0.0.0",  # Change to "127.0.0.1" for localhost only
    server_port=7860,       # Change port if needed
    share=False             # Set to True for a public Gradio URL
)
```

## API Streaming for External Clients

This application supports connecting from external clients (e.g., Android/KMM apps) via the FastRTC WebRTC API.
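As a quick orientation before the endpoint details, here is a minimal Python sketch of the HTTP side of the handshake from a client's point of view. The `build_offer` and `parse_answer` helpers are illustrative, not FastRTC APIs, and a real offer SDP must come from an actual WebRTC peer connection:

```python
import json


def build_offer(sdp):
    """Build the JSON body for POST /webrtc/offer.

    `sdp` is the offer SDP produced by the client's peer connection.
    """
    return json.dumps({"sdp": sdp, "type": "offer"})


def parse_answer(body):
    """Extract the answer SDP and session id from the server's response."""
    msg = json.loads(body)
    if msg.get("type") != "answer":
        raise ValueError("unexpected response type: %r" % msg.get("type"))
    # webrtc_id identifies this session in later messages
    return msg["sdp"], msg["webrtc_id"]
```

The POST itself can be done with any HTTP client (`urllib`, `requests`, OkHttp on Android); after `parse_answer`, the client sets the returned SDP as its remote description.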
### Connection Endpoint

- **URL**: `http://YOUR_SERVER_IP:7860/webrtc/offer`
- **Method**: POST
- **Content-Type**: `application/json`

### Request Format

```json
{
  "sdp": "<offer SDP>",
  "type": "offer"
}
```

### Response Format

```json
{
  "sdp": "<answer SDP>",
  "type": "answer",
  "webrtc_id": "<session id>"
}
```

### Message Types

The server sends messages over the data channel in the following format:

```json
{
  "type": "fetch_output" | "log" | "error" | "warning",
  "data": "<message payload>"
}
```

#### Transcription Messages

When audio is transcribed, clients receive:

```json
{
  "type": "fetch_output",
  "data": "<transcribed text>"
}
```

#### Log Messages

The server sends log messages for debugging:

```json
{
  "type": "log",
  "data": "pause_detected" | "response_starting" | "started_talking"
}
```

### Connecting from an Android/KMM App

1. **Establish a WebRTC connection:**
   - Create a `PeerConnection` with ICE servers
   - Create an audio track from the microphone
   - Create a data channel for text messages

2. **Send a WebRTC offer:**
   - Create an offer
   - POST it to the `/webrtc/offer` endpoint
   - Receive the answer and set the remote description

3. **Handle messages:**
   - Listen for `fetch_output` messages on the data channel
   - Display the transcription text
   - Play the received audio (TTS response)

4. **Receive audio:**
   - The audio track receives the TTS audio response
   - Play it through the device speakers/headphones

For a detailed Android/KMM implementation, see the [FastRTC API Documentation](https://fastrtc.org/userguide/api/).

## Architecture

### Components

1. **STT Model** (`moonshine/base`):
   - Converts speech audio to text
   - Processes complete utterances (on pause)

2. **TTS Model** (`kokoro`):
   - Converts transcribed text to speech audio
   - Voice: `af_heart`
   - Language: `en-us`

3. **ReplyOnPause Handler**:
   - Buffers audio chunks
   - Detects when the user stops speaking
   - Processes complete utterances

4.
   **Stream Handler**:
   - Receives audio from the client
   - Transcribes it using the STT model
   - Sends the transcription via `AdditionalOutputs`
   - Generates TTS audio
   - Returns the audio to the client

### Flow Diagram

```
Client (Android/Web)
        ↓ [Audio Stream]
WebRTC Connection
        ↓
ReplyOnPause Handler (buffers audio)
        ↓ [On Pause]
Echo Handler
        ↓
STT Model → Transcription
        ↓
AdditionalOutputs → Client (via Data Channel)
        ↓
TTS Model → Audio Response
        ↓ [Audio Stream]
WebRTC Connection
        ↓
Client (plays audio)
```

## Deployment

### Cloudflare TURN Configuration

This application uses Cloudflare TURN servers for improved WebRTC connectivity, which is especially important for production deployments where clients may be behind NATs or firewalls.

**Required for production:**

- Set the `HF_TOKEN` environment variable to your Hugging Face API token
- The application will automatically configure Cloudflare TURN credentials for both client and server

**Configuration details:**

- **Client RTC configuration**: Uses the async `get_cloudflare_turn_credentials_async()` to fetch credentials dynamically
- **Server RTC configuration**: Uses `get_cloudflare_turn_credentials(ttl=360_000)` with a 100-hour TTL
- If `HF_TOKEN` is not set, the app runs without TURN configuration (and may have connectivity issues in production)

### Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `HF_TOKEN` | Yes (for production) | Hugging Face API token for Cloudflare TURN credentials |
| `GEMINI_API_KEY` | No | Google Gemini API key for AI-powered responses |

### Deployment Platforms

The application can be deployed to various platforms:

1. **Cloud platforms** (AWS, GCP, Azure, etc.):
   - Set environment variables in your platform's configuration
   - Ensure port 7860 is accessible
   - The app listens on `0.0.0.0:7860` by default

2. **Docker**:

   ```dockerfile
   FROM python:3.11-slim
   WORKDIR /app
   COPY requirements.txt .
   RUN pip install -r requirements.txt
   COPY . .
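   # EXPOSE below is an optional addition, not part of the original example:
   # it documents the port the app listens on (7860, per this README) for
   # container tooling such as `docker run -P`.
   EXPOSE 7860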
   ENV HF_TOKEN=${HF_TOKEN}
   ENV GEMINI_API_KEY=${GEMINI_API_KEY}
   CMD ["python", "app.py"]
   ```

3. **Platform-as-a-Service** (Heroku, Railway, etc.):
   - Set environment variables in your platform dashboard
   - The app will automatically pick them up via `python-dotenv`

## Configuration

### TTS Options

Modify the TTS settings in `app.py`:

```python
tts_options = KokoroTTSOptions(
    voice="af_heart",  # Change voice
    speed=1.0,         # Adjust speed (0.5 - 2.0)
    lang="en-us"       # Change language
)
```

### STT Model

Change the STT model in `app.py`:

```python
stt_model = get_stt_model(model="moonshine/base")  # Change model
```

## Error Handling

The application includes error handling for:

- Empty transcriptions (yields silence)
- TTS generation errors (yields silence as a fallback)
- Connection errors (handled by FastRTC)

## Troubleshooting

### Server Not Accessible from the Network

- Ensure `server_name="0.0.0.0"` in `app.py`
- Check firewall settings
- Verify the server IP address

### No Transcription Received

- Check that audio is being sent from the client
- Verify the STT model loaded correctly
- Check the console logs for errors

### TTS Errors

- Ensure the text is not empty before calling TTS
- Check that the TTS model loaded correctly
- Verify the TTS options are valid

## Development

### Project Structure

```
fastrtc/
├── app.py             # Main application file
├── requirements.txt   # Python dependencies
├── README.md          # This file
└── venv/              # Virtual environment (gitignored)
```

### Dependencies

- `fastrtc[vad, stt, tts]` - FastRTC with VAD, STT, and TTS support
- `numpy` - Audio processing
- `gradio` - Web interface

## Resources

- [FastRTC Documentation](https://fastrtc.org/)
- [FastRTC API Guide](https://fastrtc.org/userguide/api/)
- [FastRTC Audio Streaming](https://fastrtc.org/userguide/audio/)

## License

[Add your license here]

## Contributing

[Add contribution guidelines here]
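## Appendix: Parsing Data Channel Messages

As a companion to the Message Types section above, here is a minimal client-side dispatcher for data-channel messages. It is a sketch under the message format this README documents; the function name is illustrative, not part of FastRTC:

```python
import json


def handle_server_message(raw):
    """Parse one data-channel message and classify it.

    Returns a (kind, payload) pair, where kind is one of the
    documented message types: "fetch_output" (a transcription),
    "log", "warning", or "error".
    """
    msg = json.loads(raw)
    kind = msg.get("type")
    data = msg.get("data", "")
    if kind == "fetch_output":
        # Transcription text, suitable for display in the UI
        return ("fetch_output", data)
    if kind in ("log", "warning", "error"):
        return (kind, data)
    raise ValueError("unknown message type: %r" % kind)
```

For example, `handle_server_message('{"type": "log", "data": "pause_detected"}')` returns `("log", "pause_detected")`, which a client might route to its debug console rather than the transcript view.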