# ONNX Whisper Javanese ASR Model

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![ONNX](https://img.shields.io/badge/ONNX-1.16+-green.svg)](https://onnx.ai/)
[![Transformers](https://img.shields.io/badge/Transformers-4.30+-orange.svg)](https://huggingface.co/transformers/)

Optimized ONNX version of Whisper Large V2 fine-tuned for Javanese (Basa Jawa) speech recognition. This model provides fast, CPU-friendly inference for Javanese ASR tasks.

## Model Information

- **Base Model**: Whisper Large V2
- **Language**: Javanese (Basa Jawa)
- **Task**: Automatic Speech Recognition (ASR)
- **Format**: ONNX
- **Deployment**: HuggingFace Inference Endpoints

## Features

- ✅ Optimized ONNX inference (~2-4x faster than PyTorch on CPU)
- ✅ CPU-friendly deployment
- ✅ Multiple audio format support (WAV, MP3, FLAC, M4A, OGG)
- ✅ Automatic audio preprocessing (resampling, mono conversion)
- ✅ JSON output with metadata
- ✅ Custom handler for HF Inference Endpoints

## Supported Audio Formats

- WAV (`.wav`)
- MP3 (`.mp3`)
- FLAC (`.flac`)
- M4A (`.m4a`)
- OGG (`.ogg`)

Audio files are automatically:

- Converted to mono
- Resampled to 16kHz
- Normalized

(An optional sketch of these preprocessing steps for local use appears at the end of the Usage section.)

## Usage

### Option 1: HuggingFace Inference API (Python)

```python
import requests
import base64

# Read your audio file
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()

# Encode to base64
audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

# API endpoint (replace with your endpoint URL after deployment)
API_URL = "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv"

# Your HuggingFace API token
headers = {
    "Authorization": "Bearer YOUR_HF_TOKEN"
}

# Make request
response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": audio_base64,
        "parameters": {
            "max_length": 448,
            "return_timestamps": False
        }
    }
)

# Get result
result = response.json()
print(f"Transcription: {result['transcription']}")
print(f"Duration: {result['metadata']['audio_duration_seconds']:.2f}s")
```

### Option 2: Using cURL

```bash
# Encode the audio file to base64 without line wrapping
# (GNU coreutils: `base64 -w 0 audio.wav`; macOS: `base64 -i audio.wav`)
AUDIO_BASE64=$(base64 -w 0 audio.wav)

# Make API request
curl -X POST \
  https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv \
  -H "Authorization: Bearer YOUR_HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"inputs\": \"$AUDIO_BASE64\"}"
```

### Option 3: HuggingFace Hub (Python Client)

```python
from huggingface_hub import InferenceClient
import base64

client = InferenceClient(token="YOUR_HF_TOKEN")

# Read and encode audio
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()
audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

# Transcribe
result = client.post(
    json={"inputs": audio_base64},
    model="adithyafp/onnx-whisper-jv"
)
print(result)
```

### Option 4: JavaScript/TypeScript

```javascript
async function transcribeAudio(audioFile) {
  // Read the audio file and base64-encode it in chunks
  // (spreading a large Uint8Array into String.fromCharCode can overflow the call stack)
  const audioBuffer = await audioFile.arrayBuffer();
  const bytes = new Uint8Array(audioBuffer);
  let binary = "";
  const chunkSize = 0x8000;
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode(...bytes.subarray(i, i + chunkSize));
  }
  const audioBase64 = btoa(binary);

  // API request
  const response = await fetch(
    "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv",
    {
      method: "POST",
      headers: {
        "Authorization": "Bearer YOUR_HF_TOKEN",
        "Content-Type": "application/json"
      },
      body: JSON.stringify({
        inputs: audioBase64,
        parameters: { max_length: 448 }
      })
    }
  );

  const result = await response.json();
  console.log("Transcription:", result.transcription);
  return result;
}

// Usage
const audioFile = document.getElementById('audioInput').files[0];
transcribeAudio(audioFile);
```
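### Optional: Reproducing the Audio Preprocessing Locally

All of the preprocessing described above (mono conversion, resampling to 16kHz, normalization) happens inside the custom handler, so clients only need to send the raw file. For local debugging it can still be useful to reproduce the same steps before encoding; the sketch below uses librosa, which is already listed in the requirements. `preprocess_audio` is just an illustrative helper, not part of the handler's API.

```python
import librosa
import numpy as np

def preprocess_audio(path: str, target_sr: int = 16000) -> np.ndarray:
    """Reproduce the handler's preprocessing: mono, 16 kHz, peak-normalized."""
    # librosa.load converts to mono and resamples in one call (float32 output)
    audio, _ = librosa.load(path, sr=target_sr, mono=True)
    # Peak-normalize so the loudest sample has magnitude 1.0
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    return audio

waveform = preprocess_audio("audio.wav")
print(f"{len(waveform) / 16000:.2f}s of 16 kHz mono audio")
```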
## Response Format

The API returns a JSON response with the following structure:

```json
{
  "transcription": "Sugeng enjing, kepiye kabare?",
  "language": "javanese",
  "status": "success",
  "metadata": {
    "audio_duration_seconds": 3.52,
    "num_tokens": 12,
    "model": "whisper-large-v2-jv-onnx"
  }
}
```

### Response Fields

- `transcription` (string): The transcribed text in Javanese
- `language` (string): Source language ("javanese")
- `status` (string): Request status ("success" or "error")
- `metadata` (object):
  - `audio_duration_seconds` (float): Duration of the input audio
  - `num_tokens` (int): Number of tokens generated
  - `model` (string): Model identifier

## Parameters

You can customize inference with optional parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `max_length` | int | 448 | Maximum length of generated tokens |
| `return_timestamps` | bool | false | Return word-level timestamps (planned, not yet supported) |
| `return_token_ids` | bool | false | Include raw token IDs in the response |

Example with parameters:

```python
response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": audio_base64,
        "parameters": {
            "max_length": 448,
            "return_token_ids": True
        }
    }
)
```

## Error Handling

The API returns error responses in the following format:

```json
{
  "error": "Error message here",
  "status": "error",
  "message": "An error occurred during transcription"
}
```

Common errors:

- Invalid audio format
- Invalid base64 encoding
- Audio file too large (>10MB recommended limit)
- Missing input data

## Model Files

This repository contains:

- `encoder_model.onnx` + `encoder_model.onnx_data` - ONNX encoder (2.4GB)
- `decoder_model.onnx` + `decoder_model.onnx_data` - ONNX decoder (3.6GB)
- `decoder_with_past_model.onnx` + `decoder_with_past_model.onnx_data` - ONNX decoder with KV cache (3.2GB)
- `tokenizer.json` - Whisper tokenizer
- `preprocessor_config.json` - Audio preprocessing config
- `config.json` - Model configuration
- `generation_config.json` - Generation parameters
- `handler.py` - Custom inference handler
- `requirements.txt` - Python dependencies

## Performance

- **Inference Speed**: ~2-4x faster than PyTorch (CPU)
- **Memory Usage**: ~6GB RAM to load the models
- **Latency**: ~1-2s for 30s of audio (depends on CPU)

## Deployment to HuggingFace Inference Endpoints

### Step 1: Upload Model Files

```bash
# Install Git LFS
git lfs install

# Clone your repository
git clone https://huggingface.co/adithyafp/onnx-whisper-jv
cd onnx-whisper-jv

# Make sure the large ONNX weights are tracked by LFS
git lfs track "*.onnx" "*.onnx_data"

# Copy the model files, handler.py, and requirements.txt into the repo, then add everything
git add .
git commit -m "Add ONNX model with custom handler"
git push
```

### Step 2: Create Inference Endpoint

1. Go to [HuggingFace Inference Endpoints](https://ui.endpoints.huggingface.co/)
2. Click "Create Endpoint"
3. Select your model: `adithyafp/onnx-whisper-jv`
4. Choose instance type: CPU (Medium or Large recommended)
5. Deploy!
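You can also create the endpoint programmatically with `huggingface_hub`. The sketch below uses `create_inference_endpoint`; the vendor, region, and instance values shown are placeholders and must be replaced with options available to your account (check the Endpoints UI for valid choices):

```python
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "onnx-whisper-jv",                    # endpoint name
    repository="adithyafp/onnx-whisper-jv",
    framework="pytorch",                  # container framework; the custom handler.py still drives inference
    task="automatic-speech-recognition",
    accelerator="cpu",
    vendor="aws",                         # placeholder: pick your cloud vendor
    region="us-east-1",                   # placeholder: pick your region
    instance_size="x4",                   # placeholder: pick an available CPU size
    instance_type="intel-icl",            # placeholder: pick an available CPU type
    token="YOUR_HF_TOKEN",
)

endpoint.wait()  # block until the endpoint reaches the "running" state
print(endpoint.url)
```

Either way, once the endpoint reports a running state you can send requests to its URL as shown in Step 3.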
### Step 3: Test Your Endpoint

```python
import requests
import base64

# Your endpoint URL (from the HF dashboard)
ENDPOINT_URL = "https://xxxxxxxx.endpoints.huggingface.cloud"

with open("test_audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": "Bearer YOUR_HF_TOKEN"},
    json={"inputs": audio_base64}
)

print(response.json())
```

## Local Testing (Before Deployment)

Test the handler locally before deploying:

```python
from handler import EndpointHandler
import base64

# Initialize handler
handler = EndpointHandler(path=".")

# Load test audio
with open("test_audio.wav", "rb") as f:
    audio_bytes = f.read()
audio_base64 = base64.b64encode(audio_bytes).decode()

# Test inference
result = handler({
    "inputs": audio_base64,
    "parameters": {"max_length": 448}
})

print(result)
```

## Requirements

```txt
onnxruntime>=1.16.0
transformers>=4.30.0
numpy>=1.24.0
librosa>=0.10.0
soundfile>=0.12.1
```

## Citation

If you use this model, please cite:

```bibtex
@misc{whisper-jv-onnx-2024,
  author = {adithyafp},
  title = {ONNX Whisper Javanese ASR Model},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/adithyafp/onnx-whisper-jv}
}
```

## License

Apache 2.0

## Links

- 🤗 [Model on HuggingFace](https://huggingface.co/adithyafp/onnx-whisper-jv)
- 📝 [Whisper Paper](https://arxiv.org/abs/2212.04356)
- 🔧 [ONNX Runtime](https://onnxruntime.ai/)

## Support

For issues or questions:

- Open an issue on the [model repository](https://huggingface.co/adithyafp/onnx-whisper-jv/discussions)
- Contact: [Your contact information]

---

Made with ❤️ for the Javanese language community