ONNX Whisper Javanese ASR Model
Optimized ONNX version of Whisper Large V2 fine-tuned for Javanese (Basa Jawa) speech recognition. This model provides fast, CPU-friendly inference for Javanese ASR tasks.
Model Information
- Base Model: Whisper Large V2
- Language: Javanese (Basa Jawa)
- Task: Automatic Speech Recognition (ASR)
- Format: ONNX
- Deployment: HuggingFace Inference Endpoints
Features
- Optimized ONNX inference (~3x faster than PyTorch)
- CPU-friendly deployment
- Multiple audio format support (WAV, MP3, FLAC, M4A, OGG)
- Automatic audio preprocessing (resampling, mono conversion)
- JSON output with metadata
- Custom handler for HF Inference Endpoints
Supported Audio Formats
- WAV (.wav)
- MP3 (.mp3)
- FLAC (.flac)
- M4A (.m4a)
- OGG (.ogg)
Audio files are automatically:
- Converted to mono
- Resampled to 16kHz
- Normalized
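For reference, the snippet below sketches roughly equivalent preprocessing with librosa (already listed in requirements.txt). It is an approximation of what handler.py does internally, not its exact code; the peak-normalization step in particular is an assumption.

```python
# Sketch of preprocessing comparable to what the handler performs
# (assumed behavior, not the handler's exact code).
import librosa
import numpy as np

def preprocess_audio(path: str, target_sr: int = 16000) -> np.ndarray:
    # librosa downmixes to mono and resamples to 16 kHz in one call
    audio, _ = librosa.load(path, sr=target_sr, mono=True)
    # Peak-normalize to [-1, 1]; guard against silent clips
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    return audio

waveform = preprocess_audio("audio.wav")
print(waveform.shape, waveform.dtype)
```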
Usage
Option 1: HuggingFace Inference API (Python)
```python
import requests
import base64

# Read your audio file
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()

# Encode to base64
audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

# API endpoint (replace with your endpoint URL after deployment)
API_URL = "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv"

# Your HuggingFace API token
headers = {
    "Authorization": "Bearer YOUR_HF_TOKEN"
}

# Make request
response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": audio_base64,
        "parameters": {
            "max_length": 448,
            "return_timestamps": False
        }
    }
)

# Get result
result = response.json()
print(f"Transcription: {result['transcription']}")
print(f"Duration: {result['metadata']['audio_duration_seconds']:.2f}s")
```
Option 2: Using cURL
```bash
# Encode audio file to base64
AUDIO_BASE64=$(base64 -i audio.wav)

# Make API request
curl -X POST \
  https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv \
  -H "Authorization: Bearer YOUR_HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"inputs\": \"$AUDIO_BASE64\"}"
```
Option 3: HuggingFace Hub (Python Client)
```python
from huggingface_hub import InferenceClient
import base64

client = InferenceClient(token="YOUR_HF_TOKEN")

# Read and encode audio
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()
audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

# Transcribe
result = client.post(
    json={"inputs": audio_base64},
    model="adithyafp/onnx-whisper-jv"
)
print(result)
```
Option 4: JavaScript/TypeScript
```javascript
async function transcribeAudio(audioFile) {
  // Read audio file
  const audioBuffer = await audioFile.arrayBuffer();
  const audioBase64 = btoa(
    String.fromCharCode(...new Uint8Array(audioBuffer))
  );

  // API request
  const response = await fetch(
    "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv",
    {
      method: "POST",
      headers: {
        "Authorization": "Bearer YOUR_HF_TOKEN",
        "Content-Type": "application/json"
      },
      body: JSON.stringify({
        inputs: audioBase64,
        parameters: {
          max_length: 448
        }
      })
    }
  );

  const result = await response.json();
  console.log("Transcription:", result.transcription);
  return result;
}

// Usage
const audioFile = document.getElementById('audioInput').files[0];
transcribeAudio(audioFile);
```
Response Format
The API returns a JSON response with the following structure:
```json
{
  "transcription": "Sugeng enjing, kepiye kabare?",
  "language": "javanese",
  "status": "success",
  "metadata": {
    "audio_duration_seconds": 3.52,
    "num_tokens": 12,
    "model": "whisper-large-v2-jv-onnx"
  }
}
```
Response Fields
- `transcription` (string): The transcribed text in Javanese
- `language` (string): Source language ("javanese")
- `status` (string): Request status ("success" or "error")
- `metadata` (object):
  - `audio_duration_seconds` (float): Duration of input audio
  - `num_tokens` (int): Number of tokens generated
  - `model` (string): Model identifier
Parameters
You can customize the inference with optional parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_length` | int | 448 | Maximum length of generated tokens |
| `return_timestamps` | bool | false | Return word-level timestamps (future) |
| `return_token_ids` | bool | false | Include raw token IDs in response |
Example with parameters:
```python
response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": audio_base64,
        "parameters": {
            "max_length": 448,
            "return_token_ids": True
        }
    }
)
```
Error Handling
The API returns error responses in the following format:
```json
{
  "error": "Error message here",
  "status": "error",
  "message": "An error occurred during transcription"
}
```
Common errors:
- Invalid audio format
- Invalid base64 encoding
- Audio file too large (>10MB recommended limit)
- Missing input data
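Since both success and error payloads share the `status` field, a client can branch on it (and on the HTTP status code). A minimal sketch, reusing the request pattern from Option 1:

```python
# Minimal client-side error handling sketch, based on the response formats above.
import base64
import requests

API_URL = "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

with open("audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(API_URL, headers=headers, json={"inputs": audio_base64}, timeout=120)
result = response.json()

if response.ok and result.get("status") == "success":
    print("Transcription:", result["transcription"])
else:
    # Error payloads carry "error" and "message" fields
    print("Request failed:", result.get("error", response.text))
```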
Model Files
This repository contains:
- `encoder_model.onnx` + `encoder_model.onnx_data` - ONNX encoder (2.4GB)
- `decoder_model.onnx` + `decoder_model.onnx_data` - ONNX decoder (3.6GB)
- `decoder_with_past_model.onnx` + `decoder_with_past_model.onnx_data` - ONNX decoder with KV cache (3.2GB)
- `tokenizer.json` - Whisper tokenizer
- `preprocessor_config.json` - Audio preprocessing config
- `config.json` - Model configuration
- `generation_config.json` - Generation parameters
- `handler.py` - Custom inference handler
- `requirements.txt` - Python dependencies
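If you want to run these ONNX files directly instead of going through the custom handler, one option is the ONNX Runtime wrapper from optimum. The sketch below assumes you install `optimum[onnxruntime]` (not listed in requirements.txt) and that the tokenizer/preprocessor configs in this repository load with `WhisperProcessor`:

```python
# Sketch: running the repo's ONNX files locally via optimum
# (assumes `pip install optimum[onnxruntime]`, which is not in requirements.txt).
import librosa
from transformers import WhisperProcessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_dir = "."  # local clone of this repository
processor = WhisperProcessor.from_pretrained(model_dir)
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_dir)

# 16 kHz mono input, as described under "Supported Audio Formats"
audio, _ = librosa.load("test_audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

generated_ids = model.generate(inputs.input_features, max_length=448)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```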
Performance
- Inference Speed: ~2-4x faster than PyTorch (CPU)
- Memory Usage: ~6GB RAM for loading models
- Latency: ~1-2s for 30s audio (depends on CPU)
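These figures depend heavily on the host CPU. To measure latency on your own hardware, you can time the local handler (see "Local Testing (Before Deployment)" below); a small sketch:

```python
# Rough latency check using the local handler described in "Local Testing".
import base64
import time
from handler import EndpointHandler

handler = EndpointHandler(path=".")

with open("test_audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

start = time.perf_counter()
result = handler({"inputs": audio_base64, "parameters": {"max_length": 448}})
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f}s to transcribe {result['metadata']['audio_duration_seconds']:.2f}s of audio")
```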
Deployment to HuggingFace Inference Endpoints
Step 1: Upload Model Files
```bash
# Install Git LFS
git lfs install

# Clone your repository
git clone https://huggingface.co/adithyafp/onnx-whisper-jv
cd onnx-whisper-jv

# Add all files
git add .
git commit -m "Add ONNX model with custom handler"
git push
```
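If you prefer not to manage Git LFS yourself, the same upload can be scripted with huggingface_hub. This is a sketch; it assumes you have run `huggingface-cli login` or pass a token explicitly:

```python
# Alternative to the git/LFS workflow: push the folder with huggingface_hub.
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` unless passed explicitly
api.upload_folder(
    folder_path=".",                      # local directory containing the model files
    repo_id="adithyafp/onnx-whisper-jv",  # target model repository
    repo_type="model",
    commit_message="Add ONNX model with custom handler",
)
```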
Step 2: Create Inference Endpoint
- Go to HuggingFace Inference Endpoints
- Click "Create Endpoint"
- Select your model: adithyafp/onnx-whisper-jv
- Choose instance type: CPU (Medium or Large recommended)
- Deploy!
Step 3: Test Your Endpoint
```python
import requests
import base64

# Your endpoint URL (from HF dashboard)
ENDPOINT_URL = "https://xxxxxxxx.endpoints.huggingface.cloud"

with open("test_audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": "Bearer YOUR_HF_TOKEN"},
    json={"inputs": audio_base64}
)
print(response.json())
```
Local Testing (Before Deployment)
Test the handler locally before deploying:
```python
from handler import EndpointHandler
import base64

# Initialize handler
handler = EndpointHandler(path=".")

# Load test audio
with open("test_audio.wav", "rb") as f:
    audio_bytes = f.read()
audio_base64 = base64.b64encode(audio_bytes).decode()

# Test inference
result = handler({
    "inputs": audio_base64,
    "parameters": {"max_length": 448}
})
print(result)
```
Requirements
```
onnxruntime>=1.16.0
transformers>=4.30.0
numpy>=1.24.0
librosa>=0.10.0
soundfile>=0.12.1
```
Citation
If you use this model, please cite:
```bibtex
@misc{whisper-jv-onnx-2024,
  author = {adithyafp},
  title = {ONNX Whisper Javanese ASR Model},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/adithyafp/onnx-whisper-jv}
}
```
License
Apache 2.0
Links
- Model on HuggingFace
- Whisper Paper
- ONNX Runtime
Support
For issues or questions:
- Open an issue on the model repository
- Contact: [Your contact information]
Made with ❤️ for the Javanese language community