ONNX Whisper Javanese ASR Model
Optimized ONNX version of Whisper Large V2 fine-tuned for Javanese (Basa Jawa) speech recognition. This model provides fast, CPU-friendly inference for Javanese ASR tasks.
Model Information
- Base Model: Whisper Large V2
- Language: Javanese (Basa Jawa)
- Task: Automatic Speech Recognition (ASR)
- Format: ONNX
- Deployment: HuggingFace Inference Endpoints
Features
- Optimized ONNX inference (~3x faster than PyTorch)
- CPU-friendly deployment
- Multiple audio format support (WAV, MP3, FLAC, M4A, OGG)
- Automatic audio preprocessing (resampling, mono conversion)
- JSON output with metadata
- Custom handler for HF Inference Endpoints
Supported Audio Formats
- WAV (.wav)
- MP3 (.mp3)
- FLAC (.flac)
- M4A (.m4a)
- OGG (.ogg)
Audio files are automatically:
- Converted to mono
- Resampled to 16kHz
- Normalized
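For reference, the snippet below sketches roughly equivalent preprocessing with librosa (already listed in requirements.txt). It is an approximation of what handler.py does internally, not its exact code; the peak-normalization step in particular is an assumption.

```python
# Sketch of preprocessing comparable to what the handler performs
# (assumed behavior, not the handler's exact code).
import librosa
import numpy as np

def preprocess_audio(path: str, target_sr: int = 16000) -> np.ndarray:
    # librosa downmixes to mono and resamples to 16 kHz in one call
    audio, _ = librosa.load(path, sr=target_sr, mono=True)
    # Peak-normalize to [-1, 1]; guard against silent clips
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    return audio

waveform = preprocess_audio("audio.wav")
print(waveform.shape, waveform.dtype)
```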
Usage
Option 1: HuggingFace Inference API (Python)
```python
import requests
import base64

# Read your audio file
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()

# Encode to base64
audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

# API endpoint (replace with your endpoint URL after deployment)
API_URL = "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv"

# Your HuggingFace API token
headers = {
    "Authorization": "Bearer YOUR_HF_TOKEN"
}

# Make request
response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": audio_base64,
        "parameters": {
            "max_length": 448,
            "return_timestamps": False
        }
    }
)

# Get result
result = response.json()
print(f"Transcription: {result['transcription']}")
print(f"Duration: {result['metadata']['audio_duration_seconds']:.2f}s")
```
Option 2: Using cURL
```bash
# Encode audio file to base64
AUDIO_BASE64=$(base64 -i audio.wav)

# Make API request
curl -X POST \
  https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv \
  -H "Authorization: Bearer YOUR_HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"inputs\": \"$AUDIO_BASE64\"}"
```
Option 3: HuggingFace Hub (Python Client)
```python
from huggingface_hub import InferenceClient
import base64

client = InferenceClient(token="YOUR_HF_TOKEN")

# Read and encode audio
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()
audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

# Transcribe
result = client.post(
    json={"inputs": audio_base64},
    model="adithyafp/onnx-whisper-jv"
)
print(result)
```
Option 4: JavaScript/TypeScript
```javascript
async function transcribeAudio(audioFile) {
  // Read audio file
  const audioBuffer = await audioFile.arrayBuffer();
  const audioBase64 = btoa(
    String.fromCharCode(...new Uint8Array(audioBuffer))
  );

  // API request
  const response = await fetch(
    "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv",
    {
      method: "POST",
      headers: {
        "Authorization": "Bearer YOUR_HF_TOKEN",
        "Content-Type": "application/json"
      },
      body: JSON.stringify({
        inputs: audioBase64,
        parameters: {
          max_length: 448
        }
      })
    }
  );

  const result = await response.json();
  console.log("Transcription:", result.transcription);
  return result;
}

// Usage
const audioFile = document.getElementById('audioInput').files[0];
transcribeAudio(audioFile);
```
Response Format
The API returns a JSON response with the following structure:
```json
{
  "transcription": "Sugeng enjing, kepiye kabare?",
  "language": "javanese",
  "status": "success",
  "metadata": {
    "audio_duration_seconds": 3.52,
    "num_tokens": 12,
    "model": "whisper-large-v2-jv-onnx"
  }
}
```
Response Fields
- `transcription` (string): The transcribed text in Javanese
- `language` (string): Source language ("javanese")
- `status` (string): Request status ("success" or "error")
- `metadata` (object):
  - `audio_duration_seconds` (float): Duration of input audio
  - `num_tokens` (int): Number of tokens generated
  - `model` (string): Model identifier
Parameters
You can customize the inference with optional parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| `max_length` | int | 448 | Maximum length of generated tokens |
| `return_timestamps` | bool | false | Return word-level timestamps (future) |
| `return_token_ids` | bool | false | Include raw token IDs in response |
Example with parameters:
```python
response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": audio_base64,
        "parameters": {
            "max_length": 448,
            "return_token_ids": True
        }
    }
)
```
Error Handling
The API returns error responses in the following format:
```json
{
  "error": "Error message here",
  "status": "error",
  "message": "An error occurred during transcription"
}
```
Common errors:
- Invalid audio format
- Invalid base64 encoding
- Audio file too large (>10MB recommended limit)
- Missing input data
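Since both success and error payloads share the `status` field, a client can branch on it (and on the HTTP status code). A minimal sketch, reusing the request pattern from Option 1:

```python
# Minimal client-side error handling sketch, based on the response formats above.
import base64
import requests

API_URL = "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

with open("audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode("utf-8")

response = requests.post(API_URL, headers=headers, json={"inputs": audio_base64}, timeout=120)
result = response.json()

if response.ok and result.get("status") == "success":
    print("Transcription:", result["transcription"])
else:
    # Error payloads carry "error" and "message" fields
    print("Request failed:", result.get("error", response.text))
```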
Model Files
This repository contains:
- `encoder_model.onnx` + `encoder_model.onnx_data` - ONNX encoder (2.4GB)
- `decoder_model.onnx` + `decoder_model.onnx_data` - ONNX decoder (3.6GB)
- `decoder_with_past_model.onnx` + `decoder_with_past_model.onnx_data` - ONNX decoder with KV cache (3.2GB)
- `tokenizer.json` - Whisper tokenizer
- `preprocessor_config.json` - Audio preprocessing config
- `config.json` - Model configuration
- `generation_config.json` - Generation parameters
- `handler.py` - Custom inference handler
- `requirements.txt` - Python dependencies
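If you want to run these ONNX files directly instead of going through the custom handler, one option is the ONNX Runtime wrapper from optimum. The sketch below assumes you install `optimum[onnxruntime]` (not listed in requirements.txt) and that the tokenizer/preprocessor configs in this repository load with `WhisperProcessor`:

```python
# Sketch: running the repo's ONNX files locally via optimum
# (assumes `pip install optimum[onnxruntime]`, which is not in requirements.txt).
import librosa
from transformers import WhisperProcessor
from optimum.onnxruntime import ORTModelForSpeechSeq2Seq

model_dir = "."  # local clone of this repository
processor = WhisperProcessor.from_pretrained(model_dir)
model = ORTModelForSpeechSeq2Seq.from_pretrained(model_dir)

# 16 kHz mono input, as described under "Supported Audio Formats"
audio, _ = librosa.load("test_audio.wav", sr=16000, mono=True)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

generated_ids = model.generate(inputs.input_features, max_length=448)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```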
Performance
- Inference Speed: ~2-4x faster than PyTorch (CPU)
- Memory Usage: ~6GB RAM for loading models
- Latency: ~1-2s for 30s audio (depends on CPU)
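These figures depend heavily on the host CPU. To measure latency on your own hardware, you can time the local handler (see "Local Testing (Before Deployment)" below); a small sketch:

```python
# Rough latency check using the local handler described in "Local Testing".
import base64
import time
from handler import EndpointHandler

handler = EndpointHandler(path=".")

with open("test_audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

start = time.perf_counter()
result = handler({"inputs": audio_base64, "parameters": {"max_length": 448}})
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f}s to transcribe {result['metadata']['audio_duration_seconds']:.2f}s of audio")
```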
Deployment to HuggingFace Inference Endpoints
Step 1: Upload Model Files
```bash
# Install Git LFS
git lfs install

# Clone your repository
git clone https://huggingface.co/adithyafp/onnx-whisper-jv
cd onnx-whisper-jv

# Add all files
git add .
git commit -m "Add ONNX model with custom handler"
git push
```
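If you prefer not to manage Git LFS yourself, the same upload can be scripted with huggingface_hub. This is a sketch; it assumes you have run `huggingface-cli login` or pass a token explicitly:

```python
# Alternative to the git/LFS workflow: push the folder with huggingface_hub.
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` unless passed explicitly
api.upload_folder(
    folder_path=".",                      # local directory containing the model files
    repo_id="adithyafp/onnx-whisper-jv",  # target model repository
    repo_type="model",
    commit_message="Add ONNX model with custom handler",
)
```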
Step 2: Create Inference Endpoint
- Go to HuggingFace Inference Endpoints
- Click "Create Endpoint"
- Select your model: adithyafp/onnx-whisper-jv
- Choose instance type: CPU (Medium or Large recommended)
- Deploy!
Step 3: Test Your Endpoint
```python
import requests
import base64

# Your endpoint URL (from HF dashboard)
ENDPOINT_URL = "https://xxxxxxxx.endpoints.huggingface.cloud"

with open("test_audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": "Bearer YOUR_HF_TOKEN"},
    json={"inputs": audio_base64}
)
print(response.json())
```
Local Testing (Before Deployment)
Test the handler locally before deploying:
```python
from handler import EndpointHandler
import base64

# Initialize handler
handler = EndpointHandler(path=".")

# Load test audio
with open("test_audio.wav", "rb") as f:
    audio_bytes = f.read()
audio_base64 = base64.b64encode(audio_bytes).decode()

# Test inference
result = handler({
    "inputs": audio_base64,
    "parameters": {"max_length": 448}
})
print(result)
```
Requirements
```
onnxruntime>=1.16.0
transformers>=4.30.0
numpy>=1.24.0
librosa>=0.10.0
soundfile>=0.12.1
```
Citation
If you use this model, please cite:
```bibtex
@misc{whisper-jv-onnx-2024,
  author = {adithyafp},
  title = {ONNX Whisper Javanese ASR Model},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/adithyafp/onnx-whisper-jv}
}
```
License
Apache 2.0
Links
- Model on HuggingFace
- Whisper Paper
- ONNX Runtime
Support
For issues or questions:
- Open an issue on the model repository
- Contact: [Your contact information]
Made with ❤️ for the Javanese language community