
ONNX Whisper Javanese ASR Model


Optimized ONNX version of Whisper Large V2 fine-tuned for Javanese (Basa Jawa) speech recognition. This model provides fast, CPU-friendly inference for Javanese ASR tasks.

Model Information

  • Base Model: Whisper Large V2
  • Language: Javanese (Basa Jawa)
  • Task: Automatic Speech Recognition (ASR)
  • Format: ONNX
  • Deployment: HuggingFace Inference Endpoints

Features

✅ Optimized ONNX inference (~2-4x faster than PyTorch on CPU)
✅ CPU-friendly deployment
✅ Multiple audio format support (WAV, MP3, FLAC, M4A, OGG)
✅ Automatic audio preprocessing (resampling, mono conversion)
✅ JSON output with metadata
✅ Custom handler for HF Inference Endpoints

Supported Audio Formats

  • WAV (.wav)
  • MP3 (.mp3)
  • FLAC (.flac)
  • M4A (.m4a)
  • OGG (.ogg)

Audio files are automatically (a sketch of the equivalent steps follows this list):

  • Converted to mono
  • Resampled to 16kHz
  • Normalized
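For reference, here is a minimal client-side sketch of the same preprocessing, assuming librosa and numpy (both listed in the requirements); the deployed handler performs these steps for you, and its exact normalization scheme may differ:

import librosa
import numpy as np

# Load as 16 kHz mono, matching what the endpoint expects internally
audio, sr = librosa.load("audio.wav", sr=16000, mono=True)

# Peak-normalize (illustrative only; the handler's scheme may differ)
audio = audio / max(float(np.abs(audio).max()), 1e-8)

print(f"{len(audio) / sr:.2f}s of 16 kHz mono audio")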

Usage

Option 1: HuggingFace Inference API (Python)

import requests
import base64

# Read your audio file
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()

# Encode to base64
audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

# API endpoint (replace with your endpoint URL after deployment)
API_URL = "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv"

# Your HuggingFace API token
headers = {
    "Authorization": "Bearer YOUR_HF_TOKEN"
}

# Make request
response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": audio_base64,
        "parameters": {
            "max_length": 448,
            "return_timestamps": False
        }
    }
)

# Get result
result = response.json()
print(f"Transcription: {result['transcription']}")
print(f"Duration: {result['metadata']['audio_duration_seconds']:.2f}s")

Option 2: Using cURL

# Encode audio file to base64 without line wrapping
# (GNU coreutils: -w 0; on macOS use `base64 -i audio.wav`)
AUDIO_BASE64=$(base64 -w 0 audio.wav)

# Make API request
curl -X POST \
  https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv \
  -H "Authorization: Bearer YOUR_HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"inputs\": \"$AUDIO_BASE64\"}"

Option 3: HuggingFace Hub (Python Client)

from huggingface_hub import InferenceClient
import base64

client = InferenceClient(token="YOUR_HF_TOKEN")

# Read and encode audio
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()

audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

# Transcribe
result = client.post(
    json={"inputs": audio_base64},
    model="adithyafp/onnx-whisper-jv"
)

print(result)

Option 4: JavaScript/TypeScript

async function transcribeAudio(audioFile) {
  // Read audio file and base64-encode it in chunks so large files
  // don't overflow the call stack in String.fromCharCode
  const bytes = new Uint8Array(await audioFile.arrayBuffer());
  let binary = "";
  for (let i = 0; i < bytes.length; i += 0x8000) {
    binary += String.fromCharCode(...bytes.subarray(i, i + 0x8000));
  }
  const audioBase64 = btoa(binary);
  
  // API request
  const response = await fetch(
    "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv",
    {
      method: "POST",
      headers: {
        "Authorization": "Bearer YOUR_HF_TOKEN",
        "Content-Type": "application/json"
      },
      body: JSON.stringify({
        inputs: audioBase64,
        parameters: {
          max_length: 448
        }
      })
    }
  );
  
  const result = await response.json();
  console.log("Transcription:", result.transcription);
  return result;
}

// Usage
const audioFile = document.getElementById('audioInput').files[0];
transcribeAudio(audioFile);

Response Format

The API returns a JSON response with the following structure:

{
  "transcription": "Sugeng enjing, kepiye kabare?",
  "language": "javanese",
  "status": "success",
  "metadata": {
    "audio_duration_seconds": 3.52,
    "num_tokens": 12,
    "model": "whisper-large-v2-jv-onnx"
  }
}

Response Fields

  • transcription (string): The transcribed text in Javanese
  • language (string): Source language ("javanese")
  • status (string): Request status ("success" or "error")
  • metadata (object):
    • audio_duration_seconds (float): Duration of input audio
    • num_tokens (int): Number of tokens generated
    • model (string): Model identifier

Parameters

You can customize the inference with optional parameters:

Parameter           Type   Default   Description
max_length          int    448       Maximum length of generated tokens
return_timestamps   bool   false     Return word-level timestamps (planned, not yet supported)
return_token_ids    bool   false     Include raw token IDs in the response
Example with parameters:

response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": audio_base64,
        "parameters": {
            "max_length": 448,
            "return_token_ids": True
        }
    }
)

Error Handling

The API returns error responses in the following format:

{
  "error": "Error message here",
  "status": "error",
  "message": "An error occurred during transcription"
}

Common errors:

  • Invalid audio format
  • Invalid base64 encoding
  • Audio file too large (>10MB recommended limit)
  • Missing input data
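A minimal client-side pattern for catching these errors (a sketch reusing API_URL, headers, and audio_base64 from Option 1; field names follow the error format above):

import requests

response = requests.post(API_URL, headers=headers, json={"inputs": audio_base64})
result = response.json()

if response.status_code != 200 or result.get("status") == "error":
    # Surface whichever error field the endpoint returned
    print(f"Transcription failed: {result.get('error') or result.get('message')}")
else:
    print(f"Transcription: {result['transcription']}")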

Model Files

This repository contains:

  • encoder_model.onnx + encoder_model.onnx_data - ONNX encoder (2.4GB)
  • decoder_model.onnx + decoder_model.onnx_data - ONNX decoder (3.6GB)
  • decoder_with_past_model.onnx + decoder_with_past_model.onnx_data - ONNX decoder with KV cache (3.2GB)
  • tokenizer.json - Whisper tokenizer
  • preprocessor_config.json - Audio preprocessing config
  • config.json - Model configuration
  • generation_config.json - Generation parameters
  • handler.py - Custom inference handler
  • requirements.txt - Python dependencies
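To inspect the exported graphs locally with plain onnxruntime, something like the following should work (a sketch; the .onnx_data external-weight files are picked up automatically as long as they sit next to the matching .onnx file):

import onnxruntime as ort

# Open the encoder and decoder graphs on CPU
encoder = ort.InferenceSession("encoder_model.onnx", providers=["CPUExecutionProvider"])
decoder = ort.InferenceSession("decoder_model.onnx", providers=["CPUExecutionProvider"])

# Inspect expected inputs (typically input_features for the encoder,
# input_ids and encoder_hidden_states for the decoder)
print([i.name for i in encoder.get_inputs()])
print([i.name for i in decoder.get_inputs()])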

Performance

  • Inference Speed: ~2-4x faster than PyTorch (CPU)
  • Memory Usage: ~6GB RAM for loading models
  • Latency: ~1-2s for 30s audio (depends on CPU)

Deployment to HuggingFace Inference Endpoints

Step 1: Upload Model Files

# Install Git LFS
git lfs install

# Clone your repository
git clone https://huggingface.co/adithyafp/onnx-whisper-jv
cd onnx-whisper-jv

# Copy the exported ONNX files, handler.py and configs into this directory,
# then make sure the large weight files are tracked by LFS
# (skip the track step if your .gitattributes already covers them)
git lfs track "*.onnx" "*.onnx_data"

# Add all files
git add .
git commit -m "Add ONNX model with custom handler"
git push

Step 2: Create Inference Endpoint

  1. Go to HuggingFace Inference Endpoints
  2. Click "Create Endpoint"
  3. Select your model: adithyafp/onnx-whisper-jv
  4. Choose instance type: CPU (Medium or Large recommended)
  5. Deploy!
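Alternatively, endpoints can be created programmatically with huggingface_hub's create_inference_endpoint. This is only a sketch: the endpoint name is made up, and the framework, vendor, region, and instance values are placeholders to check against the current Inference Endpoints catalog:

from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "onnx-whisper-jv",                      # hypothetical endpoint name
    repository="adithyafp/onnx-whisper-jv",
    framework="pytorch",                    # placeholder; handler.py in the repo is used for inference
    task="automatic-speech-recognition",
    accelerator="cpu",
    vendor="aws",                           # placeholder vendor/region
    region="us-east-1",
    type="protected",
    instance_size="x4",                     # placeholder CPU instance
    instance_type="intel-icl",
    token="YOUR_HF_TOKEN",
)
endpoint.wait()      # block until the endpoint is running
print(endpoint.url)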

Step 3: Test Your Endpoint

import requests
import base64

# Your endpoint URL (from HF dashboard)
ENDPOINT_URL = "https://xxxxxxxx.endpoints.huggingface.cloud"

with open("test_audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": "Bearer YOUR_HF_TOKEN"},
    json={"inputs": audio_base64}
)

print(response.json())

Local Testing (Before Deployment)

Test the handler locally before deploying:

from handler import EndpointHandler
import base64

# Initialize handler
handler = EndpointHandler(path=".")

# Load test audio
with open("test_audio.wav", "rb") as f:
    audio_bytes = f.read()

audio_base64 = base64.b64encode(audio_bytes).decode()

# Test inference
result = handler({
    "inputs": audio_base64,
    "parameters": {"max_length": 448}
})

print(result)

Requirements

onnxruntime>=1.16.0
transformers>=4.30.0
numpy>=1.24.0
librosa>=0.10.0
soundfile>=0.12.1
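For local testing (for example, the handler test above), the dependencies can be installed with:

pip install -r requirements.txt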

Citation

If you use this model, please cite:

@misc{whisper-jv-onnx-2024,
  author = {adithyafp},
  title = {ONNX Whisper Javanese ASR Model},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/adithyafp/onnx-whisper-jv}
}

License

Apache 2.0

Support

For issues or questions, please open a discussion in the Community tab of this model repository.


Made with ❤️ for the Javanese language community
