# ONNX Whisper Javanese ASR Model

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![ONNX](https://img.shields.io/badge/ONNX-1.16+-green.svg)](https://onnx.ai/)
[![Transformers](https://img.shields.io/badge/Transformers-4.30+-orange.svg)](https://huggingface.co/transformers/)

Optimized ONNX version of Whisper Large V2 fine-tuned for Javanese (Basa Jawa) speech recognition. This model provides fast, CPU-friendly inference for Javanese ASR tasks.

## Model Information

- **Base Model**: Whisper Large V2
- **Language**: Javanese (Basa Jawa)
- **Task**: Automatic Speech Recognition (ASR)
- **Format**: ONNX
- **Deployment**: HuggingFace Inference Endpoints

## Features

- ✅ Optimized ONNX inference (~2-4x faster than PyTorch on CPU)
- ✅ CPU-friendly deployment
- ✅ Multiple audio format support (WAV, MP3, FLAC, M4A, OGG)
- ✅ Automatic audio preprocessing (resampling, mono conversion)
- ✅ JSON output with metadata
- ✅ Custom handler for HF Inference Endpoints

## Supported Audio Formats

- WAV (`.wav`)
- MP3 (`.mp3`)
- FLAC (`.flac`)
- M4A (`.m4a`)
- OGG (`.ogg`)

Audio files are automatically:

- Converted to mono
- Resampled to 16kHz
- Normalized

(An optional sketch of these preprocessing steps for local use appears at the end of the Usage section.)

## Usage

### Option 1: HuggingFace Inference API (Python)

```python
import requests
import base64

# Read your audio file
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()

# Encode to base64
audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

# API endpoint (replace with your endpoint URL after deployment)
API_URL = "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv"

# Your HuggingFace API token
headers = {
    "Authorization": "Bearer YOUR_HF_TOKEN"
}

# Make request
response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": audio_base64,
        "parameters": {
            "max_length": 448,
            "return_timestamps": False
        }
    }
)

# Get result
result = response.json()
print(f"Transcription: {result['transcription']}")
print(f"Duration: {result['metadata']['audio_duration_seconds']:.2f}s")
```

### Option 2: Using cURL

```bash
# Encode the audio file to base64 without line wrapping
# (GNU coreutils: `base64 -w 0 audio.wav`; macOS: `base64 -i audio.wav`)
AUDIO_BASE64=$(base64 -w 0 audio.wav)

# Make API request
curl -X POST \
  https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv \
  -H "Authorization: Bearer YOUR_HF_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{\"inputs\": \"$AUDIO_BASE64\"}"
```

### Option 3: HuggingFace Hub (Python Client)

```python
from huggingface_hub import InferenceClient
import base64

client = InferenceClient(token="YOUR_HF_TOKEN")

# Read and encode audio
with open("audio.wav", "rb") as f:
    audio_bytes = f.read()
audio_base64 = base64.b64encode(audio_bytes).decode("utf-8")

# Transcribe
result = client.post(
    json={"inputs": audio_base64},
    model="adithyafp/onnx-whisper-jv"
)
print(result)
```

### Option 4: JavaScript/TypeScript

```javascript
async function transcribeAudio(audioFile) {
  // Read the audio file and base64-encode it in chunks
  // (spreading a large Uint8Array into String.fromCharCode can overflow the call stack)
  const audioBuffer = await audioFile.arrayBuffer();
  const bytes = new Uint8Array(audioBuffer);
  let binary = "";
  const chunkSize = 0x8000;
  for (let i = 0; i < bytes.length; i += chunkSize) {
    binary += String.fromCharCode(...bytes.subarray(i, i + chunkSize));
  }
  const audioBase64 = btoa(binary);

  // API request
  const response = await fetch(
    "https://api-inference.huggingface.co/models/adithyafp/onnx-whisper-jv",
    {
      method: "POST",
      headers: {
        "Authorization": "Bearer YOUR_HF_TOKEN",
        "Content-Type": "application/json"
      },
      body: JSON.stringify({
        inputs: audioBase64,
        parameters: { max_length: 448 }
      })
    }
  );

  const result = await response.json();
  console.log("Transcription:", result.transcription);
  return result;
}

// Usage
const audioFile = document.getElementById('audioInput').files[0];
transcribeAudio(audioFile);
```
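### Optional: Reproducing the Audio Preprocessing Locally

All of the preprocessing described above (mono conversion, resampling to 16kHz, normalization) happens inside the custom handler, so clients only need to send the raw file. For local debugging it can still be useful to reproduce the same steps before encoding; the sketch below uses librosa, which is already listed in the requirements. `preprocess_audio` is just an illustrative helper, not part of the handler's API.

```python
import librosa
import numpy as np

def preprocess_audio(path: str, target_sr: int = 16000) -> np.ndarray:
    """Reproduce the handler's preprocessing: mono, 16 kHz, peak-normalized."""
    # librosa.load converts to mono and resamples in one call (float32 output)
    audio, _ = librosa.load(path, sr=target_sr, mono=True)
    # Peak-normalize so the loudest sample has magnitude 1.0
    peak = np.max(np.abs(audio))
    if peak > 0:
        audio = audio / peak
    return audio

waveform = preprocess_audio("audio.wav")
print(f"{len(waveform) / 16000:.2f}s of 16 kHz mono audio")
```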
## Response Format

The API returns a JSON response with the following structure:

```json
{
  "transcription": "Sugeng enjing, kepiye kabare?",
  "language": "javanese",
  "status": "success",
  "metadata": {
    "audio_duration_seconds": 3.52,
    "num_tokens": 12,
    "model": "whisper-large-v2-jv-onnx"
  }
}
```

### Response Fields

- `transcription` (string): The transcribed text in Javanese
- `language` (string): Source language ("javanese")
- `status` (string): Request status ("success" or "error")
- `metadata` (object):
  - `audio_duration_seconds` (float): Duration of the input audio
  - `num_tokens` (int): Number of tokens generated
  - `model` (string): Model identifier

## Parameters

You can customize inference with optional parameters:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `max_length` | int | 448 | Maximum length of generated tokens |
| `return_timestamps` | bool | false | Return word-level timestamps (planned, not yet supported) |
| `return_token_ids` | bool | false | Include raw token IDs in the response |

Example with parameters:

```python
response = requests.post(
    API_URL,
    headers=headers,
    json={
        "inputs": audio_base64,
        "parameters": {
            "max_length": 448,
            "return_token_ids": True
        }
    }
)
```

## Error Handling

The API returns error responses in the following format:

```json
{
  "error": "Error message here",
  "status": "error",
  "message": "An error occurred during transcription"
}
```

Common errors:

- Invalid audio format
- Invalid base64 encoding
- Audio file too large (>10MB recommended limit)
- Missing input data

## Model Files

This repository contains:

- `encoder_model.onnx` + `encoder_model.onnx_data` - ONNX encoder (2.4GB)
- `decoder_model.onnx` + `decoder_model.onnx_data` - ONNX decoder (3.6GB)
- `decoder_with_past_model.onnx` + `decoder_with_past_model.onnx_data` - ONNX decoder with KV cache (3.2GB)
- `tokenizer.json` - Whisper tokenizer
- `preprocessor_config.json` - Audio preprocessing config
- `config.json` - Model configuration
- `generation_config.json` - Generation parameters
- `handler.py` - Custom inference handler
- `requirements.txt` - Python dependencies

## Performance

- **Inference Speed**: ~2-4x faster than PyTorch (CPU)
- **Memory Usage**: ~6GB RAM to load the models
- **Latency**: ~1-2s for 30s of audio (depends on CPU)

## Deployment to HuggingFace Inference Endpoints

### Step 1: Upload Model Files

```bash
# Install Git LFS
git lfs install

# Clone your repository
git clone https://huggingface.co/adithyafp/onnx-whisper-jv
cd onnx-whisper-jv

# Make sure the large ONNX weights are tracked by LFS
git lfs track "*.onnx" "*.onnx_data"

# Copy the model files, handler.py, and requirements.txt into the repo, then add everything
git add .
git commit -m "Add ONNX model with custom handler"
git push
```

### Step 2: Create Inference Endpoint

1. Go to [HuggingFace Inference Endpoints](https://ui.endpoints.huggingface.co/)
2. Click "Create Endpoint"
3. Select your model: `adithyafp/onnx-whisper-jv`
4. Choose instance type: CPU (Medium or Large recommended)
5. Deploy!
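You can also create the endpoint programmatically with `huggingface_hub`. The sketch below uses `create_inference_endpoint`; the vendor, region, and instance values shown are placeholders and must be replaced with options available to your account (check the Endpoints UI for valid choices):

```python
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "onnx-whisper-jv",                    # endpoint name
    repository="adithyafp/onnx-whisper-jv",
    framework="pytorch",                  # container framework; the custom handler.py still drives inference
    task="automatic-speech-recognition",
    accelerator="cpu",
    vendor="aws",                         # placeholder: pick your cloud vendor
    region="us-east-1",                   # placeholder: pick your region
    instance_size="x4",                   # placeholder: pick an available CPU size
    instance_type="intel-icl",            # placeholder: pick an available CPU type
    token="YOUR_HF_TOKEN",
)

endpoint.wait()  # block until the endpoint reaches the "running" state
print(endpoint.url)
```

Either way, once the endpoint reports a running state you can send requests to its URL as shown in Step 3.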
### Step 3: Test Your Endpoint

```python
import requests
import base64

# Your endpoint URL (from the HF dashboard)
ENDPOINT_URL = "https://xxxxxxxx.endpoints.huggingface.cloud"

with open("test_audio.wav", "rb") as f:
    audio_base64 = base64.b64encode(f.read()).decode()

response = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": "Bearer YOUR_HF_TOKEN"},
    json={"inputs": audio_base64}
)

print(response.json())
```

## Local Testing (Before Deployment)

Test the handler locally before deploying:

```python
from handler import EndpointHandler
import base64

# Initialize handler
handler = EndpointHandler(path=".")

# Load test audio
with open("test_audio.wav", "rb") as f:
    audio_bytes = f.read()
audio_base64 = base64.b64encode(audio_bytes).decode()

# Test inference
result = handler({
    "inputs": audio_base64,
    "parameters": {"max_length": 448}
})

print(result)
```

## Requirements

```txt
onnxruntime>=1.16.0
transformers>=4.30.0
numpy>=1.24.0
librosa>=0.10.0
soundfile>=0.12.1
```

## Citation

If you use this model, please cite:

```bibtex
@misc{whisper-jv-onnx-2024,
  author = {adithyafp},
  title = {ONNX Whisper Javanese ASR Model},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/adithyafp/onnx-whisper-jv}
}
```

## License

Apache 2.0

## Links

- 🤗 [Model on HuggingFace](https://huggingface.co/adithyafp/onnx-whisper-jv)
- 📝 [Whisper Paper](https://arxiv.org/abs/2212.04356)
- 🔧 [ONNX Runtime](https://onnxruntime.ai/)

## Support

For issues or questions:

- Open an issue on the [model repository](https://huggingface.co/adithyafp/onnx-whisper-jv/discussions)
- Contact: [Your contact information]

---

Made with ❤️ for the Javanese language community