---
language:
- en
- ar
- bn
- zh
- da
- nl
- de
- fi
- fr
- hi
- id
- it
- ja
- ko
- mr
- 'no'
- pl
- pt
- ru
- es
- tr
- uk
- vi
license: apache-2.0
library_name: onnxruntime
pipeline_tag: voice-activity-detection
tags:
- turn-detection
- end-of-utterance
- mmbert
- onnx
- quantized
- conversational-ai
- voice-assistant
- real-time
- voice-activity-detection
base_model: jhu-clsp/mmBERT-base
datasets:
- videosdk-live/Namo-Turn-Detector-v1-Train
model-index:
- name: Namo Turn Detector v1 - Multilingual
  results:
  - task:
      type: text-classification
      name: Turn Detection
    dataset:
      name: Namo Turn Detector v1 Test - Multilingual
      type: videosdk-live/Namo-Turn-Detector-v1-Test
      split: train
    metrics:
    - type: accuracy
      value: 0.9025
---

# 🎯 Namo Turn Detector v1 - Multilingual

<div align="center">

[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[ONNX](https://onnx.ai/)
[Model on Hugging Face](https://huggingface.co/videosdk-live/Namo-Turn-Detector-v1-Multilingual)

**🚀 Namo Turn Detection Model for Multiple Languages**

🇸🇦 Arabic, 🇮🇳 Bengali, 🇨🇳 Chinese, 🇩🇰 Danish, 🇳🇱 Dutch, 🇩🇪 German, 🇬🇧🇺🇸 English, 🇫🇮 Finnish, 🇫🇷 French, 🇮🇳 Hindi, 🇮🇩 Indonesian, 🇮🇹 Italian, 🇯🇵 Japanese, 🇰🇷 Korean, 🇮🇳 Marathi, 🇳🇴 Norwegian, 🇵🇱 Polish, 🇵🇹 Portuguese, 🇷🇺 Russian, 🇪🇸 Spanish, 🇹🇷 Turkish, 🇺🇦 Ukrainian, and 🇻🇳 Vietnamese

</div>

---

## 📋 Overview

The **Namo Turn Detector** is a specialized AI model designed to solve one of the most challenging problems in conversational AI: **knowing when a user has finished speaking**.

This multilingual model uses natural language understanding to distinguish between:

- ✅ **Complete utterances** (the user is done speaking)
- 🔄 **Incomplete utterances** (the user will continue speaking)

Built on the **mmBERT** architecture and exported to quantized ONNX, it delivers enterprise-grade performance with minimal latency.

## 🔑 Key Features

- **Turn Detection Specialist**: Detects end-of-turn vs. continuation in multilingual speech transcripts.
- **Low Latency**: Optimized with **quantized ONNX** for <29ms inference.
- **Robust Performance**: Averages 90.25% accuracy across multilingual utterances.
- **Easy Integration**: Compatible with Python, ONNX Runtime, and the VideoSDK Agents SDK.
- **Enterprise Ready**: Supports real-time conversational AI and voice assistants.

## 📊 Performance Metrics

| Metric | Score |
|--------|-------|
| **⚡ Latency** | **<29ms** |
| **💾 Model Size** | **~295MB** |

| Language | Accuracy | Precision | Recall | F1 Score | Samples |
| --------------- | -------- | --------- | ------ | -------- | ------- |
| 🇹🇷 Turkish | 0.9731 | 0.9611 | 0.9853 | 0.9730 | 966 |
| 🇰🇷 Korean | 0.9685 | 0.9541 | 0.9842 | 0.9690 | 890 |
| 🇩🇪 German | 0.9425 | 0.9135 | 0.9772 | 0.9443 | 1322 |
| 🇯🇵 Japanese | 0.9436 | 0.9099 | 0.9857 | 0.9463 | 834 |
| 🇮🇳 Hindi | 0.9398 | 0.9276 | 0.9603 | 0.9436 | 1295 |
| 🇳🇱 Dutch | 0.9279 | 0.8959 | 0.9738 | 0.9332 | 1401 |
| 🇳🇴 Norwegian | 0.9165 | 0.8717 | 0.9801 | 0.9227 | 1976 |
| 🇨🇳 Chinese | 0.9164 | 0.8859 | 0.9608 | 0.9219 | 945 |
| 🇫🇮 Finnish | 0.9158 | 0.8746 | 0.9702 | 0.9199 | 1010 |
| 🇬🇧 English | 0.9086 | 0.8507 | 0.9801 | 0.9108 | 2845 |
| 🇮🇩 Indonesian | 0.9022 | 0.8514 | 0.9707 | 0.9071 | 971 |
| 🇮🇹 Italian | 0.9015 | 0.8562 | 0.9640 | 0.9069 | 782 |
| 🇵🇱 Polish | 0.9068 | 0.8619 | 0.9568 | 0.9069 | 976 |
| 🇵🇹 Portuguese | 0.8956 | 0.8410 | 0.9676 | 0.8999 | 1398 |
| 🇩🇰 Danish | 0.8973 | 0.8517 | 0.9644 | 0.9045 | 779 |
| 🇪🇸 Spanish | 0.8888 | 0.8304 | 0.9681 | 0.8940 | 1295 |
| 🇮🇳 Marathi | 0.8850 | 0.8762 | 0.9008 | 0.8883 | 774 |
| 🇷🇺 Russian | 0.8748 | 0.8318 | 0.9547 | 0.8890 | 1470 |
| 🇺🇦 Ukrainian | 0.8794 | 0.8164 | 0.9587 | 0.8819 | 929 |
| 🇻🇳 Vietnamese | 0.8645 | 0.8135 | 0.9439 | 0.8738 | 1004 |
| 🇸🇦 Arabic | 0.8490 | 0.7965 | 0.9439 | 0.8639 | 947 |
| 🇮🇳 Bengali | 0.7940 | 0.7874 | 0.7939 | 0.7907 | 1000 |

> 📊 *Evaluated on 25,000+ multilingual utterances from diverse conversational contexts*

## ⚡️ Speed Analysis

<img src="./performance_analysis.png" alt="Inference speed and performance analysis" width="600" height="400"/>
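
If you want to sanity-check latency on your own hardware, the short timing sketch below loads the same quantized ONNX model and tokenizer used in the Quick Start and reports the median single-utterance inference time. Absolute numbers depend on your CPU, ONNX Runtime build, and input length.

```python
import time
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

repo_id = "videosdk-live/Namo-Turn-Detector-v1-Multilingual"
model_path = hf_hub_download(repo_id=repo_id, filename="model_quant.onnx")
tokenizer = AutoTokenizer.from_pretrained(repo_id)
session = ort.InferenceSession(model_path)

# Tokenize one sample utterance
inputs = tokenizer("What are you doing tonight?", return_tensors="np")
feed = {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]}

# Warm-up runs so one-time initialization cost is not measured
for _ in range(5):
    session.run(None, feed)

# Time repeated runs and report the median
latencies_ms = []
for _ in range(50):
    start = time.perf_counter()
    session.run(None, feed)
    latencies_ms.append((time.perf_counter() - start) * 1000)

print(f"Median inference latency: {np.median(latencies_ms):.1f} ms")
```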

## 🔧 Train & Test Scripts

<div align="center">

[Train Notebook (Colab)](https://colab.research.google.com/drive/1WEVVAzu1WHiucPRabnyPiWWc-OYvBMNj) [Test Notebook (Colab)](https://colab.research.google.com/drive/19ZOlNoHS2WLX2V4r5r492tsCUnYLXnQR)

</div>

## 🛠️ Installation

To use this model, you will need to install the following libraries.

```bash
pip install onnxruntime transformers huggingface_hub
```
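
After installing, you can verify that ONNX Runtime is available and see which execution providers it can use (a CPU-only install typically reports just `CPUExecutionProvider`):

```python
import onnxruntime as ort

print(ort.__version__)                # installed ONNX Runtime version
print(ort.get_available_providers())  # e.g. ['CPUExecutionProvider']
```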

## 🚀 Quick Start

You can run inference directly from the Hugging Face repository.

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download


class TurnDetector:
    def __init__(self, repo_id="videosdk-live/Namo-Turn-Detector-v1-Multilingual"):
        """
        Initializes the detector by downloading the model and tokenizer
        from the Hugging Face Hub.
        """
        print(f"Loading model from repo: {repo_id}")

        # Download the model and tokenizer from the Hub.
        # Authentication is handled automatically if you are logged in.
        model_path = hf_hub_download(repo_id=repo_id, filename="model_quant.onnx")
        self.tokenizer = AutoTokenizer.from_pretrained(repo_id)

        # Set up the ONNX Runtime inference session
        self.session = ort.InferenceSession(model_path)
        self.max_length = 8192
        print("✅ Model and tokenizer loaded successfully.")

    def predict(self, text: str) -> tuple:
        """
        Predicts if a given text utterance is the end of a turn.
        Returns (predicted_label, confidence) where:
        - predicted_label: 0 for "Not End of Turn", 1 for "End of Turn"
        - confidence: confidence score between 0 and 1
        """
        # Tokenize the input text
        inputs = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            return_tensors="np"
        )

        # Prepare the feed dictionary for the ONNX model
        feed_dict = {
            "input_ids": inputs["input_ids"],
            "attention_mask": inputs["attention_mask"]
        }

        # Run inference
        outputs = self.session.run(None, feed_dict)
        logits = outputs[0]

        probabilities = self._softmax(logits[0])
        predicted_label = int(np.argmax(probabilities))
        confidence = float(np.max(probabilities))

        return predicted_label, confidence

    def _softmax(self, x, axis=-1):
        exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
        return exp_x / np.sum(exp_x, axis=axis, keepdims=True)


# --- Example Usage ---
if __name__ == "__main__":
    detector = TurnDetector()

    sentences = [
        "They're often made with oil or sugar.",                       # Expected: End of Turn
        "I think the next logical step is to",                         # Expected: Not End of Turn
        "What are you doing tonight?",                                  # Expected: End of Turn
        "The Revenue Act of 1862 adopted rates that increased with",   # Expected: Not End of Turn
    ]

    for sentence in sentences:
        predicted_label, confidence = detector.predict(sentence)
        result = "End of Turn" if predicted_label == 1 else "Not End of Turn"
        print(f"'{sentence}' -> {result} (confidence: {confidence:.3f})")
        print("-" * 50)
```
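
In a live pipeline you would typically run `predict` on each finalized STT transcript and end the turn only when the model is confident. Below is a minimal sketch of that pattern, reusing the `TurnDetector` class above; the `should_end_turn` helper and the 0.7 threshold are illustrative choices, not part of the model.

```python
def should_end_turn(detector: TurnDetector, transcript: str, threshold: float = 0.7) -> bool:
    """Return True when the model predicts end-of-turn with sufficient confidence."""
    label, confidence = detector.predict(transcript)
    return label == 1 and confidence >= threshold


detector = TurnDetector()
for transcript in ["I think the next logical step is to", "What are you doing tonight?"]:
    if should_end_turn(detector, transcript):
        print(f"End of turn -> respond now: '{transcript}'")
    else:
        print(f"Keep listening: '{transcript}'")
```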

## 🤖 VideoSDK Agents Integration

Integrate this turn detector directly with VideoSDK Agents for production-ready conversational AI applications.

```python
from videosdk_agents import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Download the model
pre_download_namo_turn_v1_model()

# Initialize the multilingual turn detector for VideoSDK Agents
turn_detector = NamoTurnDetectorV1()
```

> 📚 [**Complete Integration Guide**](https://docs.videosdk.live/ai_agents/plugins/namo-turn-detector) - Learn how to use `NamoTurnDetectorV1` with VideoSDK Agents

## 📖 Citation

```bibtex
@misc{namo_turn_detector_multilingual_2025,
  title={Namo Turn Detector v1: Multilingual},
  author={VideoSDK Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/videosdk-live/Namo-Turn-Detector-v1-Multilingual},
  note={ONNX-optimized mmBERT for turn detection in 23 languages}
}
```

## 📄 License

This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.

<div align="center">

**Made with ❤️ by the VideoSDK Team**

[videosdk.live](https://videosdk.live)

</div>