Vogent-Turn-80M

State-of-the-art multimodal turn detection model for voice AI systems, achieving 94.1% accuracy by combining acoustic and linguistic signals for real-time conversational applications.

Technical Report

HF Space

Inference Code

Model Details

Model Description

Vogent-Turn-80M is a multimodal turn detection model that addresses the critical challenge of determining when a speaker has finished their turn in a conversation. Unlike traditional approaches that rely solely on audio or text, Vogent-Turn-80M processes both acoustic features (via a Whisper encoder) and semantic context to make accurate predictions in real time (~7ms on a T4 GPU).

  • Developed by: Vogent AI
  • Model type: Multimodal Turn Detection (Binary Classification)
  • Language(s) (NLP): English
  • License: Vogent-Turn-80M is licensed under a modified Apache-2.0 license; horizontal voice agent platforms may not select Vogent-Turn-80M as the default turn-detection model, and any end users who wish to use the model must be required to select 'Vogent Turn Detector.' Otherwise, standard Apache-2.0 provisions apply.
  • Finetuned from model: SmolLM2-135M (reduced to ~80M parameters by using only the first 12 layers)

Uses

Vogent-Turn-80M is designed for real-time turn detection in voice assistant applications, determining when a user has finished speaking to enable natural conversational flow without premature interruptions or awkward delays.

Bias, Risks, and Limitations

Technical Limitations:

  • English-only support; turn-taking conventions vary across languages and cultures
  • CPU inference may be too slow for some real-time applications

How to Get Started with the Model

For complete installation and usage instructions, visit: https://github.com/vogent/vogent-turn

Quick Install

# Clone the repository
git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn

# Install in development mode
pip install -e .

Basic Usage

from vogent_turn import TurnDetector
import soundfile as sf
import urllib.request

# Initialize detector
detector = TurnDetector(compile_model=True, warmup=True)

# Download and load audio
audio_url = "https://storage.googleapis.com/voturn-sample-recordings/incomplete_number_sample.wav"
urllib.request.urlretrieve(audio_url, "sample.wav")
audio, sr = sf.read("sample.wav")

# Run turn detection with conversational context
result = detector.predict(
    audio,
    prev_line="What is your phone number",
    curr_line="My number is 804",
    sample_rate=sr,
    return_probs=True,
)

print(f"Turn complete: {result['is_endpoint']}")
print(f"Done speaking probability: {result['prob_endpoint']:.1%}")

Available Interfaces

  • Python Library: Direct integration with TurnDetector class
  • CLI Tool: vogent-turn-predict speech.wav --prev "What is your phone number" --curr "My number is 804"

See the GitHub repository for detailed documentation, performance benchmarks, and advanced usage.

Training Details

Training Data

The model was trained on a diverse dataset combining human-collected and synthetic conversational data.

Training Procedure

Preprocessing

  • Audio: last 8 seconds, encoded by the Whisper-Tiny encoder → ~400 audio tokens (see the preprocessing sketch after this list)
  • Text: Full conversational context including assistant and user utterances
  • Labels: Binary classification (turn complete/incomplete)
  • Multimodal fusion: Audio embeddings projected into LLM's input space and concatenated with text
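
As a hedged illustration of the audio side of this preprocessing, the sketch below trims a clip to its final 8 seconds and resamples it to 16kHz before inference; the prepare_audio helper and the use of librosa for resampling are illustrative assumptions, not part of the vogent-turn API.

import numpy as np
import soundfile as sf
import librosa  # assumed here only for resampling; any resampler works

TARGET_SR = 16_000   # Whisper expects 16kHz PCM
MAX_SECONDS = 8      # the encoder sees at most the last 8 seconds

def prepare_audio(path: str) -> np.ndarray:
    """Load a clip, downmix to mono, resample to 16kHz, and keep the last 8 seconds."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:                      # downmix stereo to mono
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    return audio[-MAX_SECONDS * TARGET_SR:] # trailing 8 seconds only

The resulting array can then be passed to detector.predict with sample_rate=16000, as in the Basic Usage example above.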

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Base model initialization: SmolLM2-135M (first 12 layers)
  • Architecture modifications: Reduced from 135M to ~80M parameters through layer ablation
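
A minimal sketch of the layer ablation described above, assuming the Hugging Face transformers loading path and the Llama-style module layout that SmolLM2-135M uses; Vogent's actual training code is not reproduced here.

import torch
from transformers import AutoModelForCausalLM

# Load the full 135M-parameter base model.
base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Keep only the first 12 transformer blocks and update the config to match.
base.model.layers = torch.nn.ModuleList(base.model.layers[:12])
base.config.num_hidden_layers = 12

print(f"~{sum(p.numel() for p in base.parameters()) / 1e6:.0f}M parameters remain")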

Speeds, Sizes, Times

  • Model size: ~80M parameters

Evaluation

Testing Data, Factors & Metrics

Testing Data

Internal test set covering diverse conversational scenarios and edge cases where audio-only or text-only approaches fail.

  • Accuracy: 94.1%
  • AUPRC: 0.975
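
For reference, both metrics can be computed for a binary endpoint classifier as in the sketch below; the label and probability arrays are placeholders, not Vogent's internal test set.

import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Placeholder data: 1 = turn complete (endpoint), 0 = speaker continues.
y_true = np.array([1, 0, 1, 1, 0, 1])
# Model scores for each example, e.g. result["prob_endpoint"] from detector.predict.
y_prob = np.array([0.92, 0.11, 0.85, 0.40, 0.05, 0.97])

accuracy = accuracy_score(y_true, y_prob >= 0.5)   # thresholded at 0.5
auprc = average_precision_score(y_true, y_prob)    # threshold-free AUPRC
print(f"Accuracy: {accuracy:.1%}  AUPRC: {auprc:.3f}")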

Technical Specifications

Model Architecture and Objective

Architecture:

  • Audio Encoder: Whisper-Tiny (processes up to 8 seconds of 16kHz audio)
  • Text Model: SmolLM2-135M (12 layers, ~80M parameters)
  • Multimodal Fusion: Audio embeddings projected into LLM's input space
  • Classifier: Binary classification head (turn complete/incomplete)

Processing Flow:

  1. Audio (16kHz PCM) → Whisper Encoder → Audio Embeddings (~400 tokens)
  2. Text Context → SmolLM Tokenizer → Text Embeddings
  3. Concatenate embeddings → SmolLM Transformer → Last token hidden state
  4. Linear Classifier → Softmax → [P(continue), P(endpoint)]
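
The flow above can be expressed as a hedged PyTorch sketch; the module names, dimensions, and the generic encoder stack below are illustrative placeholders, not the actual vogent-turn internals.

import torch
import torch.nn as nn

class TurnFusionSketch(nn.Module):
    """Illustrative fusion head: audio + text embeddings -> [P(continue), P(endpoint)]."""

    def __init__(self, d_audio=384, d_model=576, vocab_size=49152):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)        # project audio into the LLM's input space
        self.text_embed = nn.Embedding(vocab_size, d_model)  # SmolLM-style token embeddings
        self.backbone = nn.TransformerEncoder(               # stand-in for the 12-layer SmolLM stack
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.classifier = nn.Linear(d_model, 2)              # binary endpoint head

    def forward(self, audio_emb, text_ids):
        audio_tok = self.audio_proj(audio_emb)               # (B, ~400, d_model)
        text_tok = self.text_embed(text_ids)                 # (B, T, d_model)
        hidden = self.backbone(torch.cat([audio_tok, text_tok], dim=1))
        logits = self.classifier(hidden[:, -1, :])           # last-token hidden state
        return torch.softmax(logits, dim=-1)                 # [P(continue), P(endpoint)]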

Compute Infrastructure

Hardware

Optimization Features:

  • torch.compile with max-autotune mode
  • Dynamic tensor shapes without recompilation
  • Pre-warmed bucket sizes (64, 128, 256, 512, 1024)
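
A hedged sketch of the compile-and-warmup pattern these bullets describe, applied to a toy module; the bucket-padding helper is an assumption about how recompilation is avoided, not vogent-turn's published code.

import torch

BUCKETS = (64, 128, 256, 512, 1024)   # pre-warmed sequence-length buckets

def pad_to_bucket(x: torch.Tensor) -> torch.Tensor:
    """Right-pad the sequence dimension to the next bucket so compiled shapes are reused."""
    target = next(b for b in BUCKETS if b >= x.shape[1])
    return torch.nn.functional.pad(x, (0, 0, 0, target - x.shape[1]))

model = torch.nn.Linear(576, 2)                       # toy stand-in for the real model
compiled = torch.compile(model, mode="max-autotune")  # aggressive autotuned compilation

# Warm up each bucket once so later requests hit an already-compiled graph.
with torch.no_grad():
    for bucket in BUCKETS:
        compiled(torch.randn(1, bucket, 576))

    # At inference time, pad real inputs up to the nearest bucket before calling.
    out = compiled(pad_to_bucket(torch.randn(1, 300, 576)))  # reuses the 512 bucket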

Software

  • Framework: PyTorch with torch.compile
  • Audio processing: Whisper encoder (up to 8 seconds)

Citation

BibTeX:

@misc{voturn2025,
  title={Vogent-Turn-80M: State-of-the-Art Turn Detection for Voice Agents},
  author={Varadarajan, Vignesh and Vytheeswaran, Jagath},
  year={2025},
  publisher={Vogent AI},
  howpublished={\url{https://huggingface.co/vogent/Vogent-Turn-80M}},
  note={Blog: \url{https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents}}
}

More Information

Vogent-Turn-80M is part of Vogent's comprehensive voice AI platform.

Upcoming releases:

  • Int8 quantized model for faster CPU deployment
  • Multilingual versions
  • Domain-specific adaptations

Model Card Authors

Vogent AI Team
