Vogent-Turn-80M

State-of-the-art multimodal turn detection model for voice AI systems, achieving 94.1% accuracy by combining acoustic and linguistic signals for real-time conversational applications.

Technical Report

HF Space

Inference Code

Model Details

Model Description

Vogent-Turn-80M is a multimodal turn detection model that addresses the critical challenge of determining when a speaker has finished their turn in a conversation. Unlike traditional approaches that rely solely on audio or text, Vogent-Turn-80M processes both acoustic features (via a Whisper encoder) and semantic context to make accurate predictions in real time (~7ms on a T4 GPU).

  • Developed by: Vogent AI
  • Model type: Multimodal Turn Detection (Binary Classification)
  • Language(s) (NLP): English
  • License: Vogent-Turn-80M is licensed under a modified Apache-2.0 license; horizontal voice agent platforms may not select Vogent-Turn-80M as the default turn-detection model, and any end users who wish to use the model must be required to select 'Vogent Turn Detector.' Otherwise, standard Apache-2.0 provisions apply.
  • Finetuned from model: SmolLM2-135M (reduced to ~80M parameters by using only the first 12 layers)

Uses

Vogent-Turn-80M is designed for real-time turn detection in voice assistant applications, determining when a user has finished speaking to enable natural conversational flow without premature interruptions or awkward delays.

Bias, Risks, and Limitations

Technical Limitations:

  • English-only support; turn-taking conventions vary across languages and cultures
  • CPU inference may be too slow for some real-time applications

How to Get Started with the Model

For complete installation and usage instructions, visit: https://github.com/vogent/vogent-turn

Quick Install

# Clone the repository
git clone https://github.com/vogent/vogent-turn.git
cd vogent-turn

# Install in development mode
pip install -e .

Basic Usage

from vogent_turn import TurnDetector
import soundfile as sf
import urllib.request

# Initialize detector
detector = TurnDetector(compile_model=True, warmup=True)

# Download and load audio
audio_url = "https://storage.googleapis.com/voturn-sample-recordings/incomplete_number_sample.wav"
urllib.request.urlretrieve(audio_url, "sample.wav")
audio, sr = sf.read("sample.wav")

# Run turn detection with conversational context
result = detector.predict(
    audio,
    prev_line="What is your phone number",
    curr_line="My number is 804",
    sample_rate=sr,
    return_probs=True,
)

print(f"Turn complete: {result['is_endpoint']}")
print(f"Done speaking probability: {result['prob_endpoint']:.1%}")

Available Interfaces

  • Python Library: Direct integration with TurnDetector class
  • CLI Tool: vogent-turn-predict speech.wav --prev "What is your phone number" --curr "My number is 804"

See the GitHub repository for detailed documentation, performance benchmarks, and advanced usage.

Training Details

Training Data

The model was trained on a diverse dataset combining human-collected and synthetic conversational data.

Training Procedure

Preprocessing

  • Audio: last 8 seconds, encoded by the Whisper-Tiny encoder → ~400 audio tokens (see the preprocessing sketch after this list)
  • Text: Full conversational context including assistant and user utterances
  • Labels: Binary classification (turn complete/incomplete)
  • Multimodal fusion: Audio embeddings projected into LLM's input space and concatenated with text
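
As a hedged illustration of the audio side of this preprocessing, the sketch below trims a clip to its final 8 seconds and resamples it to 16kHz before inference; the prepare_audio helper and the use of librosa for resampling are illustrative assumptions, not part of the vogent-turn API.

import numpy as np
import soundfile as sf
import librosa  # assumed here only for resampling; any resampler works

TARGET_SR = 16_000   # Whisper expects 16kHz PCM
MAX_SECONDS = 8      # the encoder sees at most the last 8 seconds

def prepare_audio(path: str) -> np.ndarray:
    """Load a clip, downmix to mono, resample to 16kHz, and keep the last 8 seconds."""
    audio, sr = sf.read(path)
    if audio.ndim > 1:                      # downmix stereo to mono
        audio = audio.mean(axis=1)
    if sr != TARGET_SR:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=TARGET_SR)
    return audio[-MAX_SECONDS * TARGET_SR:] # trailing 8 seconds only

The resulting array can then be passed to detector.predict with sample_rate=16000, as in the Basic Usage example above.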

Training Hyperparameters

  • Training regime: fp16 mixed precision
  • Base model initialization: SmolLM2-135M (first 12 layers)
  • Architecture modifications: Reduced from 135M to ~80M parameters through layer ablation
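
A minimal sketch of the layer ablation described above, assuming the Hugging Face transformers loading path and the Llama-style module layout that SmolLM2-135M uses; Vogent's actual training code is not reproduced here.

import torch
from transformers import AutoModelForCausalLM

# Load the full 135M-parameter base model.
base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-135M")

# Keep only the first 12 transformer blocks and update the config to match.
base.model.layers = torch.nn.ModuleList(base.model.layers[:12])
base.config.num_hidden_layers = 12

print(f"~{sum(p.numel() for p in base.parameters()) / 1e6:.0f}M parameters remain")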

Speeds, Sizes, Times

  • Model size: ~80M parameters

Evaluation

Testing Data, Factors & Metrics

Testing Data

Internal test set covering diverse conversational scenarios and edge cases where audio-only or text-only approaches fail.

  • Accuracy: 94.1%
  • AUPRC: 0.975
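
For reference, both metrics can be computed for a binary endpoint classifier as in the sketch below; the label and probability arrays are placeholders, not Vogent's internal test set.

import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

# Placeholder data: 1 = turn complete (endpoint), 0 = speaker continues.
y_true = np.array([1, 0, 1, 1, 0, 1])
# Model scores for each example, e.g. result["prob_endpoint"] from detector.predict.
y_prob = np.array([0.92, 0.11, 0.85, 0.40, 0.05, 0.97])

accuracy = accuracy_score(y_true, y_prob >= 0.5)   # thresholded at 0.5
auprc = average_precision_score(y_true, y_prob)    # threshold-free AUPRC
print(f"Accuracy: {accuracy:.1%}  AUPRC: {auprc:.3f}")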

Technical Specifications

Model Architecture and Objective

Architecture:

  • Audio Encoder: Whisper-Tiny (processes up to 8 seconds of 16kHz audio)
  • Text Model: SmolLM2-135M (12 layers, ~80M parameters)
  • Multimodal Fusion: Audio embeddings projected into LLM's input space
  • Classifier: Binary classification head (turn complete/incomplete)

Processing Flow:

  1. Audio (16kHz PCM) → Whisper Encoder → Audio Embeddings (~400 tokens)
  2. Text Context → SmolLM Tokenizer → Text Embeddings
  3. Concatenate embeddings → SmolLM Transformer → Last token hidden state
  4. Linear Classifier → Softmax → [P(continue), P(endpoint)]
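
The flow above can be expressed as a hedged PyTorch sketch; the module names, dimensions, and the generic encoder stack below are illustrative placeholders, not the actual vogent-turn internals.

import torch
import torch.nn as nn

class TurnFusionSketch(nn.Module):
    """Illustrative fusion head: audio + text embeddings -> [P(continue), P(endpoint)]."""

    def __init__(self, d_audio=384, d_model=576, vocab_size=49152):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_model)        # project audio into the LLM's input space
        self.text_embed = nn.Embedding(vocab_size, d_model)  # SmolLM-style token embeddings
        self.backbone = nn.TransformerEncoder(               # stand-in for the 12-layer SmolLM stack
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.classifier = nn.Linear(d_model, 2)              # binary endpoint head

    def forward(self, audio_emb, text_ids):
        audio_tok = self.audio_proj(audio_emb)               # (B, ~400, d_model)
        text_tok = self.text_embed(text_ids)                 # (B, T, d_model)
        hidden = self.backbone(torch.cat([audio_tok, text_tok], dim=1))
        logits = self.classifier(hidden[:, -1, :])           # last-token hidden state
        return torch.softmax(logits, dim=-1)                 # [P(continue), P(endpoint)]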

Compute Infrastructure

Hardware

Optimization Features:

  • torch.compile with max-autotune mode
  • Dynamic tensor shapes without recompilation
  • Pre-warmed bucket sizes (64, 128, 256, 512, 1024)
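
A hedged sketch of the compile-and-warmup pattern these bullets describe, applied to a toy module; the bucket-padding helper is an assumption about how recompilation is avoided, not vogent-turn's published code.

import torch

BUCKETS = (64, 128, 256, 512, 1024)   # pre-warmed sequence-length buckets

def pad_to_bucket(x: torch.Tensor) -> torch.Tensor:
    """Right-pad the sequence dimension to the next bucket so compiled shapes are reused."""
    target = next(b for b in BUCKETS if b >= x.shape[1])
    return torch.nn.functional.pad(x, (0, 0, 0, target - x.shape[1]))

model = torch.nn.Linear(576, 2)                       # toy stand-in for the real model
compiled = torch.compile(model, mode="max-autotune")  # aggressive autotuned compilation

# Warm up each bucket once so later requests hit an already-compiled graph.
with torch.no_grad():
    for bucket in BUCKETS:
        compiled(torch.randn(1, bucket, 576))

    # At inference time, pad real inputs up to the nearest bucket before calling.
    out = compiled(pad_to_bucket(torch.randn(1, 300, 576)))  # reuses the 512 bucket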

Software

  • Framework: PyTorch with torch.compile
  • Audio processing: Whisper encoder (up to 8 seconds)

Citation

BibTeX:

@misc{voturn2025,
  title={Vogent-Turn-80M: State-of-the-Art Turn Detection for Voice Agents},
  author={Varadarajan, Vignesh and Vytheeswaran, Jagath},
  year={2025},
  publisher={Vogent AI},
  howpublished={\url{https://huggingface.co/vogent/Vogent-Turn-80M}},
  note={Blog: \url{https://blog.vogent.ai/posts/voturn-80m-state-of-the-art-turn-detection-for-voice-agents}}
}

More Information

Vogent-Turn-80M is part of Vogent's comprehensive voice AI platform.

Upcoming releases:

  • Int8 quantized model for faster CPU deployment
  • Multilingual versions
  • Domain-specific adaptations

Model Card Authors

Vogent AI Team
