Spectra-561B-27B-768E: Model Overview

Omnira Spectra-561B-27B-768E is a state-of-the-art open-source omni-modal model developed by Omnira's Artificial Intelligence Team. Designed for real-time audio-visual interaction, it integrates text, image, video, and audio processing into a single, unified framework.

Core Architecture

The model utilizes a massive Mixture-of-Experts (MoE) structure that balances extreme scale with computational efficiency.

  • Parameters: 561B total, with 27B activated per token.
  • Backbone: Built on Spectra Architecture, featuring a "Shortcut-connected MoE" (ScMoE) that overlaps computation and communication to reduce latency.
  • Dynamic Computation: Uses "zero-computation experts" and a PID controller to allocate compute based on the importance of each token.
  • Context Window: Supports up to 128K tokens, enabling long-term memory and complex temporal reasoning.
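As a rough illustration of how zero-computation experts plus feedback control can hold per-token compute near a budget, here is a minimal NumPy sketch. The 512 routed / 256 zero expert counts come from the card; the top-k value, target fraction, gain, and update rule are invented placeholders, and the loop uses only the integral term of a full PID controller.

```python
import numpy as np

N_ROUTED, N_ZERO, TOP_K = 512, 256, 12   # expert counts from the card; TOP_K assumed
N_EXPERTS = N_ROUTED + N_ZERO
rng = np.random.default_rng(0)

def route(logits, bias):
    """Pick top-k experts per token; picks landing on zero experts cost no FLOPs."""
    scores = logits + bias
    return np.argsort(scores, axis=-1)[:, -TOP_K:]

def routed_fraction(topk):
    """Fraction of picks that hit real (routed) experts, i.e. actual compute."""
    return float(np.mean(topk < N_ROUTED))

target, gain = 0.6, 2.0                  # illustrative compute budget and gain
bias = np.zeros(N_EXPERTS)
for _ in range(200):
    logits = rng.normal(size=(1024, N_EXPERTS))
    err = routed_fraction(route(logits, bias)) - target
    bias[N_ROUTED:] += gain * err        # over budget -> steer traffic to zero experts
```

After a few hundred steps the controller settles the routed-expert share near the target, so average activated parameters per token stay close to a fixed budget even though individual tokens can still recruit more or fewer real experts.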

Multimodal Integration

Component           Technology      Function
Vision Encoder      VidEnc (637M)   Processes images and videos natively; supports arbitrary aspect ratios and resolutions.
Audio System        Audio-Code-S    Discretizes audio into semantic and acoustic codebooks at 16.67 Hz.
Streaming Encoder   FSMN-based      Uses Feedforward Sequential Memory Networks for low-latency audio processing.
Fusion Strategy     Early-Fusion    Aligns all modalities (text, audio, visual) within a shared latent space for unified reasoning.
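The early-fusion row above can be sketched as follows: each modality is projected into a shared latent width, and the resulting tokens are concatenated into one sequence for the backbone. All dimensions, token counts, and the random linear adapters below are illustrative assumptions, not the model's actual layout.

```python
import numpy as np

D_MODEL = 64                     # shared latent width (illustrative)
rng = np.random.default_rng(0)

def project(features, d_in):
    """Linear adapter mapping a modality's feature dim to the shared width."""
    w = rng.normal(scale=d_in ** -0.5, size=(d_in, D_MODEL))
    return features @ w

text  = project(rng.normal(size=(10, 32)), 32)   # 10 text tokens
image = project(rng.normal(size=(49, 48)), 48)   # a 7x7 patch grid
audio = project(rng.normal(size=(17, 24)), 24)   # ~1 s of audio frames

# Early fusion: one unified token stream for the backbone to reason over.
sequence = np.concatenate([text, image, audio], axis=0)
print(sequence.shape)            # (76, 64)
```

Because all modalities share one latent space, the backbone attends across text, image patches, and audio frames with the same mechanism, which is what enables unified cross-modal reasoning.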

Key Performance Highlights

  • Omni-Modality: Outperforms open-source rivals (e.g., Qwen3-Omni) on OmniBench (61.38) and WorldSense (60.89).
  • Real-Time Interaction: Achieves millisecond-level latency for streaming speech generation and video interaction.
  • Vision & Video: Competitive with proprietary models like Gemini-2.5-Flash in document understanding (DocVQA) and video reasoning (VideoMME).
  • Audio Excellence: Shows superior performance in Mandarin and English ASR (Automatic Speech Recognition) compared to GPT-4o-Audio.

Training & Efficiency

  1. Curriculum-Inspired Training: A six-stage pipeline that starts with text-only data, then gradually injects speech, image, and finally video data.
  2. Modality-Decoupled Parallelism (MDP): Separates encoder and LLM optimization to maintain over 90% of text-only training throughput even with complex multimodal data.
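A minimal sketch of curriculum-style data mixing, assuming each stage keeps earlier modalities in the mix while adding a new one. The six mixtures below are invented placeholders, not the published recipe.

```python
import random

# Six illustrative stages: text only, then speech, image, and finally video.
STAGES = [
    {"text": 1.0},
    {"text": 0.8, "audio": 0.2},
    {"text": 0.6, "audio": 0.4},
    {"text": 0.5, "audio": 0.2, "image": 0.3},
    {"text": 0.4, "audio": 0.2, "image": 0.4},
    {"text": 0.3, "audio": 0.2, "image": 0.3, "video": 0.2},
]

def sample_batch(stage, size, rng):
    """Draw a batch of modality labels according to the stage's mixture weights."""
    weights = STAGES[stage]
    mods = list(weights)
    return rng.choices(mods, weights=[weights[m] for m in mods], k=size)

batch = sample_batch(5, 8, random.Random(0))   # final stage mixes all four modalities
```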

Technical Specifications

Feature                 Specification
Total Parameters        561B
Activated Parameters    27B
Expert Configuration    512 Routed; 256 Zero Experts
Context Window          128K tokens
Primary Tasks           Audio, Visual, Text, Video-Continuation

Quick Start

Installation

# install Spectra-Omni environment
conda create -n spectra python=3.10
conda activate spectra

# install dependencies
pip install torch transformers flash_attn

Usage

Due to its massive scale (561B total parameters), Spectra requires a multi-node cluster or high-memory instances (e.g., 16×H800) for inference in BF16.

from spectra_omni import SpectraModel

model = SpectraModel.from_pretrained("thenexthub/Spectra-561B-27B-768E")
# Seamlessly process audio, image, and text inputs
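The hardware requirement above can be sanity-checked with back-of-the-envelope arithmetic (weights only; the KV cache and activations need additional headroom):

```python
# BF16 stores each parameter in 2 bytes, so the raw weights alone need:
params = 561e9
bytes_per_param = 2            # BF16
weights_gb = params * bytes_per_param / 1e9
per_gpu_gb = 80                # H800 HBM capacity
gpus = 16
print(weights_gb, gpus * per_gpu_gb)   # 1122.0 GB of weights vs 1280 GB of HBM
```

That leaves roughly 158 GB across 16 GPUs for activations and KV cache, which is why single-node setups are ruled out for BF16 inference.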

License Agreement

The model weights are released under the Omnira License. This license does not grant any rights to use Omnira trademarks or patents.

Citation

@misc{omnira2026spectra,
    title={Spectra-561B-27B-768E: Unified Omni-modal Intelligence}, 
    author={Omnira}, 
    year={2026}, 
    url={https://github.com/theomnira/Spectra-Omni}, 
}