You need to agree to use this model only for research or education purposes under the Reactive AI Model & Architecture License (RAML) v1.0.
Accept the Reactive AI Model & Architecture License (RAML) v1.0 terms to access the repository and use the model. The Reactive Transformer (patent pending, #P.453260) is free for non-commercial use. For commercial use, please contact Reactive AI at [email protected]
RxT-Beta Decoder Base (2.85B A190M)
Training & docs in progress
Progress ~40B/250B tokens
RxT-Beta is the world's first real-scale stateful Reactive Language Model (RxLM), built to confirm the new Reactive Transformer (RxT) scaling laws and to solve the biggest problems of stateless LLMs. RxT models are natively conversational (and agentic): instead of reprocessing the entire conversation history (chat template) as stateless LLMs do, they process a single interaction at a time in real time and move the context into a dedicated embedding-based memory that is updated asynchronously between interactions. This introduces unique features such as:
- infinite conversation & global context through Mixture-of-Memory (MoM)
- live continual learning from interactions in real-time
- true real-time processing with near-zero latency
- linear conversation cost scaling
- fixed computational cost and memory usage for each interaction
- increasing quality of responses with subsequent steps of dialogue, without "long-term hallucinations"
- natively encoded memory, impossible to read without the model
- extreme pre-training efficiency
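The linear cost scaling claimed above can be illustrated with a simple token count. The sketch below is not part of the RxT codebase: the function names are hypothetical, every turn is assumed to have the same length, and the per-interaction memory read and update is treated as a fixed cost that is left out. It counts only how many tokens each approach has to process over a whole conversation.

```python
def stateless_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total tokens processed when every turn re-reads the full history.

    Turn t has to process t * tokens_per_turn tokens, so the total grows
    quadratically with the number of turns.
    """
    return sum(t * tokens_per_turn for t in range(1, turns + 1))


def reactive_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total tokens processed when each turn reads only the current interaction.

    Assumption: the memory read/update cost is fixed per interaction and is
    not counted here, so the total grows linearly with the number of turns.
    """
    return turns * tokens_per_turn


if __name__ == "__main__":
    for turns in (1, 10, 100):
        s = stateless_tokens(turns, 1000)
        r = reactive_tokens(turns, 1000)
        print(f"{turns:>3} turns: stateless={s:>9,} reactive={r:>7,} ratio={s / r:.1f}x")
```

Under these assumptions, a conversation of 100 turns with 1,000 tokens each requires about 50x fewer processed tokens than the history-reprocessing approach, and the gap keeps widening as the conversation grows.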
In the first small-scale experiments, RxT-Alpha models achieved about 50% higher accuracy and almost 2x lower perplexity than a stateless decoder-only baseline of the same size trained on the same simple synthetic dataset (the decoder-only model was additionally pre-trained on 5x more tokens). These results were then confirmed on a small 10B-token subset of real-world data with ~0.3B models (RxT-Beta Micro), where the RxT advantage was even bigger. These promising results, along with the unique features above, suggest that the Reactive Transformer is a revolutionary generational leap and a crucial milestone on the path to Artificial General Intelligence (AGI), provided they are confirmed at scale, which is what we plan to do with RxT-Beta.
The goal is to compete with dense stateless LLMs of ~1-3B parameters, pre-trained on trillions of tokens, using a model with only 190M active parameters and about 250B pre-training tokens, and to significantly outperform them on long multi-turn conversations.
Base models
Reactive Transformer models require a new, dedicated training pipeline to handle their asynchronous memory and reversed decoder-encoder order. Base models are the result of the first supervised stage: Joint LM Pre-Training with "cheated context" teacher forcing (more info in the Training Process section).
The base decoder (this model) is not a typical generative model. It requires further training and should be connected with the encoder and memory attention network, so this model is only the starting point for the next stages. It is pre-trained for general knowledge (with a focus on reasoning) using textbook-quality datasets and can be further fine-tuned for custom use cases (under the terms of the RAML v1.0 license).
Decoder architecture
- layers: 25 (21 stateful MoE + 3 stateless MoE + 1 stateless dense)
- dim: 512
- self-attention: Gated Sparse Query Attention (SQA) 8/16 query heads & 4/16 key/value heads
- memory cross-attention: Sparse Query Attention (SQA) 8/16 query heads & 4/16 key/value heads
- feed forward: Sparse Mixture-of-Experts (MoE) with gated shared experts
  - routed experts: 384
  - active experts: 10
  - routed expert dim: 192
  - shared experts: 2 with softmax gating
  - shared expert dim: 384
  - activation: SwiGLU
- dense layer: 1536 dim with SwiGLU activation
- vocab: 65k (English + Polish)
- params: 2.85B with 190M activated per token
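As a rough sanity check on the total and active parameter counts, the feed-forward part of the architecture above can be estimated from the listed dimensions. This is a back-of-the-envelope sketch, not the model's actual implementation. It assumes that each SwiGLU block has three bias-free weight matrices (gate, up, and down projections), that the 10 active experts are routed experts selected in addition to the 2 always-active shared experts, and that routers, gates, attention, embeddings, and normalization layers are ignored.

```python
# Dimensions taken from the architecture list above.
DIM = 512
MOE_LAYERS = 21 + 3      # stateful MoE + stateless MoE layers
ROUTED_EXPERTS = 384
ACTIVE_EXPERTS = 10      # assumption: routed experts selected per token
ROUTED_DIM = 192
SHARED_EXPERTS = 2       # assumption: always active for every token
SHARED_DIM = 384
DENSE_DIM = 1536         # the single stateless dense layer


def swiglu_params(dim: int, hidden: int) -> int:
    """Weights in one SwiGLU feed-forward block.

    Assumption: gate, up, and down projections with no biases.
    """
    return 3 * dim * hidden


routed = swiglu_params(DIM, ROUTED_DIM)
shared = swiglu_params(DIM, SHARED_DIM)
dense = swiglu_params(DIM, DENSE_DIM)

# All feed-forward weights stored in the model.
total_ffn = MOE_LAYERS * (ROUTED_EXPERTS * routed + SHARED_EXPERTS * shared) + dense
# Feed-forward weights actually used for a single token.
active_ffn = MOE_LAYERS * (ACTIVE_EXPERTS * routed + SHARED_EXPERTS * shared) + dense

if __name__ == "__main__":
    print(f"feed-forward params (total):  {total_ffn / 1e9:.2f}B")
    print(f"feed-forward params (active): {active_ffn / 1e6:.0f}M")
```

Under these assumptions the feed-forward weights come to roughly 2.75B in total and about 101M active per token, which is consistent with the stated 2.85B / 190M figures once attention, embeddings, and the other components not estimated here are added.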