i3 Model - Memory-Optimized Efficient Conversational Language Model

Model Description

The i3 Model is a memory-optimized language model designed for conversational understanding. This version uses streaming tokenization to minimize RAM usage during training.

Model Statistics

  • Vocabulary Size: 4,466 (variable-length chunks)
  • Hidden Dimension: 512
  • Number of Layers: 24
  • Max Sequence Length: 256
  • Total Parameters: 22,640,626
  • Tokenization: Memory-efficient variable-length chunking (2-3 characters)

To use the model, see user.py in the repository.
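Because the exported artifacts follow the standard HuggingFace layout (see the Deliverables section below), a minimal loading sketch along the following lines should also work. This is an assumption rather than the documented path: the repo id, the use of the transformers auto classes, and trust_remote_code for the custom architecture are unverified here, and user.py remains the authoritative entry point.

```python
# Hedged sketch: assumes the exported files load through the standard
# transformers auto classes; user.py is the authoritative entry point.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "FlameF0X/i3-22m"  # repository id as listed on the model page
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)

prompt = "Hello! How are you today?"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```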

Key Features

  1. Memory-Optimized: Streaming tokenization reduces RAM usage significantly
  2. Proprietary Hybrid Architecture: Advanced sequence processing with linear complexity
  3. Variable-Length Tokenization: Smart chunking strategy for better compression
  4. Conversational Focus: Specialized for dialogue and emotional understanding

Training Details

  • Dataset: TinyChat
  • Training Objective: Next-token prediction with proprietary optimization
  • Framework: PyTorch
  • Memory Optimization: Streaming dataset processing
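As a rough illustration of the streaming dataset processing mentioned above, the sketch below tokenizes conversations lazily with a PyTorch IterableDataset so the full corpus never sits in RAM at once. The file name, encode function, and padding collate are hypothetical stand-ins, not the project's actual training code.

```python
import torch
from torch.utils.data import IterableDataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class StreamingChatDataset(IterableDataset):
    """Lazily streams and tokenizes one conversation per line.
    Hypothetical sketch of the memory-optimized pipeline described above."""
    def __init__(self, path, encode_fn, max_len=256):
        self.path, self.encode_fn, self.max_len = path, encode_fn, max_len

    def __iter__(self):
        with open(self.path, encoding="utf-8") as f:
            for line in f:  # only one line is held in memory at a time
                ids = self.encode_fn(line.strip())[: self.max_len]
                if ids:
                    yield torch.tensor(ids, dtype=torch.long)

def collate(batch):
    # Pad variable-length sequences into a single batch tensor.
    return pad_sequence(batch, batch_first=True, padding_value=0)

# Example wiring (file name and encode function are placeholders):
# loader = DataLoader(StreamingChatDataset("tinychat.txt", encode),
#                     batch_size=8, collate_fn=collate)
```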

Technical Report: i3 Pre-training

  1. Executive Summary
     The i3 model, a small-scale text generation architecture, successfully completed its initial pre-training phase. This training was conducted on an NVIDIA GeForce RTX 3060 and required approximately 17 hours of continuous processing. The resulting model artifacts are configured for deployment on the HuggingFace platform. The model is characterized by a compact architecture featuring 24 layers and a hidden dimension of 512, paired with a custom "chunk" tokenization strategy designed for efficiency on conversational data.
  2. Model Configuration and Architecture
     The i3Model architecture is designed to be highly efficient, likely incorporating elements of a State Space Model (SSM) due to the low-rank and state-space parameters (rank and d_state).
    | Parameter | Value | Description |
    |---|---|---|
    | Model Type | i3Model | Custom, high-efficiency architecture (likely SSM-enhanced). |
    | Hidden Dimension (d_model) | 512 | The size of the vector space for internal representations. |
    | Number of Layers (n_layers) | 24 | The depth of the model's processing blocks. |
    | Attention Heads (n_heads) | 16 | The number of parallel attention mechanisms (if applicable). |
    | State Dimension (d_state) | 64 | The size of the recurrent state, common in SSMs. |
    | Rank | 128 | Potentially used for low-rank projections in attention or state mechanisms. |
    | Max Sequence Length | 256 | The maximum number of tokens/chunks the model can process at once. |
    | Vocabulary Size | 4,466 | The total number of unique chunks/tokens in the vocabulary. |
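The report only infers that the architecture is SSM-enhanced, so the block below is purely illustrative: a hedged sketch of how a recurrent state of size d_state = 64 and a low-rank projection of rank = 128 could be combined inside a d_model = 512 layer with cost linear in sequence length. It is not the i3Model implementation.

```python
import torch
import torch.nn as nn

class LowRankStateBlock(nn.Module):
    """Illustrative only: a simple recurrent block showing where d_state and
    a low-rank projection typically enter an SSM-style layer. Hypothetical,
    not the actual i3Model code."""
    def __init__(self, d_model=512, d_state=64, rank=128):
        super().__init__()
        self.in_proj = nn.Linear(d_model, rank)          # low-rank down-projection
        self.B = nn.Linear(rank, d_state, bias=False)    # input -> state
        self.A = nn.Parameter(torch.eye(d_state) * 0.9)  # state transition
        self.C = nn.Linear(d_state, rank, bias=False)    # state -> output
        self.out_proj = nn.Linear(rank, d_model)         # back to model width

    def forward(self, x):                                # x: (batch, seq, d_model)
        u = self.in_proj(x)
        state = x.new_zeros(x.size(0), self.A.size(0))
        outs = []
        for t in range(x.size(1)):                       # one step per token: linear in seq length
            state = torch.tanh(state @ self.A + self.B(u[:, t]))
            outs.append(self.C(state))
        return x + self.out_proj(torch.stack(outs, dim=1))  # residual connection

# Shape check with the reported hyperparameters:
# LowRankStateBlock()(torch.randn(2, 256, 512)).shape -> (2, 256, 512)
```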
  3. Training Environment and Duration
     The training phase was characterized by high hardware efficiency, achieving a complete pre-training run on consumer-grade hardware in a short timeframe.
  • Hardware Used: NVIDIA GeForce RTX 3060 (12GB VRAM assumed).
  • Total Training Time: Approximately 17 hours.
  • Framework: PyTorch (with HuggingFace Transformers used to generate the final deployment files).
  4. Training Data and Procedure
     Dataset: The model was pre-trained on the TinyChat dataset, which comprised 1,000,000 conversations. This suggests the model is optimized for rapid, short-form conversational tasks.
     Tokenization Strategy: A crucial element of the model's efficiency is its custom tokenization approach (a sketch of the chunking idea follows the metrics table below):
  • Tokenizer Type: chunk
  • Strategy: variable_2_3
  • Vocabulary: The vocabulary size is notably small (4,466 chunks), indicating that the tokenizer is designed to aggregate common sequences of text into single tokens, significantly reducing the effective sequence length and computational cost during training.
     Performance Metrics: Training showed consistent iteration steps, with the log reporting the following metrics as the run concluded:
    | Metric | Range (Last 500 Iterations) | Observation |
    |---|---|---|
    | Loss | 1.98 - 2.27 | Training loss remained relatively stable, suggesting convergence towards the end of the run. |
    | Perplexity (PPL) | 7.29 - 9.70 | Perplexity measures how well the model predicts the next token; this range is typical for raw pre-training logs and indicates the model has learned basic sequence dependencies. |
    | Time per Iteration | ~8.2 s - 12.7 s | Per-iteration processing time shows sustained, efficient training throughput. |
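To make the variable_2_3 strategy concrete, here is a hedged sketch of greedy 2-3 character chunking against a toy vocabulary. The chunk inventory, fallback rule, and special-token convention are illustrative assumptions; the real vocabulary ships in tokenizer.json.

```python
def chunk_encode(text, vocab, max_chunk=3, min_chunk=2):
    """Greedy variable-length (2-3 character) chunking.
    Illustrative stand-in for the 'variable_2_3' strategy; the actual
    vocabulary and fallback rules ship in tokenizer.json."""
    ids, i = [], 0
    unk_id = vocab.get("<unk>", 0)  # assumed unknown-token convention
    while i < len(text):
        for size in range(max_chunk, min_chunk - 1, -1):  # prefer 3-char chunks, then 2
            piece = text[i:i + size]
            if piece in vocab:
                ids.append(vocab[piece])
                i += size
                break
        else:  # no known chunk starts here: fall back to a single character
            ids.append(vocab.get(text[i], unk_id))
            i += 1
    return ids

# Toy example (vocabulary entries are made up):
toy_vocab = {"hel": 1, "lo ": 2, "wo": 3, "rld": 4, "<unk>": 0}
print(chunk_encode("hello world", toy_vocab))  # -> [1, 2, 3, 4]
```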
  5. Deliverables
     Upon completion, the necessary files for deployment were generated into the i3_model_hf/ directory, ensuring immediate compatibility with the HuggingFace ecosystem:
  • pytorch_model.bin (Model Weights)
  • config.json (Model Configuration)
  • tokenizer.json (Vocabulary File)
  • tokenizer_config.json (Tokenizer Configuration)
     The model is now ready for fine-tuning on a specific downstream task or for evaluation of its foundational text generation capabilities.
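As a starting point for that evaluation, the sketch below computes cross-entropy on a held-out snippet and reports perplexity as exp(loss), the same relation that ties together the loss and PPL ranges in the metrics table above. The local directory name comes from the report, but the assumption that the custom model returns a .loss when given labels (the usual transformers convention) is unverified.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hedged evaluation sketch: load the exported artifacts (directory name from
# the report) and measure perplexity = exp(mean cross-entropy).
model_dir = "i3_model_hf"
tokenizer = AutoTokenizer.from_pretrained(model_dir, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_dir, trust_remote_code=True).eval()

text = "Hi there! What are you up to today?"  # replace with real held-out data
enc = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes a standard causal LM return the mean
    # next-token cross-entropy; assumed to hold for this custom model.
    loss = model(**enc, labels=enc["input_ids"]).loss

print(f"loss={loss.item():.3f}  perplexity={math.exp(loss.item()):.2f}")
```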
