QVAC Genesis I Pretrained Model

Key Highlights

  • Pretrained on the Largest Synthetic Educational Dataset
    This model has been pretrained on Tether's QVAC Genesis I, the largest synthetic dataset released for educational LLM pre-training.

    The model was trained from scratch on approximately 40B tokens of multi-domain educational text, using BF16 mixed precision and a 4,096-token context window. Training used a Qwen3-family 1.7B-parameter decoder-only transformer architecture.

    Checkpoints are provided in standard Hugging Face format for easy inference, continual pre-training, and fine-tuning.

  • Multi-Domain Educational Coverage
    Because the model is trained on QVAC Genesis I, it inherits curriculum-aligned coverage across:

    • Mathematics
    • Physics
    • Biology
    • Medicine
  • Superior Benchmark Performance
    Leveraging QVAC Genesis I as its training foundation, the model consistently outperforms baselines in:

    • Reasoning tasks
    • Knowledge assessments
    • Subject-specific QA
  • First Publicly Released Education-Specific Pretrained Model
    This is the first open-source pretrained model built directly on a rigorously validated synthetic dataset for education, offering deep and comprehensive STEM coverage.

Intended Uses

  • Continual pre-training or fine-tuning for educational applications (STEM-focused tutoring, QA systems, curriculum support)
  • Benchmarking reasoning and subject-specific QA performance
  • Research into synthetic-dataset-driven LLM training

Model Details

Model Description

  • Developed by: QVAC by Tether
  • Model type: Decoder-only Transformer (causal LM)
  • Language(s) (NLP): Primarily English
  • License: Apache-2.0
  • Finetuned from model: None (trained from scratch)
  • Intended stage: Base pre-trained model (no SFT / RLHF alignment)

Dataset Details


Uses

Direct Use

  • General language modeling: next-token prediction, continuation, summarization, drafting.
  • Research baseline for scaling, data ablations, or tokenizer studies.

Downstream Use (recommended)

  • Continued pre-training (CPT) on more tokens.
  • SFT for assistants, domain experts, or task-specific models.
  • Preference optimization / RLHF for safer, more helpful behavior.
  • Adapters/LoRA for efficient domain specialization (see the sketch below).
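
For the adapter route above, a minimal LoRA sketch using the Hugging Face peft library. The rank, dropout, and target module names are illustrative assumptions (common for Qwen-style attention projections), not values published with this model.

# Minimal LoRA adapter setup for domain specialization (illustrative values).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "qvac/genesisI-model"

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

lora_cfg = LoraConfig(
    r=16,                     # adapter rank (assumed, not a published value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Train with your usual SFT loop or transformers.Trainer, then export the adapter:
# model.save_pretrained("genesis-i-lora-adapter")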

Out-of-Scope Use

  • High-stakes decision-making (medical/financial/legal).
  • Safety-critical or autonomous control systems.
  • Unfiltered end-user chat deployment without alignment / safety layers.
  • Any use that violates applicable laws or platform policies.

Bias, Risks, and Limitations

  • Bias & toxicity: May reflect or amplify biases present in web text.
  • Hallucinations: Can produce confident but incorrect statements or citations.
  • Security / privacy: May emit continuous runs of random-looking strings.
  • Context limit: 4,096 tokens; longer inputs require chunking.

Recommendations

  • Disclose limitations to downstream users.
  • Research model: not intended for production use.

How to Get Started

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "qvac/genesisI-model"

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # trained with BF16 mixed precision
    device_map="auto"
)

prompt = "Explain precision vs. recall in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.9,
    temperature=0.7
)
print(tok.decode(out[0], skip_special_tokens=True))

Tip: On consumer GPUs, consider loading in float16 or using 4/8-bit quantization (e.g., bitsandbytes/AutoGPTQ).
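
For the quantized path in the tip above, a minimal bitsandbytes 4-bit loading sketch; the NF4 settings are common defaults, not values validated for this checkpoint.

# 4-bit NF4 loading via bitsandbytes (pip install bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "qvac/genesisI-model"

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_cfg,
    device_map="auto",
)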


Training Details

Training Data

  • Size: ~40B tokens, single epoch.
  • Domains: Mixed general + STEM/technical sources (expository text, problem sets, references).
  • Format: Hugging Face Datasets (Arrow).
  • Tokenizer: Qwen3 tokenizer.
  • Processing: Normalization, filtering of extremes, document chunking to fit 4096 context, sequence packing where applicable.
  • Dataset Card: Coming Soon

Training Procedure

Preprocessing

  • Unicode normalization, whitespace cleanup, control-char stripping.
  • Length filtering; chunking to the 4,096-token context; optional sequence packing to improve throughput (sketched below).
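
A sketch of the chunk-and-pack step above using Hugging Face datasets; the data source path is a placeholder, and this illustrates the general recipe rather than the exact production pipeline.

# Tokenize, concatenate, and split documents into fixed 4,096-token blocks.
from itertools import chain
from datasets import load_dataset
from transformers import AutoTokenizer

CTX = 4096
tok = AutoTokenizer.from_pretrained("qvac/genesisI-model", trust_remote_code=True)
ds = load_dataset("json", data_files="corpus.jsonl", split="train")  # placeholder

def tokenize(batch):
    return tok(batch["text"])

def pack(batch):
    # Flatten all token ids in the batch, then cut into CTX-sized chunks.
    ids = list(chain.from_iterable(batch["input_ids"]))
    total = (len(ids) // CTX) * CTX
    chunks = [ids[i : i + CTX] for i in range(0, total, CTX)]
    return {"input_ids": chunks, "labels": [list(c) for c in chunks]}

tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
packed = tokenized.map(pack, batched=True, remove_columns=tokenized.column_names)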

Training Hyperparameters

  • Optimizer: AdamW (β₁=0.9, β₂=0.95), weight decay 0.01 (optimizer/scheduler sketch after this list)
  • Learning rate: 2e-4 (linear warmup)
  • Warmup: 600 steps (~10% of max steps)
  • Precision: BF16 mixed precision
  • Gradient clipping: 1.0
  • Seed: 42
  • Logging: Every 50 steps
  • Eval: Every 500 steps (20 iters)
  • Checkpointing: Every 1000 steps (sharded; full optimizer/state resume)
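
The optimizer and schedule above map onto standard PyTorch / transformers utilities. A minimal sketch, with the caveats that the total step count and the data loader are placeholders, and that linear decay after warmup is one common choice rather than a documented detail of this run:

# AdamW with betas (0.9, 0.95), weight decay 0.01, lr 2e-4, 600 warmup steps,
# gradient clipping at 1.0, matching the hyperparameters listed above.
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "qvac/genesisI-model", torch_dtype=torch.bfloat16
)

max_steps = 10_000  # placeholder; the actual training length is not restated here
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.9, 0.95), weight_decay=0.01
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=600, num_training_steps=max_steps
)

for step, batch in enumerate(train_loader):  # train_loader: your packed DataLoader
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()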

Speeds, Sizes, Times

  • Per-GPU micro-batch: 4
  • Grad accumulation: 8
  • World size: 480 GPUs
  • Effective global batch: 4 × 8 × 480 = 15,360 samples/step (tokens-per-step arithmetic below)
  • Step time (indicative): ~1.5 s/step (cluster/I/O dependent)
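
Working the batch arithmetic above through to tokens per optimizer step, assuming fully packed 4,096-token sequences:

# Effective batch and token throughput per optimizer step, from the figures above.
micro_batch = 4       # sequences per GPU per micro-step
grad_accum  = 8
world_size  = 480     # GPUs
context_len = 4096    # tokens per packed sequence

global_batch = micro_batch * grad_accum * world_size   # 15,360 sequences/step
tokens_per_step = global_batch * context_len           # 62,914,560 (~62.9M tokens/step)

print(global_batch, tokens_per_step)  # 15360 62914560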

Stability & Performance

  • Activation checkpointing.
  • Fused kernels where available (fused attention/optimizer).
  • FlashAttention-2 on H100.
  • torch.compile (safe mode), enabled after warmup once training is stable.
  • Dynamic loss scaling to mitigate BF16 overflow.
  • Fragmentation mitigations (e.g., max_split_size_mb=512, expandable segments, GC threshold ~0.8); see the sketch below.
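
A sketch of the memory and stability knobs above expressed in PyTorch terms. The allocator string mirrors the values in the list (it must be set before CUDA is initialized), and "safe mode" for torch.compile is interpreted here as the default, conservative settings; none of this is taken verbatim from the training code.

# Allocator fragmentation mitigations, activation checkpointing, and torch.compile.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:512,expandable_segments:True,garbage_collection_threshold:0.8"
)

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "qvac/genesisI-model", torch_dtype=torch.bfloat16
)
model.gradient_checkpointing_enable()  # activation checkpointing
model = torch.compile(model)           # default settings ("safe mode" per the list)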

Multi-Node GPU Setup

  • Cluster: ~60 nodes, each 8× NVIDIA H100 80GB (total 480 GPUs), ~800 GB RAM/node.

  • Scheduler: Slurm (priority partition, exclusive allocation, 72-hour limit).

  • Launch: srun + PyTorch DDP (world size 480; ranks bound via Slurm env; rank-binding sketch below).

  • Storage: Sharded checkpoints; periodic saves for robust resume.

  • Networking: NCCL over InfiniBand with UCX

    • NCCL_IB_DISABLE=0, NCCL_IB_HCA="mlx5*", NCCL_SOCKET_IFNAME=<ib0/enoX>, NCCL_BLOCKING_WAIT=1
    • Watchdog ~720s for fail-fast on fabric issues
  • I/O: Async dataset prefetching; pinned FS threads.

  • Observability: W&B + structured logs (throughput, TFLOPs/GPU, mem, step time).

  • Reproducibility: Fixed seeds; exact launch scripts/env logged; effective tokens/step reported.
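
A minimal sketch of binding ranks from Slurm environment variables into torch.distributed, as referenced in the launch bullet above; the variable names follow standard Slurm conventions and are not taken from the actual training script.

# Minimal DDP process-group setup from Slurm-provided environment variables.
# MASTER_ADDR / MASTER_PORT must be exported by the job script for rendezvous.
import os
import torch
import torch.distributed as dist

rank       = int(os.environ["SLURM_PROCID"])   # global rank, 0..479
local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index within the node, 0..7
world_size = int(os.environ["SLURM_NTASKS"])   # 480 in this setup

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])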

Final checkpoint converted to Hugging Face format for plug-and-play inference.


Evaluation

Testing Data, Factors & Metrics

  • Testing data: Standard academic benchmarks, run via the EleutherAI LM Evaluation Harness (invocation sketch after the suggested suite below).
  • Factors: Domain/topic (STEM vs. general), task type (multi-choice vs. open-ended).
  • Metrics: Accuracy (MCQ), EM/F1 (QA), plus task-native metrics.

Suggested suite (edit as applicable):

  • General knowledge & reasoning: MMLU (STEM subsets), ARC-E/ARC-C, HellaSwag, PIQA, Winogrande
  • Math/coding (optional): GSM8K, HumanEval
  • Reading comprehension (optional): BoolQ, RACE
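
A minimal LM Evaluation Harness invocation sketch for the suite above. The Python API shown assumes a recent lm-eval 0.4.x release; task names, few-shot counts, and batch size are illustrative, not the pinned protocol that will accompany published results.

# Zero-shot evaluation via the EleutherAI LM Evaluation Harness (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=qvac/genesisI-model,dtype=bfloat16,trust_remote_code=True",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "piqa", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])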

Results

  • To be released with an evaluated checkpoint and harness version pin. Include tables with exact versions, seeds, and commit hashes.

Summary

  • Base LM targets broad generalization at ~40B tokens.
  • Expect material gains after SFT + preference optimization for target tasks.

Technical Specifications

Model Architecture and Objective

  • Architecture: Qwen3-style decoder-only Transformer
  • Parameters: ~1.7B
  • Context length: 4,096 tokens
  • Positional encoding: Rotary (RoPE), per the Qwen3-style architecture
  • Attention: Multi-head scaled dot-product; FlashAttention-2 enabled on H100
  • Activation: SiLU (SwiGLU), per the Qwen3-style architecture
  • Norms: RMSNorm, per the Qwen3-style architecture
  • Objective: Causal LM (next-token prediction); a loss sketch follows below
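
The objective above is standard next-token cross-entropy. A minimal sketch of the shift-by-one loss, equivalent to what AutoModelForCausalLM computes internally when labels are supplied:

# Causal LM loss: predict token t+1 from tokens <= t.
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); input_ids: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]    # targets are the next tokens (1..T-1)
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )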

Compute Infrastructure

Hardware

  • 60 nodes × 8× H100 80GB, ~800 GB RAM/node, InfiniBand fabric.

Software

  • PyTorch โ‰ฅ 2.1 (CUDA 12.x), FlashAttention-2, UCX/NCCL
  • Slurm for orchestration; W&B for logging
  • (Optional) DeepSpeed/Zero-3 for training; HF conversion post-train

Reproducibility (Launch Sketch)

# Slurm (illustrative)
srun -N 60 -n 480 --ntasks-per-node=8 --gpus-per-task=1 \
  --cpus-per-task=8 --mem=0 \
  bash -lc '
  export NCCL_IB_DISABLE=0
  export NCCL_IB_HCA="mlx5*"
  export NCCL_SOCKET_IFNAME=ib0
  export NCCL_BLOCKING_WAIT=1
  export TORCH_DISTRIBUTED_DEBUG=DETAIL

  python train.py \
    --model qwen3_1p7b_from_scratch \
    --tokenizer qwen3 \
    --data_path /path/to/arrow \
    --context_length 4096 \
    --optimizer adamw --weight_decay 0.01 \
    --lr 2e-4 --warmup_steps 600 \
    --precision bf16-mixed \
    --micro_batch_size 4 \
    --grad_accum_steps 8 \
    --eval_every 500 --log_every 50 \
    --ckpt_every 1000 \
    --activation_checkpointing \
    --flash_attn 2 \
    --compile safe \
    --seed 42
'

Conversion & Inference

  • Checkpoints are HF-compatible: load with AutoModelForCausalLM.
  • For memory-limited environments, prefer half-precision or 4/8-bit loading.
  • Distribute as safetensors for integrity (re-serialization sketch below).
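
For the safetensors point above, a minimal re-serialization and reload sketch; the output directory is a placeholder.

# Save the checkpoint as safetensors shards (the transformers default) and reload.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qvac/genesisI-model"
out_dir = "./genesis-i-safetensors"  # placeholder path

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

model.save_pretrained(out_dir, safe_serialization=True)  # writes *.safetensors
tok.save_pretrained(out_dir)

reloaded = AutoModelForCausalLM.from_pretrained(out_dir, torch_dtype=torch.bfloat16)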

Changelog

  • v0.1 (2025-11-17): Initial public release; 40B-token, 1-epoch pretrain; HF conversion.