QVAC Genesis I Pretrained Model

Key Highlights

  • Pretrained on the Largest Synthetic Educational Dataset
    This model has been pretrained on Tether's QVAC Genesis I, the largest synthetic dataset released for educational LLM pre-training.

    The model was trained from scratch on approximately 40B tokens of multi-domain educational text, using BF16 mixed precision and a 4,096-token context window. Training used a Qwen3-family 1.7B-parameter decoder-only transformer architecture.

    Checkpoints are provided in standard Hugging Face format for easy inference, continual pre-training, and fine-tuning.

  • Multi-Domain Educational Coverage
    Because the model is trained on QVAC Genesis I, it inherits curriculum-aligned coverage across:

    • Mathematics
    • Physics
    • Biology
    • Medicine
  • Superior Benchmark Performance
    Leveraging QVAC Genesis I as its training foundation, the model consistently outperforms baselines in:

    • Reasoning tasks
    • Knowledge assessments
    • Subject-specific QA
  • First Publicly Released Education-Specific Pretrained Model
    This is the first open-source pretrained model built directly on a rigorously validated synthetic dataset for education, offering deep and comprehensive STEM coverage.

Intended Uses

  • Continual pre-training or fine-tuning for educational applications (STEM-focused tutoring, QA systems, curriculum support)
  • Benchmarking reasoning and subject-specific QA performance
  • Research into synthetic-dataset-driven LLM training

Model Details

Model Description

  • Developed by: QVAC by Tether
  • Model type: Decoder-only Transformer (causal LM)
  • Language(s) (NLP): Primarily English
  • License: Apache-2.0
  • Finetuned from model: None (trained from scratch)
  • Intended stage: Base pre-trained model (no SFT / RLHF alignment)

Dataset Details


Uses

Direct Use

  • General language modeling: next-token prediction, continuation, summarization, drafting.
  • Research baseline for scaling, data ablations, or tokenizer studies.

Downstream Use (recommended)

  • Continued pre-training (CPT) on more tokens.
  • SFT for assistants, domain experts, or task-specific models.
  • Preference optimization / RLHF for safer, more helpful behavior.
  • Adapters/LoRA for efficient domain specialization (see the sketch below).
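
For the adapter route above, a minimal LoRA sketch using the Hugging Face peft library. The rank, dropout, and target module names are illustrative assumptions (common for Qwen-style attention projections), not values published with this model.

# Minimal LoRA adapter setup for domain specialization (illustrative values).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_id = "qvac/genesisI-model"

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
base = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

lora_cfg = LoraConfig(
    r=16,                     # adapter rank (assumed, not a published value)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed names
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Train with your usual SFT loop or transformers.Trainer, then export the adapter:
# model.save_pretrained("genesis-i-lora-adapter")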

Out-of-Scope Use

  • High-stakes decision-making (medical/financial/legal).
  • Safety-critical or autonomous control systems.
  • Unfiltered end-user chat deployment without alignment / safety layers.
  • Any use that violates applicable laws or platform policies.

Bias, Risks, and Limitations

  • Bias & toxicity: May reflect or amplify biases present in web text.
  • Hallucinations: Can produce confident but incorrect statements or citations.
  • Security / privacy: May emit continuous runs of random-looking strings.
  • Context limit: 4,096 tokens; longer inputs require chunking.

Recommendations

  • Disclose limitations to downstream users.
  • Research model: not intended for production use.

How to Get Started

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "qvac/genesisI-model"

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # trained with BF16 mixed precision
    device_map="auto"
)

prompt = "Explain precision vs. recall in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.9,
    temperature=0.7
)
print(tok.decode(out[0], skip_special_tokens=True))

Tip: On consumer GPUs, consider loading in float16 or using 4/8-bit quantization (e.g., bitsandbytes/AutoGPTQ).
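
For the quantized path in the tip above, a minimal bitsandbytes 4-bit loading sketch; the NF4 settings are common defaults, not values validated for this checkpoint.

# 4-bit NF4 loading via bitsandbytes (pip install bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "qvac/genesisI-model"

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_cfg,
    device_map="auto",
)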


Training Details

Training Data

  • Size: ~40B tokens, single epoch.
  • Domains: Mixed general + STEM/technical sources (expository text, problem sets, references).
  • Format: Hugging Face Datasets (Arrow).
  • Tokenizer: Qwen3 tokenizer.
  • Processing: Normalization, filtering of extremes, document chunking to fit 4096 context, sequence packing where applicable.
  • Dataset Card: Coming Soon

Training Procedure

Preprocessing

  • Unicode normalization, whitespace cleanup, control-char stripping.
  • Length filtering; chunking to the 4,096-token context; optional sequence packing to improve throughput (sketched below).
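
A sketch of the chunk-and-pack step above using Hugging Face datasets; the data source path is a placeholder, and this illustrates the general recipe rather than the exact production pipeline.

# Tokenize, concatenate, and split documents into fixed 4,096-token blocks.
from itertools import chain
from datasets import load_dataset
from transformers import AutoTokenizer

CTX = 4096
tok = AutoTokenizer.from_pretrained("qvac/genesisI-model", trust_remote_code=True)
ds = load_dataset("json", data_files="corpus.jsonl", split="train")  # placeholder

def tokenize(batch):
    return tok(batch["text"])

def pack(batch):
    # Flatten all token ids in the batch, then cut into CTX-sized chunks.
    ids = list(chain.from_iterable(batch["input_ids"]))
    total = (len(ids) // CTX) * CTX
    chunks = [ids[i : i + CTX] for i in range(0, total, CTX)]
    return {"input_ids": chunks, "labels": [list(c) for c in chunks]}

tokenized = ds.map(tokenize, batched=True, remove_columns=ds.column_names)
packed = tokenized.map(pack, batched=True, remove_columns=tokenized.column_names)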

Training Hyperparameters

  • Optimizer: AdamW (β₁=0.9, β₂=0.95), weight decay 0.01 (optimizer/scheduler sketch after this list)
  • Learning rate: 2e-4 (linear warmup)
  • Warmup: 600 steps (~10% of max steps)
  • Precision: BF16 mixed precision
  • Gradient clipping: 1.0
  • Seed: 42
  • Logging: Every 50 steps
  • Eval: Every 500 steps (20 iters)
  • Checkpointing: Every 1000 steps (sharded; full optimizer/state resume)
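
The optimizer and schedule above map onto standard PyTorch / transformers utilities. A minimal sketch, with the caveats that the total step count and the data loader are placeholders, and that linear decay after warmup is one common choice rather than a documented detail of this run:

# AdamW with betas (0.9, 0.95), weight decay 0.01, lr 2e-4, 600 warmup steps,
# gradient clipping at 1.0, matching the hyperparameters listed above.
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained(
    "qvac/genesisI-model", torch_dtype=torch.bfloat16
)

max_steps = 10_000  # placeholder; the actual training length is not restated here
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.9, 0.95), weight_decay=0.01
)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=600, num_training_steps=max_steps
)

for step, batch in enumerate(train_loader):  # train_loader: your packed DataLoader
    loss = model(**batch).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()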

Speeds, Sizes, Times

  • Per-GPU micro-batch: 4
  • Grad accumulation: 8
  • World size: 480 GPUs
  • Effective global batch: 4 × 8 × 480 = 15,360 samples/step (tokens-per-step arithmetic below)
  • Step time (indicative): ~1.5 s/step (cluster/I/O dependent)
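
Working the batch arithmetic above through to tokens per optimizer step, assuming fully packed 4,096-token sequences:

# Effective batch and token throughput per optimizer step, from the figures above.
micro_batch = 4       # sequences per GPU per micro-step
grad_accum  = 8
world_size  = 480     # GPUs
context_len = 4096    # tokens per packed sequence

global_batch = micro_batch * grad_accum * world_size   # 15,360 sequences/step
tokens_per_step = global_batch * context_len           # 62,914,560 (~62.9M tokens/step)

print(global_batch, tokens_per_step)  # 15360 62914560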

Stability & Performance

  • Activation checkpointing.
  • Fused kernels where available (fused attention/optimizer).
  • FlashAttention-2 on H100.
  • torch.compile (safe mode), enabled after warmup once training is stable.
  • Dynamic loss scaling to mitigate BF16 overflow.
  • Fragmentation mitigations (e.g., max_split_size_mb=512, expandable segments, GC threshold ~0.8); see the sketch below.
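
A sketch of the memory and stability knobs above expressed in PyTorch terms. The allocator string mirrors the values in the list (it must be set before CUDA is initialized), and "safe mode" for torch.compile is interpreted here as the default, conservative settings; none of this is taken verbatim from the training code.

# Allocator fragmentation mitigations, activation checkpointing, and torch.compile.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:512,expandable_segments:True,garbage_collection_threshold:0.8"
)

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "qvac/genesisI-model", torch_dtype=torch.bfloat16
)
model.gradient_checkpointing_enable()  # activation checkpointing
model = torch.compile(model)           # default settings ("safe mode" per the list)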

Multi-Node GPU Setup

  • Cluster: ~60 nodes, each 8× NVIDIA H100 80GB (total 480 GPUs), ~800 GB RAM/node.

  • Scheduler: Slurm (priority partition, exclusive allocation, 72-hour limit).

  • Launch: srun + PyTorch DDP (world size 480; ranks bound via Slurm env; rank-binding sketch below).

  • Storage: Sharded checkpoints; periodic saves for robust resume.

  • Networking: NCCL over InfiniBand with UCX

    • NCCL_IB_DISABLE=0, NCCL_IB_HCA="mlx5*", NCCL_SOCKET_IFNAME=<ib0/enoX>, NCCL_BLOCKING_WAIT=1
    • Watchdog ~720s for fail-fast on fabric issues
  • I/O: Async dataset prefetching; pinned FS threads.

  • Observability: W&B + structured logs (throughput, TFLOPs/GPU, mem, step time).

  • Reproducibility: Fixed seeds; exact launch scripts/env logged; effective tokens/step reported.
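
A minimal sketch of binding ranks from Slurm environment variables into torch.distributed, as referenced in the launch bullet above; the variable names follow standard Slurm conventions and are not taken from the actual training script.

# Minimal DDP process-group setup from Slurm-provided environment variables.
# MASTER_ADDR / MASTER_PORT must be exported by the job script for rendezvous.
import os
import torch
import torch.distributed as dist

rank       = int(os.environ["SLURM_PROCID"])   # global rank, 0..479
local_rank = int(os.environ["SLURM_LOCALID"])  # GPU index within the node, 0..7
world_size = int(os.environ["SLURM_NTASKS"])   # 480 in this setup

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])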

Final checkpoint converted to Hugging Face format for plug-and-play inference.


Evaluation

Testing Data, Factors & Metrics

  • Testing data: Standard academic benchmarks, run via the EleutherAI LM Evaluation Harness (invocation sketch after the suggested suite below).
  • Factors: Domain/topic (STEM vs. general), task type (multi-choice vs. open-ended).
  • Metrics: Accuracy (MCQ), EM/F1 (QA), plus task-native metrics.

Suggested suite (edit as applicable):

  • General knowledge & reasoning: MMLU (STEM subsets), ARC-E/ARC-C, HellaSwag, PIQA, Winogrande
  • Math/coding (optional): GSM8K, HumanEval
  • Reading comprehension (optional): BoolQ, RACE
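
A minimal LM Evaluation Harness invocation sketch for the suite above. The Python API shown assumes a recent lm-eval 0.4.x release; task names, few-shot counts, and batch size are illustrative, not the pinned protocol that will accompany published results.

# Zero-shot evaluation via the EleutherAI LM Evaluation Harness (pip install lm-eval).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=qvac/genesisI-model,dtype=bfloat16,trust_remote_code=True",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "piqa", "winogrande"],
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])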

Results

  • To be released with an evaluated checkpoint and harness version pin. Include tables with exact versions, seeds, and commit hashes.

Summary

  • Base LM targets broad generalization at ~40B tokens.
  • Expect material gains after SFT + preference optimization for target tasks.

Technical Specifications

Model Architecture and Objective

  • Architecture: Qwen3-style decoder-only Transformer
  • Parameters: ~1.7B
  • Context length: 4,096 tokens
  • Positional encoding: Rotary (RoPE), per the Qwen3-style architecture
  • Attention: Multi-head scaled dot-product; FlashAttention-2 enabled on H100
  • Activation: SiLU (SwiGLU), per the Qwen3-style architecture
  • Norms: RMSNorm, per the Qwen3-style architecture
  • Objective: Causal LM (next-token prediction); a loss sketch follows below
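
The objective above is standard next-token cross-entropy. A minimal sketch of the shift-by-one loss, equivalent to what AutoModelForCausalLM computes internally when labels are supplied:

# Causal LM loss: predict token t+1 from tokens <= t.
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq_len, vocab); input_ids: (batch, seq_len)
    shift_logits = logits[:, :-1, :]   # predictions for positions 0..T-2
    shift_labels = input_ids[:, 1:]    # targets are the next tokens (1..T-1)
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )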

Compute Infrastructure

Hardware

  • 60 nodes × 8× H100 80GB, ~800 GB RAM/node, InfiniBand fabric.

Software

  • PyTorch โ‰ฅ 2.1 (CUDA 12.x), FlashAttention-2, UCX/NCCL
  • Slurm for orchestration; W&B for logging
  • (Optional) DeepSpeed/Zero-3 for training; HF conversion post-train

Reproducibility (Launch Sketch)

# Slurm (illustrative)
srun -N 60 -n 480 --ntasks-per-node=8 --gpus-per-task=1 \
  --cpus-per-task=8 --mem=0 \
  bash -lc '
  export NCCL_IB_DISABLE=0
  export NCCL_IB_HCA="mlx5*"
  export NCCL_SOCKET_IFNAME=ib0
  export NCCL_BLOCKING_WAIT=1
  export TORCH_DISTRIBUTED_DEBUG=DETAIL

  python train.py \
    --model qwen3_1p7b_from_scratch \
    --tokenizer qwen3 \
    --data_path /path/to/arrow \
    --context_length 4096 \
    --optimizer adamw --weight_decay 0.01 \
    --lr 2e-4 --warmup_steps 600 \
    --precision bf16-mixed \
    --micro_batch_size 4 \
    --grad_accum_steps 8 \
    --eval_every 500 --log_every 50 \
    --ckpt_every 1000 \
    --activation_checkpointing \
    --flash_attn 2 \
    --compile safe \
    --seed 42
'

Conversion & Inference

  • Checkpoints are HF-compatible: load with AutoModelForCausalLM.
  • For memory-limited environments, prefer half-precision or 4/8-bit loading.
  • Distribute as safetensors for integrity (re-serialization sketch below).
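
For the safetensors point above, a minimal re-serialization and reload sketch; the output directory is a placeholder.

# Save the checkpoint as safetensors shards (the transformers default) and reload.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qvac/genesisI-model"
out_dir = "./genesis-i-safetensors"  # placeholder path

tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

model.save_pretrained(out_dir, safe_serialization=True)  # writes *.safetensors
tok.save_pretrained(out_dir)

reloaded = AutoModelForCausalLM.from_pretrained(out_dir, torch_dtype=torch.bfloat16)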

Changelog

  • v0.1 (2025-11-17): Initial public release; 40B-token, 1-epoch pretrain; HF conversion.