QVAC Genesis I Pretrained Model
Key Highlights
Pretrained on the Largest Synthetic Educational Dataset
This model has been pretrained on Tether's QVAC Genesis I, the largest synthetic dataset released for educational LLM pre-training. The model was trained from scratch on approximately 40B tokens of multi-domain educational text, using BF16 mixed precision and a 4,096-token context window. Training used a Qwen3-family 1.7B-parameter decoder-only transformer architecture.
Checkpoints are provided in standard Hugging Face format for easy inference, continual pre-training, and fine-tuning.
Multi-Domain Educational Coverage
Because the model is trained on QVAC Genesis I, it inherits curriculum-aligned coverage across:
- Mathematics
- Physics
- Biology
- Medicine
Superior Benchmark Performance
Leveraging QVAC Genesis I as its training foundation, the model consistently outperforms baselines in:
- Reasoning tasks
- Knowledge assessments
- Subject-specific QA
First Publicly Released Education-Specific Pretrained Model
This is the first open-source pretrained model built directly on a rigorously validated synthetic dataset for education, offering deep and comprehensive STEM coverage.
Intended Uses
- Continual pre-training or fine-tuning for educational applications (STEM-focused tutoring, QA systems, curriculum support)
- Benchmarking reasoning and subject-specific QA performance
- Research into synthetic dataset-driven LLM training
Model Details
Model Description
- Developed by: QVAC by Tether
- Model type: Decoder-only Transformer (causal LM)
- Language(s) (NLP): Primarily English
- License: Apache-2.0
- Finetuned from model: None (trained from scratch)
- Intended stage: Base pre-trained model (no SFT / RLHF alignment)
Model Sources
- Repository: https://huggingface.co/qvac/genesisI-model
- Paper / Blog: https://huggingface.co/blog/qvac/genesis-i
Uses
Direct Use
- General language modeling: next-token prediction, continuation, summarization, drafting.
- Research baseline for scaling, data ablations, or tokenizer studies.
Downstream Use (recommended)
- Continued pre-training (CPT) on more tokens.
- SFT for assistants, domain experts, or task-specific models.
- Preference optimization / RLHF for safer, more helpful behavior.
- Adapters/LoRA for efficient domain specialization.
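As a hedged illustration of the LoRA route above, the sketch below attaches adapters with the PEFT library; the target module names (q_proj, k_proj, v_proj, o_proj) are an assumption based on Qwen-style attention blocks, and the rank/alpha values are placeholders rather than recommended settings.

import torch
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "qvac/genesisI-model",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Assumed projection names for a Qwen3-style attention block.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights are trainable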
Out-of-Scope Use
- High-stakes decision-making (medical/financial/legal).
- Safety-critical or autonomous control systems.
- Unfiltered end-user chat deployment without alignment / safety layers.
- Any use that violates applicable laws or platform policies.
Bias, Risks, and Limitations
- Bias & toxicity: May reflect or amplify biases present in web text.
- Hallucinations: Can produce confident but incorrect statements or citations.
- Security / privacy: May emit long runs of seemingly random strings.
- Context limit: 4,096 tokens; longer inputs require chunking.
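A minimal sketch of the chunking mentioned above, assuming a simple overlapping sliding window; the window and stride sizes are illustrative, not prescribed.

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("qvac/genesisI-model", trust_remote_code=True)

def chunk_ids(text, max_len=4096, stride=256):
    # Split a long document into overlapping windows that fit the 4,096-token context.
    ids = tok(text, add_special_tokens=False)["input_ids"]
    step = max_len - stride
    return [ids[i:i + max_len] for i in range(0, max(len(ids) - stride, 1), step)]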
Recommendations
- Disclose limitations to downstream users.
- Research model: not intended for production use cases.
How to Get Started
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "qvac/genesisI-model"
tok = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # trained with BF16 mixed precision
    device_map="auto",
)

prompt = "Explain precision vs. recall in one paragraph."
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=True,
    top_p=0.9,
    temperature=0.7,
)
print(tok.decode(out[0], skip_special_tokens=True))
Tip: On consumer GPUs, consider loading in float16 or using 4/8-bit quantization (e.g., bitsandbytes/AutoGPTQ).
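Building on the tip above, a minimal 4-bit loading sketch via transformers' BitsAndBytesConfig (assumes bitsandbytes is installed and a CUDA GPU is available):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # weights in 4-bit NF4, compute in BF16
)
model = AutoModelForCausalLM.from_pretrained(
    "qvac/genesisI-model",
    quantization_config=bnb_cfg,
    device_map="auto",
    trust_remote_code=True,
)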
Training Details
Training Data
- Size: ~40B tokens, single epoch.
- Domains: Mixed general + STEM/technical sources (expository text, problem sets, references).
- Format: Hugging Face Datasets (Arrow).
- Tokenizer: Qwen3 tokenizer.
- Processing: Normalization, filtering of extremes, document chunking to fit 4096 context, sequence packing where applicable.
- Dataset Card: Coming Soon
Training Procedure
Preprocessing
- Unicode normalization, whitespace cleanup, control-char stripping.
- Length filtering; chunking to 4096; optional packing to improve throughput.
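A minimal sketch of these preprocessing steps, assuming NFC normalization and naive fixed-length chunking; the exact filters and packing logic used in training are not published here.

import re
import unicodedata

def preprocess(doc, tok, max_len=4096, min_chars=200):
    # Unicode normalization, control-character stripping, whitespace cleanup.
    text = unicodedata.normalize("NFC", doc)
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
    text = re.sub(r"[ \t]+", " ", text).strip()
    if len(text) < min_chars:  # illustrative length filter
        return []
    # Tokenize, append EOS, and chunk to the 4,096-token training context.
    ids = tok(text, add_special_tokens=False)["input_ids"] + [tok.eos_token_id]
    return [ids[i:i + max_len] for i in range(0, len(ids), max_len)]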
Training Hyperparameters
- Optimizer: AdamW (β₁=0.9, β₂=0.95), weight decay 0.01
- Learning rate: 2e-4 (linear warmup)
- Warmup: 600 steps (~10% of max steps)
- Precision: BF16 mixed precision
- Gradient clipping: 1.0
- Seed: 42
- Logging: Every 50 steps
- Eval: Every 500 steps (20 iters)
- Checkpointing: Every 1000 steps (sharded; full optimizer/state resume)
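The stated hyperparameters map to a standard PyTorch setup roughly as follows; the post-warmup schedule is not specified on this card, so the sketch simply holds the learning rate constant after the 600 warmup steps.

import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, lr=2e-4, warmup_steps=600):
    # AdamW with the stated betas and weight decay.
    opt = torch.optim.AdamW(model.parameters(), lr=lr, betas=(0.9, 0.95), weight_decay=0.01)
    # Linear warmup to the peak LR; constant afterwards (decay schedule not specified).
    sched = LambdaLR(opt, lambda step: min(1.0, (step + 1) / warmup_steps))
    return opt, sched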
Speeds, Sizes, Times
- Per-GPU micro-batch: 4
- Grad accumulation: 8
- World size: 480 GPUs
- Effective global batch: 4 × 8 × 480 = 15,360 samples/step
- Step time (indicative): ~1.5 s/step (cluster/I-O dependent)
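For intuition, the per-step token budget implied by these numbers, assuming fully packed 4,096-token sequences:

micro_batch, grad_accum, world_size, ctx_len = 4, 8, 480, 4096
samples_per_step = micro_batch * grad_accum * world_size  # 15,360 samples/step
tokens_per_step = samples_per_step * ctx_len              # 62,914,560 ≈ 62.9M tokens/step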
Stability & Performance
- Activation checkpointing.
- Fused kernels where available (fused attention/optimizer).
- FlashAttention-2 on H100.
- torch.compile (safe mode) enabled after warmup for stability.
- Dynamic loss scaling to mitigate BF16 overflow.
- Fragmentation mitigations (e.g., max_split_size_mb=512, expandable segments, GC threshold ~0.8).
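A sketch of the allocator mitigations above, expressed via PyTorch's PYTORCH_CUDA_ALLOC_CONF; it must be set before the first CUDA allocation (ideally before importing torch), and the exact values are taken from the list above rather than from released training code.

import os

os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:512,"
    "expandable_segments:True,"
    "garbage_collection_threshold:0.8"
)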
Multi-Node GPU Setup
Cluster: ~60 nodes, each 8ร NVIDIA H100 80GB (total 480 GPUs), ~800 GB RAM/node.
Scheduler: Slurm (priority partition, exclusive allocation, 72-hour limit).
Launch: srun + PyTorch DDP (world size 480; ranks bound via Slurm env).
Storage: Sharded checkpoints; periodic saves for robust resume.
Networking: NCCL over InfiniBand with UCX (NCCL_IB_DISABLE=0, NCCL_IB_HCA="mlx5*", NCCL_SOCKET_IFNAME=<ib0/enoX>, NCCL_BLOCKING_WAIT=1); watchdog ~720 s for fail-fast on fabric issues.
I/O: Async dataset prefetching; pinned FS threads.
Observability: W&B + structured logs (throughput, TFLOPs/GPU, mem, step time).
Reproducibility: Fixed seeds; exact launch scripts/env logged; effective tokens/step reported.
Final checkpoint converted to Hugging Face format for plug-and-play inference.
Evaluation
Testing Data, Factors & Metrics
- Testing data: Standard academic suites (e.g., EleutherAI LM Evaluation Harness).
- Factors: Domain/topic (STEM vs. general), task type (multi-choice vs. open-ended).
- Metrics: Accuracy (MCQ), EM/F1 (QA), plus task-native metrics.
Suggested suite (edit as applicable):
- General knowledge & reasoning: MMLU (STEM subsets), ARC-E/ARC-C, HellaSwag, PIQA, Winogrande
- Math/coding (optional): GSM8K, HumanEval
- Reading comprehension (optional): BoolQ, RACE
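A hedged evaluation sketch with EleutherAI's lm-evaluation-harness (assumes version 0.4+; task names vary across releases, so check the harness documentation for the exact identifiers):

from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=qvac/genesisI-model,dtype=bfloat16,trust_remote_code=True",
    tasks=["arc_easy", "arc_challenge", "hellaswag", "piqa", "winogrande"],
    batch_size=8,
)
print(results["results"])  # per-task metrics (accuracy, normalized accuracy, etc.)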
Results
- To be released with an evaluated checkpoint and harness version pin. Include tables with exact versions, seeds, and commit hashes.
Summary
- Base LM targets broad generalization at ~40B training tokens.
- Expect material gains after SFT + preference optimization for target tasks.
Technical Specifications
Model Architecture and Objective
- Architecture: Qwen3-style decoder-only Transformer
- Parameters: ~1.7B
- Context length: 4,096 tokens
- Positional encoding: Rotary position embeddings (RoPE), per the Qwen3-style design
- Attention: Multi-head scaled dot-product; FlashAttention-2 enabled on H100
- Activation: SiLU (SwiGLU feed-forward), per the Qwen3-style design
- Norms: RMSNorm, per the Qwen3-style design
- Objective: Causal LM (next-token prediction)
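To verify the architectural details above against the released weights, the config can be inspected directly; the field names printed below assume a Qwen3-style configuration.

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("qvac/genesisI-model", trust_remote_code=True)
# e.g., hidden_act, num_hidden_layers, num_attention_heads, max_position_embeddings
print(cfg)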
Compute Infrastructure
Hardware
- 60 nodes ร 8ร H100 80GB, ~800 GB RAM/node, InfiniBand fabric.
Software
- PyTorch ≥ 2.1 (CUDA 12.x), FlashAttention-2, UCX/NCCL
- Slurm for orchestration; W&B for logging
- (Optional) DeepSpeed ZeRO-3 for training; HF conversion post-training
Reproducibility (Launch Sketch)
# Slurm (illustrative)
srun -N 60 -n 480 --ntasks-per-node=8 --gpus-per-task=1 \
  --cpus-per-task=8 --mem=0 \
  bash -lc '
    export NCCL_IB_DISABLE=0
    export NCCL_IB_HCA="mlx5*"
    export NCCL_SOCKET_IFNAME=ib0
    export NCCL_BLOCKING_WAIT=1
    export TORCH_DISTRIBUTED_DEBUG=DETAIL
    python train.py \
      --model qwen3_1p7b_from_scratch \
      --tokenizer qwen3 \
      --data_path /path/to/arrow \
      --context_length 4096 \
      --optimizer adamw --weight_decay 0.01 \
      --lr 2e-4 --warmup_steps 600 \
      --precision bf16-mixed \
      --micro_batch_size 4 \
      --grad_accum_steps 8 \
      --eval_every 500 --log_every 50 \
      --ckpt_every 1000 \
      --activation_checkpointing \
      --flash_attn 2 \
      --compile safe \
      --seed 42
  '
Conversion & Inference
- Checkpoints are HF-compatible: load with AutoModelForCausalLM.
- For memory-limited environments, prefer half-precision or 4/8-bit loading.
- Distribute as safetensors for integrity.
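A minimal export sketch; safe_serialization defaults to True in recent transformers releases, so this mainly matters when re-saving older or fine-tuned checkpoints.

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("qvac/genesisI-model", trust_remote_code=True)
model.save_pretrained("./genesisI-export", safe_serialization=True)  # writes .safetensors shards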
Changelog
- v0.1 (2025-11-17): Initial public release: 40B-token, single-epoch pretrain; HF conversion.