Update README.md

#4
by axay - opened
Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -16,7 +16,7 @@ pipeline_tag: text-generation
  - **Pretrained on the Largest Synthetic Educational Dataset**
  This model has been **pretrained on Tether's QVAC Genesis I**, the largest synthetic dataset released for educational LLM pre-training.

- The model was trained **from scratch** on approximately **41B tokens** of multi-domain educational text, using **BF16 mixed precision** and a **4,096-token context window**. Training used a **Qwen3-family 1.7B-parameter decoder-only transformer** architecture.
+ The model was trained **from scratch** on approximately **40B tokens** of multi-domain educational text, using **BF16 mixed precision** and a **4,096-token context window**. Training used a **Qwen3-family 1.7B-parameter decoder-only transformer** architecture.

  Checkpoints are provided in standard Hugging Face format for easy inference, continual pre-training, and fine-tuning.

@@ -55,7 +55,7 @@ abilities
  - **Finetuned from model:** **None (trained from scratch)**
  - **Intended stage:** **Base pre-trained model** (no SFT / RLHF alignment)

- ### Model Sources
+ ### Dataset Details

  - **Repository:** https://huggingface.co/qvac/genesisI-model
  - **Paper / Blog:** https://huggingface.co/blog/qvac/genesis-i
@@ -70,7 +70,7 @@ abilities
  - Research baseline for scaling, data ablations, or tokenizer studies.

  ### Downstream Use (recommended)
-
+ - **CPT** (continued pre-training) on additional tokens.
  - **SFT** for assistants, domain experts, or task-specific models.
  - **Preference optimization / RLHF** for safer, more helpful behavior.
  - **Adapters/LoRA** for efficient domain specialization.
@@ -94,6 +94,7 @@ abilities
  ### Recommendations

  - Disclose limitations to downstream users.
+ - Research model: not intended for production use.

  ---
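The README states that checkpoints ship in standard Hugging Face format for easy inference. As a minimal sketch, assuming the repository id `qvac/genesisI-model` loads through the standard `AutoModelForCausalLM` and `AutoTokenizer` classes and that, being a base model with no SFT/RLHF alignment, it is prompted with plain text completion rather than a chat template:

```python
# Minimal inference sketch (assumption: the checkpoint loads via the standard
# transformers Auto* classes; this is not an official usage example).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "qvac/genesisI-model"  # repository listed in the README
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # mirrors the BF16 mixed-precision training setting
    device_map="auto",
)

# Base model: plain text completion, no chat template.
prompt = "Photosynthesis is the process by which"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The 4,096-token context window described in the training details bounds the combined length of prompt and generated text.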
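The downstream-use list recommends adapters/LoRA for efficient domain specialization. A minimal sketch with the `peft` library, assuming Qwen-style attention projection names (`q_proj`, `k_proj`, `v_proj`, `o_proj`); the target module names should be verified against the released checkpoint:

```python
# Minimal LoRA sketch (assumption: Qwen-style projection names; adjust
# target_modules after inspecting the released checkpoint).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "qvac/genesisI-model", torch_dtype=torch.bfloat16
)

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapter weights are trainable
```

The base weights stay frozen and only the low-rank adapter matrices are trained, which is what makes this route cheap for domain specialization; the wrapped model can then be trained with any standard causal-LM loop or the `transformers` Trainer.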