Kisoku 3.2B
Model Overview
Kisoku 3.2B is a transformer-based causal language model trained with Google's MaxText framework on high-quality web text. It employs Grouped Query Attention (GQA) for efficient inference and is optimized for TPU hardware.
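For readers unfamiliar with GQA, the idea is that several query heads share each key/value head, which shrinks the KV cache at inference time. The snippet below is a minimal NumPy illustration using this model's head counts (32 query heads, 8 KV heads, head dimension 96, per the Architecture table below); it is a conceptual sketch, not MaxText's actual attention implementation.

```python
# Conceptual GQA sketch (NumPy), not MaxText's implementation.
import numpy as np

batch, seq = 1, 8
n_q_heads, n_kv_heads, head_dim = 32, 8, 96   # values from the Architecture table
group = n_q_heads // n_kv_heads               # 4 query heads share each KV head

q = np.random.randn(batch, seq, n_q_heads, head_dim)
k = np.random.randn(batch, seq, n_kv_heads, head_dim)   # only 8 KV heads are cached
v = np.random.randn(batch, seq, n_kv_heads, head_dim)

# Broadcast each KV head to its group of query heads
k_rep = np.repeat(k, group, axis=2)           # (1, 8, 32, 96)
v_rep = np.repeat(v, group, axis=2)

scores = np.einsum("bqhd,bkhd->bhqk", q, k_rep) / np.sqrt(head_dim)
causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)   # mask future positions
scores = np.where(causal, -np.inf, scores)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = np.einsum("bhqk,bkhd->bqhd", weights, v_rep)       # (1, 8, 32, 96)
```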
Key Features
- 3.2 billion parameters with efficient GQA architecture
- Trained on DCLM-Baseline 1.0, a curated, high-quality web-text dataset
- 100,000 training steps achieving a final loss of 2.733
- Native TPU optimization with MaxText/JAX framework
- Apache 2.0 licensed for commercial and research use
Model Details
Architecture
| Component | Configuration |
|---|---|
| Model Type | Autoregressive Transformer (GPT-style) |
| Parameters | 3.2 billion |
| Embedding Dimension | 3,072 |
| Attention Heads | 32 query heads, 8 KV heads (GQA) |
| Head Dimension | 96 |
| MLP Hidden Dimension | 8,192 |
| Decoder Layers | 32 |
| Vocabulary Size | 50,304 (GPT-2 tokenizer vocabulary of 50,257, padded to a multiple of 128) |
| Max Sequence Length | 2,048 tokens |
| Activation Function | GeLU |
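As a rough cross-check, the dimensions above can be turned into a back-of-the-envelope parameter count. The sketch below assumes a gated MLP (three projection matrices, as in typical MaxText decoders) and a tied embedding/output head, and it ignores norm and bias parameters; those assumptions are not confirmed by this card, so treat the result as an estimate only.

```python
# Back-of-the-envelope parameter estimate from the Architecture table.
# Assumptions (not confirmed here): gated MLP, tied embeddings, norms/biases ignored.
emb_dim, n_layers = 3072, 32
n_q_heads, n_kv_heads, head_dim = 32, 8, 96
mlp_dim, vocab = 8192, 50_304

attn = emb_dim * n_q_heads * head_dim            # Q projection
attn += 2 * emb_dim * n_kv_heads * head_dim      # K and V projections (GQA: only 8 heads)
attn += n_q_heads * head_dim * emb_dim           # output projection
mlp = 3 * emb_dim * mlp_dim                      # gated MLP: two in-projections + one out
per_layer = attn + mlp

body = n_layers * per_layer                      # ~3.17B, roughly the headline "3.2B"
total = body + vocab * emb_dim                   # ~3.33B including the (tied) embedding
print(f"decoder: {body/1e9:.2f}B, with embedding: {total/1e9:.2f}B")
```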
Training Details
Dataset: DCLM-Baseline 1.0
- High-quality filtered web text from DataComp
- Curated for factuality and coherence
- Primarily English language content
Training Configuration:
- Total Steps: 100,000
- Global Batch Size: 64 sequences (16 per host × 4 hosts)
- Sequence Length: 2,048 tokens
- Learning Rate: 2e-4 (initial)
- Optimizer: AdamW
- Training Duration: ~5 days on TPU v4-32
- Checkpoint Frequency: Every 5,000 steps
- Final Training Loss: 2.733
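Taken together, these settings pin down the total number of training tokens (a simple estimate that ignores any padding or dropped batches):

```python
# Total training tokens implied by the configuration above.
steps = 100_000        # total steps
global_batch = 64      # sequences per step
seq_len = 2_048        # tokens per sequence

total_tokens = steps * global_batch * seq_len
print(f"{total_tokens:,} tokens (~{total_tokens/1e9:.1f}B)")   # 13,107,200,000 (~13.1B)
```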
Hardware & Performance:
- TPU Type: v4-32 (4 hosts, 16 chips / 32 TensorCores)
- Region: us-central2-b (Google Cloud)
- Throughput: ~115 TFLOP/s per device
- Tokens/Second: ~5,400 per device
- Training Framework: MaxText (JAX/Flax)
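The throughput figures are roughly self-consistent. Using the common ~6 FLOPs per parameter per token rule of thumb (a standard approximation, not a measurement from this run), ~5,400 tokens/s per device at 3.2B parameters works out to about 10^14 FLOP/s, in the same range as the reported ~115 TFLOP/s (which additionally counts attention FLOPs), and roughly 38% of TPU v4's 275 TFLOP/s bf16 per-chip peak:

```python
# Rough throughput/utilization check (estimates, not measurements).
params = 3.2e9
tokens_per_sec = 5_400                       # per device, from the figures above
flops_per_token = 6 * params                 # ~6N approximation for forward + backward

achieved = tokens_per_sec * flops_per_token  # ~1.0e14 FLOP/s
peak_v4_bf16 = 275e12                        # TPU v4 per-chip bf16 peak
print(f"~{achieved/1e12:.0f} TFLOP/s per device, ~{achieved/peak_v4_bf16:.0%} of peak")
```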
Usage
Loading with MaxText
```bash
# Clone MaxText
git clone https://github.com/google/maxtext.git
cd maxtext

# Run inference
python MaxText/decode.py \
  MaxText/configs/base.yml \
  load_parameters_path=gs://your-bucket/kisoku-3.2b/checkpoints/99999/items \
  base_emb_dim=3072 \
  base_num_query_heads=32 \
  base_num_kv_heads=8 \
  base_mlp_dim=8192 \
  base_num_decoder_layers=32 \
  head_dim=96 \
  vocab_size=50304 \
  tokenizer_path=gpt2 \
  max_target_length=2048 \
  prompt="Your prompt here"
```
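Note that the key=value flags override defaults in MaxText's base.yml; they should match the Architecture table above so that the restored checkpoint shapes line up.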
Conversion to HuggingFace Format
A conversion script for transformers-compatible format is coming soon.
Limitations and Ethical Considerations
Known Limitations
- Base model only: This model has not been instruction-tuned or aligned with human preferences
- May generate harmful content: Without safety fine-tuning, the model may produce biased, toxic, or factually incorrect text
- English-centric: Primarily trained on English text with limited multilingual capability
- Context window: Limited to 2,048 tokens
- Not production-ready: Requires fine-tuning and safety evaluation before deployment
Recommended Use Cases
✅ Appropriate Uses:
- Research on language model behavior and capabilities
- Fine-tuning for specific downstream tasks
- Educational purposes and ML experimentation
- Building aligned models with additional training
❌ Not Recommended:
- Direct deployment in user-facing applications without fine-tuning
- Use cases requiring factual accuracy without verification
- Applications involving sensitive content or high-stakes decisions
- Scenarios where harmful outputs could cause real-world harm
Bias and Fairness
This model was trained on web-scraped data and may reflect biases present in the training corpus. Users should evaluate the model for bias and fairness issues specific to their use case before deployment.
Training Infrastructure
This model was trained using resources from the TPU Research Cloud (TRC) program, which provides free TPU access to researchers.
- Cloud Provider: Google Cloud Platform
- TPU Type: v4-32 (16 chips / 32 TensorCores across 4 hosts)
- Framework: MaxText (JAX/Flax)
- Region: us-central2-b
Citation
If you use this model in your research, please cite:
```bibtex
@software{kisoku2025,
  title  = {Kisoku: A 3.2B Parameter Language Model},
  author = {Rodriguez, Joseph},
  year   = {2025},
  url    = {https://huggingface.co/0arch-io/kisoku-3.2b-base},
  note   = {Trained using Google TRC program}
}
```
Acknowledgments
- Google TRC Program for providing TPU compute resources
- Google MaxText Team for the training framework
- DataComp Team for the DCLM-Baseline 1.0 dataset
- Open source community for tools and libraries
Model Card Contact
- Maintainer: Joseph Rodriguez
- Email: [email protected]
- Organization: 0ARCH
For questions, issues, or collaboration inquiries, please reach out via email or open an issue on the model repository.
Trained with MaxText • Powered by Google Cloud TPUs