---
language:
- en
license: apache-2.0
tags:
- text-generation
- transformer
- gpt
- maxtext
- language-model
- causal-lm
datasets:
- mlfoundations/dclm-baseline-1.0-parquet
model_type: gpt
pipeline_tag: text-generation
---
# Kisoku 3.2B
**A 3.2 billion parameter GPT-style language model trained from scratch on Google Cloud TPU v4-32**
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0) · [Model on Hugging Face](https://huggingface.co/0arch-io/kisoku-3.2b-base) · [Google Cloud TPU](https://cloud.google.com/tpu)
## Model Overview
Kisoku 3.2B is a transformer-based causal language model trained on high-quality web text using Google's MaxText framework. The model employs Grouped Query Attention (GQA) for efficient inference and is optimized for TPU hardware.
### Key Features
- **3.2 billion parameters** with efficient GQA architecture
- **Trained on DCLM-Baseline 1.0** - curated, high-quality web text
- **100,000 training steps** achieving a final loss of 2.733
- **Native TPU optimization** with MaxText/JAX framework
- **Apache 2.0 licensed** for commercial and research use
## Model Details
### Architecture
| Component | Configuration |
|-----------|--------------|
| **Model Type** | Autoregressive Transformer (GPT-style) |
| **Parameters** | 3.2 billion |
| **Embedding Dimension** | 3,072 |
| **Attention Heads** | 32 query heads, 8 KV heads (GQA) |
| **Head Dimension** | 96 |
| **MLP Hidden Dimension** | 8,192 |
| **Decoder Layers** | 32 |
| **Vocabulary Size** | 50,304 (GPT-2 tokenizer) |
| **Max Sequence Length** | 2,048 tokens |
| **Activation Function** | GeLU |
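
In this layout every group of four query heads shares one key/value head (32 query heads ÷ 8 KV heads). The snippet below is a minimal NumPy illustration of that grouping, not MaxText's actual attention code; it only shows how the KV heads are broadcast across query groups and why the KV cache shrinks by 4×.

```python
import numpy as np

# Head layout from the architecture table: 32 query heads, 8 KV heads, head_dim 96,
# so each KV head serves a group of 4 query heads.
N_Q_HEADS, N_KV_HEADS, HEAD_DIM = 32, 8, 96
GROUP_SIZE = N_Q_HEADS // N_KV_HEADS  # 4

def gqa_attention(q, k, v):
    """q: [seq, 32, 96]; k, v: [seq, 8, 96]. Returns [seq, 32, 96]."""
    seq_len = q.shape[0]
    # Broadcast each KV head across its group of query heads.
    k = np.repeat(k, GROUP_SIZE, axis=1)                    # [seq, 32, 96]
    v = np.repeat(v, GROUP_SIZE, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(HEAD_DIM)
    # Causal mask: each position attends only to itself and earlier positions.
    causal = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(causal, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)

seq = 16  # tiny sequence for the demo; the real model supports up to 2,048 tokens
q = np.random.randn(seq, N_Q_HEADS, HEAD_DIM)
k = np.random.randn(seq, N_KV_HEADS, HEAD_DIM)
v = np.random.randn(seq, N_KV_HEADS, HEAD_DIM)
print(gqa_attention(q, k, v).shape)  # (16, 32, 96)
# The KV cache stores 8 * 96 values per token per layer instead of 32 * 96,
# a 4x reduction relative to standard multi-head attention.
```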
### Training Details
**Dataset**: [DCLM-Baseline 1.0](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet)
- High-quality filtered web text from DataComp
- Curated for factuality and coherence
- Primarily English language content
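
To inspect the corpus yourself, the parquet release can be streamed with the Hugging Face `datasets` library. This is a minimal sketch; the `train` split and `text` column names are assumptions about the dataset schema, so check the dataset page if they differ.

```python
from datasets import load_dataset

# Stream the corpus rather than downloading it in full (it is very large).
# NOTE: the "train" split and "text" column are assumed here, not guaranteed
# by this model card.
ds = load_dataset(
    "mlfoundations/dclm-baseline-1.0-parquet",
    split="train",
    streaming=True,
)

for i, example in enumerate(ds):
    print(example["text"][:200])  # first 200 characters of each document
    if i == 2:
        break
```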
**Training Configuration**:
- **Total Steps**: 100,000
- **Global Batch Size**: 64 sequences (16 per host × 4 hosts)
- **Sequence Length**: 2,048 tokens
- **Learning Rate**: 2e-4 (initial)
- **Optimizer**: AdamW
- **Training Duration**: ~5 days on TPU v4-32
- **Checkpoint Frequency**: Every 5,000 steps
- **Final Training Loss**: 2.733
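
For context, this configuration implies a training budget of roughly 13 billion tokens; the calculation below is plain arithmetic on the reported numbers, not a figure taken from the training logs.

```python
# Rough token budget implied by the training configuration above.
steps = 100_000
global_batch = 64   # sequences per step
seq_len = 2_048     # tokens per sequence

total_tokens = steps * global_batch * seq_len
print(f"{total_tokens / 1e9:.1f}B tokens")  # ~13.1B tokens
```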
**Hardware & Performance**:
- **TPU Type**: v4-32 (4 hosts, 32 chips total)
- **Region**: us-central2-b (Google Cloud)
- **Throughput**: ~115 TFLOP/s per device
- **Tokens/Second**: ~5,400 per device
- **Training Framework**: MaxText (JAX/Flax)
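
As a rough utilization check, the reported ~115 TFLOP/s per chip can be compared against the roughly 275 TFLOP/s bf16 peak that Google publishes for a TPU v4 chip; the peak figure is an external spec, not something measured in this run.

```python
# Approximate hardware utilization from the reported throughput.
# 275 TFLOP/s bf16 peak per TPU v4 chip is Google's published spec;
# the achieved figure comes from the table above.
achieved_tflops = 115.0
peak_tflops_v4_bf16 = 275.0

utilization = achieved_tflops / peak_tflops_v4_bf16
print(f"Approximate utilization: {utilization:.0%}")  # ~42%
```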
## Usage
### Loading with MaxText
```bash
# Clone MaxText
git clone https://github.com/google/maxtext.git
cd maxtext

# Run inference with the Kisoku architecture settings
python MaxText/decode.py \
  MaxText/configs/base.yml \
  load_parameters_path=gs://your-bucket/kisoku-3.2b/checkpoints/99999/items \
  base_emb_dim=3072 \
  base_num_query_heads=32 \
  base_num_kv_heads=8 \
  base_mlp_dim=8192 \
  base_num_decoder_layers=32 \
  head_dim=96 \
  vocab_size=50304 \
  tokenizer_path=gpt2 \
  max_target_length=2048 \
  prompt="Your prompt here"
```
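
Because the model uses the GPT-2 tokenizer and a 2,048-token context, it is worth measuring prompt length before calling `decode.py`. The sketch below uses the `transformers` tokenizer purely for counting and truncation; it does not interact with MaxText itself, and the 256-token generation headroom is an arbitrary choice.

```python
from transformers import AutoTokenizer

# GPT-2 tokenizer, as listed in the architecture table (the 50,304 vocab size
# is likely the GPT-2 vocabulary padded for hardware efficiency).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

MAX_CONTEXT = 2048
prompt = "Your prompt here"

token_ids = tokenizer.encode(prompt)
print(f"Prompt length: {len(token_ids)} tokens")

if len(token_ids) >= MAX_CONTEXT:
    # Truncate from the left, leaving some room for generated tokens.
    token_ids = token_ids[-(MAX_CONTEXT - 256):]
    prompt = tokenizer.decode(token_ids)
    print("Prompt truncated to fit the 2,048-token context window.")
```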
### Conversion to HuggingFace Format
A script to convert the checkpoint to a Hugging Face `transformers`-compatible format is planned but not yet available.
## Limitations and Ethical Considerations
### Known Limitations
- **Base model only**: This model has not been instruction-tuned or aligned with human preferences
- **May generate harmful content**: Without safety fine-tuning, the model may produce biased, toxic, or factually incorrect text
- **English-centric**: Primarily trained on English text with limited multilingual capability
- **Context window**: Limited to 2,048 tokens
- **Not production-ready**: Requires fine-tuning and safety evaluation before deployment
### Recommended Use Cases
✅ **Appropriate Uses**:
- Research on language model behavior and capabilities
- Fine-tuning for specific downstream tasks
- Educational purposes and ML experimentation
- Building aligned models with additional training
❌ **Not Recommended**:
- Direct deployment in user-facing applications without fine-tuning
- Use cases requiring factual accuracy without verification
- Applications involving sensitive content or high-stakes decisions
- Scenarios where harmful outputs could cause real-world harm
### Bias and Fairness
This model was trained on web-scraped data and may reflect biases present in the training corpus. Users should evaluate the model for bias and fairness issues specific to their use case before deployment.
## Training Infrastructure
This model was trained using resources from the [TRC (TPU Research Cloud)](https://sites.research.google/trc/) program, which provides free TPU access to researchers.
- **Cloud Provider**: Google Cloud Platform
- **TPU Type**: v4-32 (32 chips across 4 hosts)
- **Framework**: [MaxText](https://github.com/google/maxtext) (JAX/Flax)
- **Region**: us-central2-b
## Citation
If you use this model in your research, please cite:
```bibtex
@software{kisoku2025,
  title={Kisoku: A 3.2B Parameter Language Model},
  author={Rodriguez, Joseph},
  year={2025},
  url={https://huggingface.co/0arch-io/kisoku-3.2b-base},
  note={Trained using Google TRC program}
}
```
## Acknowledgments
- **Google TRC Program** for providing TPU compute resources
- **Google MaxText Team** for the training framework
- **DataComp Team** for the DCLM-Baseline 1.0 dataset
- **Open source community** for tools and libraries
## Model Card Contact
- **Maintainer**: Joseph Rodriguez
- **Email**: contact@0arch.io
- **Organization**: 0ARCH
For questions, issues, or collaboration inquiries, please reach out via email or open an issue on the model repository.
---
*Trained with [MaxText](https://github.com/google/maxtext) • Powered by [Google Cloud TPUs](https://cloud.google.com/tpu)*