---
language:
- en
license: apache-2.0
tags:
- text-generation
- transformer
- gpt
- maxtext
- language-model
- causal-lm
datasets:
- mlfoundations/dclm-baseline-1.0-parquet
model_type: gpt
pipeline_tag: text-generation
---

# Kisoku 3.2B
**A 3.2 billion parameter GPT-style language model trained from scratch on Google Cloud TPU v4-32**

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Model Size](https://img.shields.io/badge/Parameters-3.2B-green.svg)](https://huggingface.co/0arch-io/kisoku-3.2b-base)
[![Training](https://img.shields.io/badge/Training-TPU%20v4--32-orange.svg)](https://cloud.google.com/tpu)
## Model Overview

Kisoku 3.2B is a transformer-based causal language model trained on high-quality web text using Google's MaxText framework. The model employs Grouped Query Attention (GQA) for efficient inference and is optimized for TPU hardware.

### Key Features

- **3.2 billion parameters** with an efficient GQA architecture
- **Trained on DCLM-Baseline 1.0** - curated, high-quality web text
- **100,000 training steps** reaching a final training loss of 2.733
- **Native TPU optimization** with the MaxText/JAX framework
- **Apache 2.0 licensed** for commercial and research use

## Model Details

### Architecture

| Component | Configuration |
|-----------|--------------|
| **Model Type** | Autoregressive Transformer (GPT-style) |
| **Parameters** | 3.2 billion |
| **Embedding Dimension** | 3,072 |
| **Attention Heads** | 32 query heads, 8 KV heads (GQA) |
| **Head Dimension** | 96 |
| **MLP Hidden Dimension** | 8,192 |
| **Decoder Layers** | 32 |
| **Vocabulary Size** | 50,304 (GPT-2 tokenizer) |
| **Max Sequence Length** | 2,048 tokens |
| **Activation Function** | GeLU |

### Training Details

**Dataset**: [DCLM-Baseline 1.0](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet)

- High-quality filtered web text from DataComp
- Curated for factuality and coherence
- Primarily English language content

**Training Configuration**:

- **Total Steps**: 100,000
- **Global Batch Size**: 64 sequences (16 per host × 4 hosts)
- **Sequence Length**: 2,048 tokens
- **Learning Rate**: 2e-4 (initial)
- **Optimizer**: AdamW
- **Training Duration**: ~5 days on TPU v4-32
- **Checkpoint Frequency**: Every 5,000 steps
- **Final Training Loss**: 2.733

**Hardware & Performance**:

- **TPU Type**: v4-32 (4 hosts, 32 chips total)
- **Region**: us-central2-b (Google Cloud)
- **Throughput**: ~115 TFLOP/s per device
- **Tokens/Second**: ~5,400 per device
- **Training Framework**: MaxText (JAX/Flax)

## Usage

### Loading with MaxText

```bash
# Clone MaxText
git clone https://github.com/google/maxtext.git
cd maxtext

# Run inference
python MaxText/decode.py \
  MaxText/configs/base.yml \
  load_parameters_path=gs://your-bucket/kisoku-3.2b/checkpoints/99999/items \
  base_emb_dim=3072 \
  base_num_query_heads=32 \
  base_num_kv_heads=8 \
  base_mlp_dim=8192 \
  base_num_decoder_layers=32 \
  head_dim=96 \
  vocab_size=50304 \
  tokenizer_path=gpt2 \
  max_target_length=2048 \
  prompt="Your prompt here"
```

### Conversion to HuggingFace Format

A conversion script for a transformers-compatible format is coming soon.
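Until that script is available, there is no official `transformers` path. For reference only, loading a converted checkpoint would look roughly like the standard pattern below. This is a sketch, not a supported workflow: it assumes transformers-compatible weights have been published under the same repository ID, which is not yet the case.

```python
# Sketch only: assumes a transformers-compatible checkpoint exists at this
# repository ID, which is NOT yet published. Requires: pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "0arch-io/kisoku-3.2b-base"  # hypothetical location of converted weights

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate a short continuation from a prompt
inputs = tokenizer("The history of machine learning", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```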
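When budgeting inference memory for either path, the GQA configuration from the architecture table gives a rough sense of KV-cache size. The sketch below is back-of-the-envelope arithmetic using only numbers from that table; 16-bit cache storage is an assumption.

```python
# Back-of-the-envelope KV-cache comparison for the architecture above.
# Dimensions come from the model card table; fp16/bf16 (2 bytes per value)
# cache storage is an assumption.
num_layers = 32
head_dim = 96
seq_len = 2048
bytes_per_value = 2

def kv_cache_bytes(num_kv_heads: int) -> int:
    # Two tensors (K and V) per layer, each of shape [seq_len, num_kv_heads, head_dim]
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

gqa = kv_cache_bytes(8)   # 8 KV heads, as configured (GQA)
mha = kv_cache_bytes(32)  # 32 KV heads, i.e. full multi-head attention
print(f"GQA KV cache: {gqa / 2**20:.0f} MiB per 2,048-token sequence")
print(f"MHA KV cache: {mha / 2**20:.0f} MiB per 2,048-token sequence")
print(f"Reduction: {mha / gqa:.0f}x")
```

With 8 KV heads instead of 32, the cache for a full 2,048-token context is roughly a quarter of what standard multi-head attention would require, which is the efficiency benefit GQA targets.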
## Limitations and Ethical Considerations

### Known Limitations

- **Base model only**: This model has not been instruction-tuned or aligned with human preferences
- **May generate harmful content**: Without safety fine-tuning, the model may produce biased, toxic, or factually incorrect text
- **English-centric**: Primarily trained on English text with limited multilingual capability
- **Context window**: Limited to 2,048 tokens
- **Not production-ready**: Requires fine-tuning and safety evaluation before deployment

### Recommended Use Cases

✅ **Appropriate Uses**:

- Research on language model behavior and capabilities
- Fine-tuning for specific downstream tasks
- Educational purposes and ML experimentation
- Building aligned models with additional training

❌ **Not Recommended**:

- Direct deployment in user-facing applications without fine-tuning
- Use cases requiring factual accuracy without verification
- Applications involving sensitive content or high-stakes decisions
- Scenarios where harmful outputs could cause real-world harm

### Bias and Fairness

This model was trained on web-scraped data and may reflect biases present in the training corpus. Users should evaluate the model for bias and fairness issues specific to their use case before deployment.

## Training Infrastructure

This model was trained using resources from the [TPU Research Cloud (TRC)](https://sites.research.google/trc/) program, which provides free TPU access to researchers.

- **Cloud Provider**: Google Cloud Platform
- **TPU Type**: v4-32 (32 chips across 4 hosts)
- **Framework**: [MaxText](https://github.com/google/maxtext) (JAX/Flax)
- **Region**: us-central2-b

## Citation

If you use this model in your research, please cite:

```bibtex
@software{kisoku2025,
  title={Kisoku: A 3.2B Parameter Language Model},
  author={Rodriguez, Joseph},
  year={2025},
  url={https://huggingface.co/0arch-io/kisoku-3.2b-base},
  note={Trained using Google TRC program}
}
```

## Acknowledgments

- **Google TRC Program** for providing TPU compute resources
- **Google MaxText Team** for the training framework
- **DataComp Team** for the DCLM-Baseline 1.0 dataset
- **Open source community** for tools and libraries

## Model Card Contact

**Maintainer**: Joseph Rodriguez
**Email**: contact@0arch.io
**Organization**: 0ARCH

For questions, issues, or collaboration inquiries, please reach out via email or open an issue on the model repository.

---
*Trained with [MaxText](https://github.com/google/maxtext) • Powered by [Google Cloud TPUs](https://cloud.google.com/tpu)*