---
language:
- en
license: apache-2.0
tags:
- text-generation
- transformer
- gpt
- maxtext
- language-model
- causal-lm
datasets:
- mlfoundations/dclm-baseline-1.0-parquet
model_type: gpt
pipeline_tag: text-generation
---

# Kisoku 3.2B
**A 3.2 billion parameter GPT-style language model trained from scratch on Google Cloud TPU v4-32**

[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Model Size](https://img.shields.io/badge/Parameters-3.2B-green.svg)](https://huggingface.co/0arch-io/kisoku-3.2b-base)
[![Training](https://img.shields.io/badge/Training-TPU%20v4--32-orange.svg)](https://cloud.google.com/tpu)
## Model Overview

Kisoku 3.2B is a transformer-based causal language model trained on high-quality web text using Google's MaxText framework. The model employs Grouped Query Attention (GQA) for efficient inference and is optimized for TPU hardware.

### Key Features

- **3.2 billion parameters** with an efficient GQA architecture
- **Trained on DCLM-Baseline 1.0** - curated, high-quality web text
- **100,000 training steps** reaching a final training loss of 2.733
- **Native TPU optimization** with the MaxText/JAX framework
- **Apache 2.0 licensed** for commercial and research use

## Model Details

### Architecture

| Component | Configuration |
|-----------|--------------|
| **Model Type** | Autoregressive Transformer (GPT-style) |
| **Parameters** | 3.2 billion |
| **Embedding Dimension** | 3,072 |
| **Attention Heads** | 32 query heads, 8 KV heads (GQA) |
| **Head Dimension** | 96 |
| **MLP Hidden Dimension** | 8,192 |
| **Decoder Layers** | 32 |
| **Vocabulary Size** | 50,304 (GPT-2 tokenizer) |
| **Max Sequence Length** | 2,048 tokens |
| **Activation Function** | GeLU |

### Training Details

**Dataset**: [DCLM-Baseline 1.0](https://huggingface.co/datasets/mlfoundations/dclm-baseline-1.0-parquet)

- High-quality filtered web text from DataComp
- Curated for factuality and coherence
- Primarily English language content

**Training Configuration**:

- **Total Steps**: 100,000
- **Global Batch Size**: 64 sequences (16 per host × 4 hosts)
- **Sequence Length**: 2,048 tokens
- **Learning Rate**: 2e-4 (initial)
- **Optimizer**: AdamW
- **Training Duration**: ~5 days on TPU v4-32
- **Checkpoint Frequency**: Every 5,000 steps
- **Final Training Loss**: 2.733

**Hardware & Performance**:

- **TPU Type**: v4-32 (4 hosts, 32 chips total)
- **Region**: us-central2-b (Google Cloud)
- **Throughput**: ~115 TFLOP/s per device
- **Tokens/Second**: ~5,400 per device
- **Training Framework**: MaxText (JAX/Flax)

## Usage

### Loading with MaxText

```bash
# Clone MaxText
git clone https://github.com/google/maxtext.git
cd maxtext

# Run inference
python MaxText/decode.py \
  MaxText/configs/base.yml \
  load_parameters_path=gs://your-bucket/kisoku-3.2b/checkpoints/99999/items \
  base_emb_dim=3072 \
  base_num_query_heads=32 \
  base_num_kv_heads=8 \
  base_mlp_dim=8192 \
  base_num_decoder_layers=32 \
  head_dim=96 \
  vocab_size=50304 \
  tokenizer_path=gpt2 \
  max_target_length=2048 \
  prompt="Your prompt here"
```

### Conversion to HuggingFace Format

A conversion script for a transformers-compatible format is coming soon.
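Until that script is available, there is no official `transformers` path. For reference only, loading a converted checkpoint would look roughly like the standard pattern below. This is a sketch, not a supported workflow: it assumes transformers-compatible weights have been published under the same repository ID, which is not yet the case.

```python
# Sketch only: assumes a transformers-compatible checkpoint exists at this
# repository ID, which is NOT yet published. Requires: pip install transformers torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "0arch-io/kisoku-3.2b-base"  # hypothetical location of converted weights

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

# Generate a short continuation from a prompt
inputs = tokenizer("The history of machine learning", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```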
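When budgeting inference memory for either path, the GQA configuration from the architecture table gives a rough sense of KV-cache size. The sketch below is back-of-the-envelope arithmetic using only numbers from that table; 16-bit cache storage is an assumption.

```python
# Back-of-the-envelope KV-cache comparison for the architecture above.
# Dimensions come from the model card table; fp16/bf16 (2 bytes per value)
# cache storage is an assumption.
num_layers = 32
head_dim = 96
seq_len = 2048
bytes_per_value = 2

def kv_cache_bytes(num_kv_heads: int) -> int:
    # Two tensors (K and V) per layer, each of shape [seq_len, num_kv_heads, head_dim]
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_value

gqa = kv_cache_bytes(8)   # 8 KV heads, as configured (GQA)
mha = kv_cache_bytes(32)  # 32 KV heads, i.e. full multi-head attention
print(f"GQA KV cache: {gqa / 2**20:.0f} MiB per 2,048-token sequence")
print(f"MHA KV cache: {mha / 2**20:.0f} MiB per 2,048-token sequence")
print(f"Reduction: {mha / gqa:.0f}x")
```

With 8 KV heads instead of 32, the cache for a full 2,048-token context is roughly a quarter of what standard multi-head attention would require, which is the efficiency benefit GQA targets.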
## Limitations and Ethical Considerations

### Known Limitations

- **Base model only**: This model has not been instruction-tuned or aligned with human preferences
- **May generate harmful content**: Without safety fine-tuning, the model may produce biased, toxic, or factually incorrect text
- **English-centric**: Primarily trained on English text with limited multilingual capability
- **Context window**: Limited to 2,048 tokens
- **Not production-ready**: Requires fine-tuning and safety evaluation before deployment

### Recommended Use Cases

✅ **Appropriate Uses**:

- Research on language model behavior and capabilities
- Fine-tuning for specific downstream tasks
- Educational purposes and ML experimentation
- Building aligned models with additional training

❌ **Not Recommended**:

- Direct deployment in user-facing applications without fine-tuning
- Use cases requiring factual accuracy without verification
- Applications involving sensitive content or high-stakes decisions
- Scenarios where harmful outputs could cause real-world harm

### Bias and Fairness

This model was trained on web-scraped data and may reflect biases present in the training corpus. Users should evaluate the model for bias and fairness issues specific to their use case before deployment.

## Training Infrastructure

This model was trained using resources from the [TPU Research Cloud (TRC)](https://sites.research.google/trc/) program, which provides free TPU access to researchers.

- **Cloud Provider**: Google Cloud Platform
- **TPU Type**: v4-32 (32 chips across 4 hosts)
- **Framework**: [MaxText](https://github.com/google/maxtext) (JAX/Flax)
- **Region**: us-central2-b

## Citation

If you use this model in your research, please cite:

```bibtex
@software{kisoku2025,
  title={Kisoku: A 3.2B Parameter Language Model},
  author={Rodriguez, Joseph},
  year={2025},
  url={https://huggingface.co/0arch-io/kisoku-3.2b-base},
  note={Trained using Google TRC program}
}
```

## Acknowledgments

- **Google TRC Program** for providing TPU compute resources
- **Google MaxText Team** for the training framework
- **DataComp Team** for the DCLM-Baseline 1.0 dataset
- **Open source community** for tools and libraries

## Model Card Contact

**Maintainer**: Joseph Rodriguez
**Email**: contact@0arch.io
**Organization**: 0ARCH

For questions, issues, or collaboration inquiries, please reach out via email or open an issue on the model repository.

---
*Trained with [MaxText](https://github.com/google/maxtext) • Powered by [Google Cloud TPUs](https://cloud.google.com/tpu)*