Kisoku 3.2B
Model Overview
Kisoku 3.2B is a transformer-based causal language model trained with Google's MaxText framework on high-quality web text. It employs Grouped Query Attention (GQA) for efficient inference and is optimized for TPU hardware.
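For readers unfamiliar with GQA, the idea is that several query heads share each key/value head, which shrinks the KV cache at inference time. The snippet below is a minimal NumPy illustration using this model's head counts (32 query heads, 8 KV heads, head dimension 96, per the Architecture table below); it is a conceptual sketch, not MaxText's actual attention implementation.

```python
# Conceptual GQA sketch (NumPy), not MaxText's implementation.
import numpy as np

batch, seq = 1, 8
n_q_heads, n_kv_heads, head_dim = 32, 8, 96   # values from the Architecture table
group = n_q_heads // n_kv_heads               # 4 query heads share each KV head

q = np.random.randn(batch, seq, n_q_heads, head_dim)
k = np.random.randn(batch, seq, n_kv_heads, head_dim)   # only 8 KV heads are cached
v = np.random.randn(batch, seq, n_kv_heads, head_dim)

# Broadcast each KV head to its group of query heads
k_rep = np.repeat(k, group, axis=2)           # (1, 8, 32, 96)
v_rep = np.repeat(v, group, axis=2)

scores = np.einsum("bqhd,bkhd->bhqk", q, k_rep) / np.sqrt(head_dim)
causal = np.triu(np.ones((seq, seq), dtype=bool), k=1)   # mask future positions
scores = np.where(causal, -np.inf, scores)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = np.einsum("bhqk,bkhd->bqhd", weights, v_rep)       # (1, 8, 32, 96)
```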
Key Features
- 3.2 billion parameters with efficient GQA architecture
- Trained on DCLM-Baseline 1.0, a curated, high-quality web-text dataset
- 100,000 training steps achieving a final loss of 2.733
- Native TPU optimization with MaxText/JAX framework
- Apache 2.0 licensed for commercial and research use
Model Details
Architecture
| Component | Configuration |
|---|---|
| Model Type | Autoregressive Transformer (GPT-style) |
| Parameters | 3.2 billion |
| Embedding Dimension | 3,072 |
| Attention Heads | 32 query heads, 8 KV heads (GQA) |
| Head Dimension | 96 |
| MLP Hidden Dimension | 8,192 |
| Decoder Layers | 32 |
| Vocabulary Size | 50,304 (GPT-2 tokenizer vocabulary of 50,257, padded to a multiple of 128) |
| Max Sequence Length | 2,048 tokens |
| Activation Function | GeLU |
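As a rough cross-check, the dimensions above can be turned into a back-of-the-envelope parameter count. The sketch below assumes a gated MLP (three projection matrices, as in typical MaxText decoders) and a tied embedding/output head, and it ignores norm and bias parameters; those assumptions are not confirmed by this card, so treat the result as an estimate only.

```python
# Back-of-the-envelope parameter estimate from the Architecture table.
# Assumptions (not confirmed here): gated MLP, tied embeddings, norms/biases ignored.
emb_dim, n_layers = 3072, 32
n_q_heads, n_kv_heads, head_dim = 32, 8, 96
mlp_dim, vocab = 8192, 50_304

attn = emb_dim * n_q_heads * head_dim            # Q projection
attn += 2 * emb_dim * n_kv_heads * head_dim      # K and V projections (GQA: only 8 heads)
attn += n_q_heads * head_dim * emb_dim           # output projection
mlp = 3 * emb_dim * mlp_dim                      # gated MLP: two in-projections + one out
per_layer = attn + mlp

body = n_layers * per_layer                      # ~3.17B, roughly the headline "3.2B"
total = body + vocab * emb_dim                   # ~3.33B including the (tied) embedding
print(f"decoder: {body/1e9:.2f}B, with embedding: {total/1e9:.2f}B")
```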
Training Details
Dataset: DCLM-Baseline 1.0
- High-quality filtered web text from DataComp
- Curated for factuality and coherence
- Primarily English language content
Training Configuration:
- Total Steps: 100,000
- Global Batch Size: 64 sequences (16 per host × 4 hosts)
- Sequence Length: 2,048 tokens
- Learning Rate: 2e-4 (initial)
- Optimizer: AdamW
- Training Duration: ~5 days on TPU v4-32
- Checkpoint Frequency: Every 5,000 steps
- Final Training Loss: 2.733
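Taken together, these settings pin down the total number of training tokens (a simple estimate that ignores any padding or dropped batches):

```python
# Total training tokens implied by the configuration above.
steps = 100_000        # total steps
global_batch = 64      # sequences per step
seq_len = 2_048        # tokens per sequence

total_tokens = steps * global_batch * seq_len
print(f"{total_tokens:,} tokens (~{total_tokens/1e9:.1f}B)")   # 13,107,200,000 (~13.1B)
```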
Hardware & Performance:
- TPU Type: v4-32 (4 hosts, 16 chips / 32 TensorCores)
- Region: us-central2-b (Google Cloud)
- Throughput: ~115 TFLOP/s per device
- Tokens/Second: ~5,400 per device
- Training Framework: MaxText (JAX/Flax)
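The throughput figures are roughly self-consistent. Using the common ~6 FLOPs per parameter per token rule of thumb (a standard approximation, not a measurement from this run), ~5,400 tokens/s per device at 3.2B parameters works out to about 10^14 FLOP/s, in the same range as the reported ~115 TFLOP/s (which additionally counts attention FLOPs), and roughly 38% of TPU v4's 275 TFLOP/s bf16 per-chip peak:

```python
# Rough throughput/utilization check (estimates, not measurements).
params = 3.2e9
tokens_per_sec = 5_400                       # per device, from the figures above
flops_per_token = 6 * params                 # ~6N approximation for forward + backward

achieved = tokens_per_sec * flops_per_token  # ~1.0e14 FLOP/s
peak_v4_bf16 = 275e12                        # TPU v4 per-chip bf16 peak
print(f"~{achieved/1e12:.0f} TFLOP/s per device, ~{achieved/peak_v4_bf16:.0%} of peak")
```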
Usage
Loading with MaxText
```bash
# Clone MaxText
git clone https://github.com/google/maxtext.git
cd maxtext

# Run inference
python MaxText/decode.py \
  MaxText/configs/base.yml \
  load_parameters_path=gs://your-bucket/kisoku-3.2b/checkpoints/99999/items \
  base_emb_dim=3072 \
  base_num_query_heads=32 \
  base_num_kv_heads=8 \
  base_mlp_dim=8192 \
  base_num_decoder_layers=32 \
  head_dim=96 \
  vocab_size=50304 \
  tokenizer_path=gpt2 \
  max_target_length=2048 \
  prompt="Your prompt here"
```
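Note that the key=value flags override defaults in MaxText's base.yml; they should match the Architecture table above so that the restored checkpoint shapes line up.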
Conversion to HuggingFace Format
A conversion script for transformers-compatible format is coming soon.
Limitations and Ethical Considerations
Known Limitations
- Base model only: This model has not been instruction-tuned or aligned with human preferences
- May generate harmful content: Without safety fine-tuning, the model may produce biased, toxic, or factually incorrect text
- English-centric: Primarily trained on English text with limited multilingual capability
- Context window: Limited to 2,048 tokens
- Not production-ready: Requires fine-tuning and safety evaluation before deployment
Recommended Use Cases
✅ Appropriate Uses:
- Research on language model behavior and capabilities
- Fine-tuning for specific downstream tasks
- Educational purposes and ML experimentation
- Building aligned models with additional training
❌ Not Recommended:
- Direct deployment in user-facing applications without fine-tuning
- Use cases requiring factual accuracy without verification
- Applications involving sensitive content or high-stakes decisions
- Scenarios where harmful outputs could cause real-world harm
Bias and Fairness
This model was trained on web-scraped data and may reflect biases present in the training corpus. Users should evaluate the model for bias and fairness issues specific to their use case before deployment.
Training Infrastructure
This model was trained using resources from the TPU Research Cloud (TRC) program, which provides free TPU access to researchers.
- Cloud Provider: Google Cloud Platform
- TPU Type: v4-32 (16 chips / 32 TensorCores across 4 hosts)
- Framework: MaxText (JAX/Flax)
- Region: us-central2-b
Citation
If you use this model in your research, please cite:
```bibtex
@software{kisoku2025,
  title  = {Kisoku: A 3.2B Parameter Language Model},
  author = {Rodriguez, Joseph},
  year   = {2025},
  url    = {https://huggingface.co/0arch-io/kisoku-3.2b-base},
  note   = {Trained using Google TRC program}
}
```
Acknowledgments
- Google TRC Program for providing TPU compute resources
- Google MaxText Team for the training framework
- DataComp Team for the DCLM-Baseline 1.0 dataset
- Open source community for tools and libraries
Model Card Contact
- Maintainer: Joseph Rodriguez
- Email: [email protected]
- Organization: 0ARCH
For questions, issues, or collaboration inquiries, please reach out via email or open an issue on the model repository.
Trained with MaxText • Powered by Google Cloud TPUs