Kisoku 3.2B

A 3.2-billion-parameter GPT-style language model trained from scratch on a Google Cloud TPU v4-32 slice

Model Overview

Kisoku 3.2B is a transformer-based causal language model trained on high-quality web text using Google's MaxText framework. The model employs Grouped Query Attention (GQA) for efficient inference and is optimized for TPU hardware.

Key Features

  • 3.2 billion parameters with an efficient GQA architecture
  • Trained on DCLM-Baseline 1.0, a curated corpus of high-quality web text
  • 100,000 training steps, reaching a final training loss of 2.733
  • Native TPU optimization via the MaxText/JAX framework
  • Apache 2.0 licensed for commercial and research use

Model Details

Architecture

Component             Configuration
--------------------  --------------------------------------
Model Type            Autoregressive Transformer (GPT-style)
Parameters            3.2 billion
Embedding Dimension   3,072
Attention Heads       32 query heads, 8 KV heads (GQA)
Head Dimension        96
MLP Hidden Dimension  8,192
Decoder Layers        32
Vocabulary Size       50,304 (GPT-2 tokenizer)
Max Sequence Length   2,048 tokens
Activation Function   GeLU
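
The 32-to-8 query/KV head ratio in the table means each group of four query heads shares a single key/value head, so the inference-time KV cache is a quarter the size it would be with full multi-head attention. Below is a minimal JAX sketch of that grouping using Kisoku's dimensions; the weight layout and repeat-based broadcasting are illustrative only, not MaxText's actual implementation.

import jax
import jax.numpy as jnp

EMB, N_Q, N_KV, HEAD = 3072, 32, 8, 96  # values from the table above

def gqa(x, wq, wk, wv, wo):
    """Grouped Query Attention sketch.

    x:  [batch, seq, EMB]
    wq: [EMB, N_Q * HEAD]   wk, wv: [EMB, N_KV * HEAD]   wo: [N_Q * HEAD, EMB]
    """
    b, s, _ = x.shape
    q = (x @ wq).reshape(b, s, N_Q, HEAD)   # 32 query heads
    k = (x @ wk).reshape(b, s, N_KV, HEAD)  # only 8 key heads
    v = (x @ wv).reshape(b, s, N_KV, HEAD)  # only 8 value heads
    # Broadcast each KV head to its group of N_Q // N_KV = 4 query heads.
    k = jnp.repeat(k, N_Q // N_KV, axis=2)
    v = jnp.repeat(v, N_Q // N_KV, axis=2)
    scores = jnp.einsum("bqhd,bkhd->bhqk", q, k) / HEAD ** 0.5
    causal = jnp.tril(jnp.ones((s, s), dtype=bool))  # causal LM mask
    scores = jnp.where(causal, scores, -1e9)
    out = jnp.einsum("bhqk,bkhd->bqhd", jax.nn.softmax(scores, axis=-1), v)
    return out.reshape(b, s, N_Q * HEAD) @ wo

With 8 KV heads of dimension 96, the per-token cache holds 2 × 8 × 96 = 1,536 values instead of the 2 × 32 × 96 = 6,144 that 32 full KV heads would require.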

Training Details

Dataset: DCLM-Baseline 1.0

  • High-quality filtered web text from DataComp
  • Curated for factuality and coherence
  • Primarily English language content
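
For a quick look at what the model was trained on, DCLM-Baseline 1.0 can be streamed from the Hugging Face Hub. The hub ID and the "text" field below are assumptions about the public DCLM release rather than details from this card.

# Stream a few documents from DCLM-Baseline 1.0; requires `pip install datasets`.
# The dataset ID and the "text" field name are assumptions about the public release.
from datasets import load_dataset

ds = load_dataset("mlfoundations/dclm-baseline-1.0", split="train", streaming=True)
for i, doc in enumerate(ds):
    print(doc["text"][:200].replace("\n", " "))
    if i == 2:
        break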

Training Configuration:

  • Total Steps: 100,000
  • Global Batch Size: 64 sequences (16 per host × 4 hosts; the total token count is worked out below)
  • Sequence Length: 2,048 tokens
  • Learning Rate: 2e-4 (initial)
  • Optimizer: AdamW
  • Training Duration: ~5 days on TPU v4-32
  • Checkpoint Frequency: Every 5,000 steps
  • Final Training Loss: 2.733
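
Taken together, the global batch size, sequence length, and step count above determine how many tokens the model saw during training, and the final loss converts directly into a training perplexity:

# Derived entirely from the training configuration listed above.
import math

tokens_per_step = 64 * 2048               # global batch size × sequence length
total_tokens = tokens_per_step * 100_000  # × total steps
print(f"total training tokens: ~{total_tokens / 1e9:.1f}B")  # ~13.1B

print(f"training perplexity: ~{math.exp(2.733):.1f}")  # exp(2.733) ≈ 15.4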

Hardware & Performance:

  • TPU Type: v4-32 (4 hosts, 32 TensorCores across 16 chips)
  • Region: us-central2-b (Google Cloud)
  • Throughput: ~115 TFLOP/s per device
  • Tokens/Second: ~5,400 per device
  • Training Framework: MaxText (JAX/Flax)
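
If "per device" above refers to a single v4 chip, the throughput figure can be turned into a rough model FLOPs utilization (MFU) estimate against the chip's published bf16 peak of about 275 TFLOP/s; both the per-chip reading and the peak figure are assumptions layered on top of the numbers in this card.

# Back-of-envelope MFU, assuming "per device" means one TPU v4 chip.
achieved_tflops = 115    # reported throughput per device (from above)
peak_bf16_tflops = 275   # approximate TPU v4 per-chip bf16 peak (assumption)
print(f"MFU ≈ {achieved_tflops / peak_bf16_tflops:.0%}")  # roughly 42%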

Usage

Loading with MaxText

# Clone MaxText
git clone https://github.com/google/maxtext.git
cd maxtext

# Run inference
python MaxText/decode.py \
  MaxText/configs/base.yml \
  load_parameters_path=gs://your-bucket/kisoku-3.2b/checkpoints/99999/items \
  base_emb_dim=3072 \
  base_num_query_heads=32 \
  base_num_kv_heads=8 \
  base_mlp_dim=8192 \
  base_num_decoder_layers=32 \
  head_dim=96 \
  vocab_size=50304 \
  tokenizer_path=gpt2 \
  max_target_length=2048 \
  prompt="Your prompt here"
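
The tokenizer_path=gpt2 flag points at the standard GPT-2 byte-pair-encoding tokenizer, which defines 50,257 tokens; the model's vocabulary size of 50,304 is presumably padded up to a multiple of 128 so the embedding table shards cleanly on TPU. A quick check with the transformers library (a convenience for inspection, not part of the MaxText command above):

# Inspect the GPT-2 tokenizer; requires `pip install transformers`.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
print(len(tok))          # 50257 tokens defined by the GPT-2 vocabulary
print(50304 - len(tok))  # 47 unused padding rows in the embedding table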

Conversion to HuggingFace Format

A script for converting the checkpoint to a transformers-compatible format is coming soon.
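
In the meantime, the raw checkpoint can be inspected with Orbax, the checkpointing library MaxText uses. The sketch below only restores the parameter tree and prints its structure to help plan a weight-name mapping; the bucket path is a placeholder, and the actual mapping to a transformers model is left out because it depends on MaxText's parameter layout.

# Restore the MaxText/Orbax checkpoint and list parameter names and shapes.
# The GCS path is a placeholder; point it at your copy of the checkpoint.
import jax
import orbax.checkpoint as ocp

ckptr = ocp.PyTreeCheckpointer()
params = ckptr.restore("gs://your-bucket/kisoku-3.2b/checkpoints/99999/items")

for path, leaf in jax.tree_util.tree_flatten_with_path(params)[0]:
    print(jax.tree_util.keystr(path), getattr(leaf, "shape", None))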

Limitations and Ethical Considerations

Known Limitations

  • Base model only: This model has not been instruction-tuned or aligned with human preferences
  • May generate harmful content: Without safety fine-tuning, the model may produce biased, toxic, or factually incorrect text
  • English-centric: Primarily trained on English text with limited multilingual capability
  • Context window: Limited to 2,048 tokens
  • Not production-ready: Requires fine-tuning and safety evaluation before deployment

Recommended Use Cases

Appropriate Uses:

  • Research on language model behavior and capabilities
  • Fine-tuning for specific downstream tasks
  • Educational purposes and ML experimentation
  • Building aligned models with additional training

Not Recommended:

  • Direct deployment in user-facing applications without fine-tuning
  • Use cases requiring factual accuracy without verification
  • Applications involving sensitive content or high-stakes decisions
  • Scenarios where harmful outputs could cause real-world harm

Bias and Fairness

This model was trained on web-scraped data and may reflect biases present in the training corpus. Users should evaluate the model for bias and fairness issues specific to their use case before deployment.

Training Infrastructure

This model was trained using resources from the TPU Research Cloud (TRC) program, which provides free TPU access to researchers.

  • Cloud Provider: Google Cloud Platform
  • TPU Type: v4-32 (32 TensorCores across 16 chips, 4 hosts)
  • Framework: MaxText (JAX/Flax)
  • Region: us-central2-b

Citation

If you use this model in your research, please cite:

@software{kisoku2025,
  title={Kisoku: A 3.2B Parameter Language Model},
  author={Rodriguez, Joseph},
  year={2025},
  url={https://huggingface.co/0arch-io/kisoku-3.2b-base},
  note={Trained using Google TRC program}
}

Acknowledgments

  • Google TRC Program for providing TPU compute resources
  • Google MaxText Team for the training framework
  • DataComp Team for the DCLM-Baseline 1.0 dataset
  • Open source community for tools and libraries

Model Card Contact

Maintainer: Joseph Rodriguez
Email: [email protected]
Organization: 0ARCH

For questions, issues, or collaboration inquiries, please reach out via email or open an issue on the model repository.


Trained with MaxText • Powered by Google Cloud TPUs
