---
language:
- lg
- en
library_name: unsloth
pipeline_tag: text-generation
license: gemma
base_model: unsloth/gemma-2-2b-it
tags:
- luganda
- gemma
- pretrained
- wikipedia
- unsloth
datasets:
- wikimedia/wikipedia
---

# Gemma-2-2b-it Pretrained for Luganda

## Model Description

This model is a continued pretraining of Gemma-2-2b-it on Luganda text. It was trained on Luganda Wikipedia articles to adapt it for Luganda language understanding and generation.

## Model Details

- **Base Model**: unsloth/gemma-2-2b-it
- **Pretraining Data**: Luganda Wikipedia articles (wikimedia/wikipedia, config 20231101.lg)
- **Training Method**: LoRA with unsloth optimization
- **Context Length**: 2048 tokens
- **Training Hardware**: Tesla T4 GPU

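
For orientation, a continued-pretraining setup matching these details would start by loading the base model with unsloth. The snippet below is a minimal sketch rather than the original training script; in particular, `load_in_4bit` is an assumption (a common choice for LoRA training on a Tesla T4), not something stated above.

```python
from unsloth import FastLanguageModel

# Load the base model at the context length listed above.
# load_in_4bit is an assumption, not taken from the model card.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/gemma-2-2b-it",
    max_seq_length = 2048,
    dtype = None,        # auto-detect (float16 on a Tesla T4)
    load_in_4bit = True,
)
```
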
## Training Process

The model was trained using the following configuration:

### LoRA Configuration

- LoRA rank (r): 128
- Target modules:
  - q_proj, k_proj, v_proj, o_proj
  - gate_proj, up_proj, down_proj
  - embed_tokens, lm_head
- LoRA alpha: 32
- LoRA dropout: 0
- Uses RS-LoRA (Rank-Stabilized LoRA)

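
In unsloth, this configuration corresponds roughly to the `FastLanguageModel.get_peft_model` call sketched below. It is reconstructed from the bullet points above rather than copied from the original training code, and `use_gradient_checkpointing` is an assumption.

```python
# Sketch: apply the LoRA configuration above to the `model` loaded earlier.
# Values mirror the bullet list; anything else is an assumption.
model = FastLanguageModel.get_peft_model(
    model,
    r = 128,
    target_modules = [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "embed_tokens", "lm_head",           # trained because this is continued pretraining
    ],
    lora_alpha = 32,
    lora_dropout = 0,
    use_rslora = True,                       # Rank-Stabilized LoRA
    use_gradient_checkpointing = "unsloth",  # assumption: memory-saving default on a T4
)
```
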
### Training Parameters

- Batch size: 2, with gradient accumulation steps of 8 (effective batch size 16)
- Learning rates:
  - General: 5e-5
  - Embeddings: 1e-6 (reduced for stability)
- Training epochs: 10
- Warmup steps: 10
- Warmup ratio: 0.1
- Weight decay: 0.01
- Optimizer: AdamW 8-bit
- LR scheduler: linear

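
Unsloth's `UnslothTrainingArguments` exposes a separate `embedding_learning_rate`, so a trainer matching these parameters might look roughly like the sketch below. The `train_dataset` variable, `output_dir`, and the `fp16` flag are placeholders or assumptions rather than values taken from the actual run.

```python
from unsloth import UnslothTrainer, UnslothTrainingArguments

# Sketch of a trainer matching the parameters above.
trainer = UnslothTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset,       # formatted Wikipedia split (see Data Processing below)
    dataset_text_field = "text",
    max_seq_length = 2048,
    args = UnslothTrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 8,
        learning_rate = 5e-5,
        embedding_learning_rate = 1e-6,  # lower LR for embed_tokens / lm_head
        num_train_epochs = 10,
        warmup_steps = 10,
        warmup_ratio = 0.1,
        weight_decay = 0.01,
        optim = "adamw_8bit",
        lr_scheduler_type = "linear",
        fp16 = True,                     # assumption: Tesla T4 has no bfloat16 support
        output_dir = "outputs",          # placeholder
    ),
)
trainer.train()
```

Note that when both `warmup_steps` and `warmup_ratio` are set, the Hugging Face trainer gives precedence to `warmup_steps`.
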
### Data Processing

The training data was processed using the following template:

```
Ekyawandiikibwa kya Wikipedia
### Omutwe: {title}

### Akawayiro:
{text}
```

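
A preprocessing step consistent with this template could look like the sketch below. The `title` and `text` columns come from the wikimedia/wikipedia dataset; the mapping function and the appended EOS token are illustrations rather than the original preprocessing code, and the `tokenizer` is the one from the loading sketch above.

```python
from datasets import load_dataset

# Luganda Wikipedia dump listed in Model Details.
dataset = load_dataset("wikimedia/wikipedia", "20231101.lg", split = "train")

TEMPLATE = "Ekyawandiikibwa kya Wikipedia\n### Omutwe: {title}\n\n### Akawayiro:\n{text}"

def format_articles(examples):
    # Appending the EOS token separates documents during packing (assumption).
    texts = [
        TEMPLATE.format(title = title, text = text) + tokenizer.eos_token
        for title, text in zip(examples["title"], examples["text"])
    ]
    return {"text": texts}

train_dataset = dataset.map(format_articles, batched = True)
```
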
## Checkpoints

This repository contains multiple checkpoints from the pretraining process:

- checkpoint-500
- checkpoint-1000
- checkpoint-1500
- checkpoint-2000
- checkpoint-2500
- checkpoint-2530 (final)

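
To experiment with a single intermediate checkpoint without downloading the whole repository, one option is to fetch just that folder and load it locally. The sketch below uses `huggingface_hub.snapshot_download`; it assumes the checkpoint folder contains a complete, loadable adapter/model snapshot, and the choice of checkpoint-1000 is arbitrary.

```python
from huggingface_hub import snapshot_download
from unsloth import FastLanguageModel

# Download only the checkpoint-1000 folder.
local_dir = snapshot_download(
    repo_id = "Bronsn/gemma-2-2b-it-pretrained",
    allow_patterns = ["checkpoint-1000/*"],
)

# Assumes the folder holds everything needed for loading (adapter/model + tokenizer files).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = f"{local_dir}/checkpoint-1000",
    max_seq_length = 2048,
    load_in_4bit = True,
)
```
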
## Usage

```python
from unsloth import FastLanguageModel
import torch

# Load the model (4-bit for low-memory inference)
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Bronsn/gemma-2-2b-it-pretrained",
    max_seq_length = 2048,
    dtype = None,  # auto-detect
    load_in_4bit = True,
)
FastLanguageModel.for_inference(model)  # enable unsloth's faster inference mode

# Example usage with the pretraining template
text = "Ekyawandiikibwa kya Wikipedia\n### Omutwe: Uganda\n\n### Akawayiro:\n"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Limitations

- The model is adapted specifically for Luganda text understanding and generation
- Performance may vary on dialectal variations or code-mixed text
- The model inherits the limitations of the base Gemma-2-2b-it model

## Citation

If you use this model, please cite:

```
@misc{luganda-gemma-pretrained,
  author    = {Bronsn},
  title     = {Gemma-2-2b-it Pretrained for Luganda},
  year      = {2025},
  publisher = {HuggingFace}
}
```

## License

This model inherits the licensing terms from the base Gemma-2-2b-it model. For more details, please refer to [Gemma's license](https://ai.google.dev/gemma/terms).