Model Summary
OLMoE with Adapters
This repository contains an extension of the OLMo model with adapter layers for parameter-efficient fine-tuning. By adding small adapter modules to the model, we can fine-tune it on downstream tasks while freezing most of the original parameters, resulting in much more efficient training.
Model Architecture
The OlmoEWithAdaptersForCausalLM model extends the original OLMo architecture by:
- Adding small adapter layers (bottleneck layers) to each MLP block (see the sketch below)
- Allowing selective freezing of the base model's parameters
- Training only the adapter parameters (~0.1-1% of total parameters)
Key components:
- OlmoEWithAdaptersMLP: MLP layer with additional adapter modules
- OlmoEWithAdaptersDecoderLayer: Decoder layer incorporating adapter MLPs
- OlmoEWithAdaptersModel: Full model with adapter-based decoder layers
- OlmoEWithAdaptersForCausalLM: Causal language model with adapters
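The exact adapter implementation is not reproduced in this card, but the idea is a standard bottleneck adapter inserted into each MLP block. Below is a minimal sketch, assuming a down-projection/non-linearity/up-projection stack with a residual connection; the AdapterLayer name, its arguments, and the zero-initialized up-projection are illustrative assumptions, not the repository's actual code.

import torch
import torch.nn as nn

class AdapterLayer(nn.Module):
    # Bottleneck adapter: down-project, non-linearity, up-project, residual add.
    # (Illustrative sketch; names and wiring are assumptions, not the repo's code.)
    def __init__(self, hidden_size: int, adapter_size: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_size, adapter_size)
        self.activation = nn.GELU()
        self.up_proj = nn.Linear(adapter_size, hidden_size)
        # Zero-init the up-projection so the adapter starts as an identity mapping.
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        residual = hidden_states
        hidden_states = self.up_proj(self.activation(self.down_proj(hidden_states)))
        return residual + hidden_states

# Inside OlmoEWithAdaptersMLP, the adapter would wrap the original MLP output,
# roughly: hidden_states = self.adapter(mlp_output)

Zero-initializing the up-projection makes the adapter a no-op at the start of fine-tuning, so training begins from the unmodified base model's behavior.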
Training Script
The train_olmoe_adapters.py script provides a complete workflow for fine-tuning the model:
Features:
- Parameter-efficient fine-tuning using adapters
- Support for various datasets through Hugging Face datasets library
- Customizable adapter size
- Option to freeze/unfreeze different components
- Training with AdamW optimizer and learning rate scheduling (see the setup sketch after this list)
- Evaluation with perplexity metrics (see the evaluation sketch after the usage example)
- Model checkpointing and saving
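The freezing and optimizer setup listed above might look like the sketch below. It assumes adapter parameter names contain the substring "adapter"; the setup_adapter_training helper and its defaults are illustrative, not the script's actual code.

import torch
from transformers import get_linear_schedule_with_warmup

def setup_adapter_training(model, learning_rate=5e-5, warmup_steps=100, total_steps=10_000):
    # Freeze everything except the adapter parameters.
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name

    # Optimize only the (small) set of trainable parameters.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(trainable, lr=learning_rate)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )

    # Report the trainable fraction (should land in the ~0.1-1% range).
    n_trainable = sum(p.numel() for p in trainable)
    n_total = sum(p.numel() for p in model.parameters())
    print(f"Trainable params: {n_trainable:,} / {n_total:,} ({100 * n_trainable / n_total:.2f}%)")
    return optimizer, scheduler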
Usage:
python train_olmoe_adapters.py \
--model_name_or_path allenai/OLMo-7B \
--adapter_size 64 \
--freeze_base_model True \
--dataset_name wikitext \
--dataset_config_name wikitext-2-raw-v1 \
--output_dir ./olmoe-adapter-finetuned \
--num_train_epochs 3 \
--per_device_train_batch_size 4 \
--per_device_eval_batch_size 4 \
--learning_rate 5e-5 \
--warmup_steps 100 \
--logging_steps 100 \
--save_steps 1000 \
--seed 42
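The evaluation step reports perplexity, i.e. the exponential of the mean token-level cross-entropy on the held-out split. A minimal sketch, assuming batches of input_ids/attention_mask with input_ids reused as labels; the per-batch averaging is approximate when batches differ in token count, and the function name is hypothetical.

import math
import torch

@torch.no_grad()
def evaluate_perplexity(model, eval_dataloader, device="cuda"):
    model.eval()
    total_loss, num_batches = 0.0, 0
    for batch in eval_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # The model computes the shifted-label cross-entropy internally.
        outputs = model(**batch, labels=batch["input_ids"])
        total_loss += outputs.loss.item()
        num_batches += 1
    mean_loss = total_loss / max(num_batches, 1)
    return math.exp(mean_loss)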
Benefits of Adapter-Based Fine-Tuning
- Efficiency: Train only ~0.1-1% of the parameters, dramatically reducing GPU memory requirements
- Storage: Store only adapter weights rather than full fine-tuned models (see the sketch after this list)
- Composability: Multiple adapters can be trained for different tasks and swapped at inference time
- Reduced Overfitting: Lower parameter count helps prevent overfitting on small datasets
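One way the storage and composability benefits could be realized is to checkpoint only the adapter parameters and swap them on top of a shared, frozen base model. The sketch below again assumes adapter parameter names contain "adapter"; the helper names and file names are hypothetical.

import torch

def save_adapter_weights(model, path):
    # Keep only the adapter tensors; this file is tiny compared with the full model.
    adapter_state = {k: v.cpu() for k, v in model.state_dict().items() if "adapter" in k}
    torch.save(adapter_state, path)

def load_adapter_weights(model, path):
    # Swap in a task-specific adapter without touching the frozen base weights.
    adapter_state = torch.load(path, map_location="cpu")
    model.load_state_dict(adapter_state, strict=False)
    return model

# Example: reuse one base model for two tasks by switching adapters at inference time.
# save_adapter_weights(model, "adapter_task_a.pt")
# load_adapter_weights(model, "adapter_task_b.pt")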
How to Use the Fine-Tuned Model
from transformers import AutoTokenizer
from modeling_olmoe import OlmoEWithAdaptersForCausalLM
# Load the fine-tuned model
model = OlmoEWithAdaptersForCausalLM.from_pretrained("./olmoe-adapter-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./olmoe-adapter-finetuned")
# Generate text
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Adapter Size Recommendations
The adapter size determines the parameter efficiency vs. performance trade-off:
- Small datasets: 16-32 dimensions
- Medium datasets: 64-128 dimensions
- Large datasets: 128-256 dimensions
For most fine-tuning scenarios, an adapter size of 64 provides a good balance between efficiency and performance.
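As a rough sanity check on the ~0.1-1% figure above, the back-of-the-envelope calculation below estimates the adapter parameter count for an adapter size of 64. The hidden size (4096), layer count (32), and per-layer layout (one down/up projection pair with biases) are assumptions chosen to match a 7B-scale base model, not values taken from this repository.

# Illustrative parameter count for adapter_size = 64 on an assumed 7B-scale model.
hidden_size, num_layers, adapter_size = 4096, 32, 64
per_layer = (hidden_size * adapter_size + adapter_size      # down projection + bias
             + adapter_size * hidden_size + hidden_size)    # up projection + bias
adapter_params = per_layer * num_layers                     # ~16.9M parameters
fraction = adapter_params / 7_000_000_000                   # ~0.24% of a 7B model
print(f"{adapter_params:,} adapter params (~{100 * fraction:.2f}% of 7B)")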