Model Summary

OLMoE with Adapters

This repository contains an extension of the OLMoE model with adapter layers for parameter-efficient fine-tuning. By adding small adapter modules to the model, we can fine-tune it on downstream tasks while freezing most of the original parameters, resulting in much more efficient training.

Model Architecture

The OlmoEWithAdaptersForCausalLM model extends the original OLMoE architecture by:

  1. Adding small adapter layers (bottleneck layers) to each MLP block
  2. Allowing selective freezing of the base model's parameters
  3. Training only the adapter parameters (~0.1-1% of total parameters)

Key components:

  • OlmoEWithAdaptersMLP: MLP layer with additional bottleneck adapter modules (see the sketch after this list)
  • OlmoEWithAdaptersDecoderLayer: Decoder layer incorporating adapter MLPs
  • OlmoEWithAdaptersModel: Full model with adapter-based decoder layers
  • OlmoEWithAdaptersForCausalLM: Causal language model with adapters
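The adapter modules follow the standard bottleneck pattern: a down-projection, a nonlinearity, and an up-projection whose output is added residually to the hidden states. Below is a minimal sketch of that pattern (hypothetical class name and initialization choices; the actual implementation lives in modeling_olmoe.py):

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, added residually to the hidden states."""

    def __init__(self, hidden_size: int, adapter_size: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_size, adapter_size)
        self.up_proj = nn.Linear(adapter_size, hidden_size)
        self.act = nn.GELU()
        # Start the up-projection at zero so the adapter is initially an identity map.
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up_proj(self.act(self.down_proj(hidden_states)))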

Training Script

The train_olmoe_adapters.py script provides a complete workflow for fine-tuning the model:

Features:

  • Parameter-efficient fine-tuning using adapters
  • Support for various datasets through Hugging Face datasets library
  • Customizable adapter size
  • Option to freeze/unfreeze different components (see the sketch after this list)
  • Training with AdamW optimizer and learning rate scheduling
  • Evaluation with perplexity metrics
  • Model checkpointing and saving
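A minimal sketch of the freeze/train split and the resulting trainable-parameter count (an illustration only; it assumes adapter parameters can be identified by the substring "adapter" in their names, which may differ from the script's actual convention):

def freeze_base_model(model, train_adapters_only: bool = True):
    """Freeze every parameter except those belonging to adapter modules."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        # Keep gradients only for adapter parameters when training adapters alone.
        param.requires_grad = ("adapter" in name) or not train_adapters_only
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")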

Usage:

python train_olmoe_adapters.py \
    --model_name_or_path allenai/OLMo-7B \
    --adapter_size 64 \
    --freeze_base_model True \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --output_dir ./olmoe-adapter-finetuned \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --learning_rate 5e-5 \
    --warmup_steps 100 \
    --logging_steps 100 \
    --save_steps 1000 \
    --seed 42
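The optimizer and scheduler setup implied by the defaults above would look roughly like this (a sketch assuming the script uses transformers' linear warmup schedule; build_optimizer_and_scheduler is an illustrative helper, not part of the repository):

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_scheduler(model, total_training_steps, lr=5e-5, warmup_steps=100):
    """AdamW over the trainable (adapter) parameters with linear warmup and decay."""
    optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_training_steps
    )
    return optimizer, scheduler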

Benefits of Adapter-Based Fine-Tuning

  1. Efficiency: Train only ~0.1-1% of the parameters, dramatically reducing GPU memory requirements
  2. Storage: Store only adapter weights rather than full fine-tuned models
  3. Composability: Multiple adapters can be trained for different tasks and swapped at inference time (see the sketch after this list)
  4. Reduced Overfitting: Lower parameter count helps prevent overfitting on small datasets
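The storage and composability points can be illustrated by saving and loading only the adapter weights (hypothetical helper functions shown for illustration; the repository may provide its own utilities):

import torch

def save_adapter_weights(model, path: str):
    """Persist only the adapter parameters instead of a full model checkpoint."""
    adapter_state = {k: v for k, v in model.state_dict().items() if "adapter" in k}
    torch.save(adapter_state, path)

def load_adapter_weights(model, path: str):
    """Swap a task-specific adapter into a shared frozen base model."""
    model.load_state_dict(torch.load(path), strict=False)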

How to Use the Fine-Tuned Model

from transformers import AutoTokenizer
from modeling_olmoe import OlmoEWithAdaptersForCausalLM

# Load the fine-tuned model
model = OlmoEWithAdaptersForCausalLM.from_pretrained("./olmoe-adapter-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./olmoe-adapter-finetuned")

# Generate text
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Adapter Size Recommendations

The adapter size determines the parameter efficiency vs. performance trade-off:

  • Small datasets: 16-32 dimensions
  • Medium datasets: 64-128 dimensions
  • Large datasets: 128-256 dimensions

For most fine-tuning scenarios, an adapter size of 64 provides a good balance between efficiency and performance.
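As a back-of-the-envelope check of the efficiency claim (a sketch assuming a 4096-dimensional hidden size and 32 decoder layers, roughly a 7B-parameter configuration, with one down- and one up-projection per adapter):

def adapter_param_count(hidden_size: int, adapter_size: int, num_layers: int) -> int:
    """Parameters in one bottleneck adapter per layer: two linear projections plus biases."""
    down = hidden_size * adapter_size + adapter_size
    up = adapter_size * hidden_size + hidden_size
    return (down + up) * num_layers

print(adapter_param_count(hidden_size=4096, adapter_size=64, num_layers=32))
# ≈ 16.9M adapter parameters, about 0.24% of a 7B-parameter base model.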
