Model Summary

OLMoE with Adapters

This repository contains an extension of the OLMoE model with adapter layers for parameter-efficient fine-tuning. By adding small adapter modules to the model, we can fine-tune it on downstream tasks while freezing most of the original parameters, resulting in much more efficient training.

Model Architecture

The OlmoEWithAdaptersForCausalLM model extends the original OLMoE architecture by:

  1. Adding small adapter layers (bottleneck layers) to each MLP block
  2. Allowing selective freezing of the base model's parameters
  3. Training only the adapter parameters (~0.1-1% of total parameters)

Key components:

  • OlmoEWithAdaptersMLP: MLP layer with additional bottleneck adapter modules (see the sketch after this list)
  • OlmoEWithAdaptersDecoderLayer: Decoder layer incorporating adapter MLPs
  • OlmoEWithAdaptersModel: Full model with adapter-based decoder layers
  • OlmoEWithAdaptersForCausalLM: Causal language model with adapters
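The adapter modules follow the standard bottleneck pattern: a down-projection, a nonlinearity, and an up-projection whose output is added residually to the hidden states. Below is a minimal sketch of that pattern (hypothetical class name and initialization choices; the actual implementation lives in modeling_olmoe.py):

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> nonlinearity -> up-project, added residually to the hidden states."""

    def __init__(self, hidden_size: int, adapter_size: int = 64):
        super().__init__()
        self.down_proj = nn.Linear(hidden_size, adapter_size)
        self.up_proj = nn.Linear(adapter_size, hidden_size)
        self.act = nn.GELU()
        # Start the up-projection at zero so the adapter is initially an identity map.
        nn.init.zeros_(self.up_proj.weight)
        nn.init.zeros_(self.up_proj.bias)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + self.up_proj(self.act(self.down_proj(hidden_states)))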

Training Script

The train_olmoe_adapters.py script provides a complete workflow for fine-tuning the model:

Features:

  • Parameter-efficient fine-tuning using adapters
  • Support for various datasets through Hugging Face datasets library
  • Customizable adapter size
  • Option to freeze/unfreeze different components (see the sketch after this list)
  • Training with AdamW optimizer and learning rate scheduling
  • Evaluation with perplexity metrics
  • Model checkpointing and saving
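A minimal sketch of the freeze/train split and the resulting trainable-parameter count (an illustration only; it assumes adapter parameters can be identified by the substring "adapter" in their names, which may differ from the script's actual convention):

def freeze_base_model(model, train_adapters_only: bool = True):
    """Freeze every parameter except those belonging to adapter modules."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        # Keep gradients only for adapter parameters when training adapters alone.
        param.requires_grad = ("adapter" in name) or not train_adapters_only
        total += param.numel()
        if param.requires_grad:
            trainable += param.numel()
    print(f"Trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")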

Usage:

python train_olmoe_adapters.py \
    --model_name_or_path allenai/OLMo-7B \
    --adapter_size 64 \
    --freeze_base_model True \
    --dataset_name wikitext \
    --dataset_config_name wikitext-2-raw-v1 \
    --output_dir ./olmoe-adapter-finetuned \
    --num_train_epochs 3 \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --learning_rate 5e-5 \
    --warmup_steps 100 \
    --logging_steps 100 \
    --save_steps 1000 \
    --seed 42
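The optimizer and scheduler setup implied by the defaults above would look roughly like this (a sketch assuming the script uses transformers' linear warmup schedule; build_optimizer_and_scheduler is an illustrative helper, not part of the repository):

from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

def build_optimizer_and_scheduler(model, total_training_steps, lr=5e-5, warmup_steps=100):
    """AdamW over the trainable (adapter) parameters with linear warmup and decay."""
    optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=lr)
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_training_steps
    )
    return optimizer, scheduler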

Benefits of Adapter-Based Fine-Tuning

  1. Efficiency: Train only ~0.1-1% of the parameters, dramatically reducing GPU memory requirements
  2. Storage: Store only adapter weights rather than full fine-tuned models
  3. Composability: Multiple adapters can be trained for different tasks and swapped at inference time (see the sketch after this list)
  4. Reduced Overfitting: Lower parameter count helps prevent overfitting on small datasets
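The storage and composability points can be illustrated by saving and loading only the adapter weights (hypothetical helper functions shown for illustration; the repository may provide its own utilities):

import torch

def save_adapter_weights(model, path: str):
    """Persist only the adapter parameters instead of a full model checkpoint."""
    adapter_state = {k: v for k, v in model.state_dict().items() if "adapter" in k}
    torch.save(adapter_state, path)

def load_adapter_weights(model, path: str):
    """Swap a task-specific adapter into a shared frozen base model."""
    model.load_state_dict(torch.load(path), strict=False)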

How to Use the Fine-Tuned Model

from transformers import AutoTokenizer
from modeling_olmoe import OlmoEWithAdaptersForCausalLM

# Load the fine-tuned model
model = OlmoEWithAdaptersForCausalLM.from_pretrained("./olmoe-adapter-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./olmoe-adapter-finetuned")

# Generate text
inputs = tokenizer("Once upon a time", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Adapter Size Recommendations

The adapter size determines the parameter efficiency vs. performance trade-off:

  • Small datasets: 16-32 dimensions
  • Medium datasets: 64-128 dimensions
  • Large datasets: 128-256 dimensions

For most fine-tuning scenarios, an adapter size of 64 provides a good balance between efficiency and performance.
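As a back-of-the-envelope check of the efficiency claim (a sketch assuming a 4096-dimensional hidden size and 32 decoder layers, roughly a 7B-parameter configuration, with one down- and one up-projection per adapter):

def adapter_param_count(hidden_size: int, adapter_size: int, num_layers: int) -> int:
    """Parameters in one bottleneck adapter per layer: two linear projections plus biases."""
    down = hidden_size * adapter_size + adapter_size
    up = adapter_size * hidden_size + hidden_size
    return (down + up) * num_layers

print(adapter_param_count(hidden_size=4096, adapter_size=64, num_layers=32))
# ≈ 16.9M adapter parameters, about 0.24% of a 7B-parameter base model.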
