# Apertus-8B European Multilingual SFT Checkpoint (Step 80,000)
FSDP checkpoint for resuming multilingual SFT training on Apertus-8B.
## Training Progress
| Metric | Value |
|---|---|
| Global Step | 80,000 / 256,137 |
| Epoch | 0.312 (31.2%) |
| Samples Processed | 2,560,000 / 8,139,164 |
| Loss | 0.73 |
| Accuracy | 78.4% |
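The sample count follows from the effective batch size. A quick arithmetic check (8 GPUs is an assumption read off the `rng_state_[0-7].pth` files in the checkpoint):

```python
# Effective batch size = per-device batch × gradient accumulation × GPU count.
per_device_batch = 1
grad_accum = 4
num_gpus = 8  # assumed from rng_state_[0-7].pth
effective_batch = per_device_batch * grad_accum * num_gpus
samples_at_step = 80_000 * effective_batch
print(effective_batch, samples_at_step)  # 32 2560000
```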
## Per-Language Position
| Language | Samples Seen | Total | Progress |
|---|---|---|---|
| German (de) | ~640,000 | 2,018,145 | 31.7% |
| Spanish (es) | ~640,000 | 2,050,976 | 31.2% |
| French (fr) | ~640,000 | 2,045,181 | 31.3% |
| Italian (it) | ~640,000 | 2,024,862 | 31.6% |
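The per-language progress figures above are consistent with equal interleaving across the four languages, i.e. a quarter of the processed samples each:

```python
# Verify per-language progress assuming equal interleaving of the 4 languages.
samples_processed = 2_560_000
totals = {"de": 2_018_145, "es": 2_050_976, "fr": 2_045_181, "it": 2_024_862}
per_language = samples_processed // len(totals)  # 640,000 each
progress = {lang: per_language / n for lang, n in totals.items()}
for lang, p in progress.items():
    print(f"{lang}: {p:.1%}")  # matches the table: 31.7%, 31.2%, 31.3%, 31.6%
```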
## Training Configuration

```python
model = "swiss-ai/Apertus-8B-Instruct-2509"
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
learning_rate = 2e-6
num_train_epochs = 1
warmup_ratio = 0.03
lr_scheduler_type = "linear"
bf16 = True
gradient_checkpointing = True
```
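These values map directly onto `transformers.TrainingArguments`. A minimal sketch for reconstructing the arguments (the `output_dir` is a placeholder, not from the original run):

```python
from transformers import TrainingArguments

# Sketch of the reported configuration; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="./apertus-8b-eu-sft",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=2e-6,
    num_train_epochs=1,
    warmup_ratio=0.03,
    lr_scheduler_type="linear",
    bf16=True,
    gradient_checkpointing=True,
)
```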
## Checkpoint Contents

```
checkpoint-80000/
├── pytorch_model_fsdp_0/    # FSDP sharded model weights
├── optimizer_0/             # Optimizer states
├── rng_state_[0-7].pth      # RNG states for 8 GPUs
├── scheduler.pt             # LR scheduler state
└── trainer_state.json       # Step, epoch, metrics
```
## How to Resume Training

```python
from transformers import Trainer

trainer = Trainer(
    model=model,              # swiss-ai/Apertus-8B-Instruct-2509
    args=training_args,       # same TrainingArguments as the original run
    train_dataset=dataset,    # same pre-tokenized dataset and sampling order
)

# Resume from checkpoint: restores model, optimizer, scheduler, and RNG states
trainer.train(resume_from_checkpoint="./checkpoint-80000")
```
## Dataset

Pre-tokenized Arrow datasets with interleaved sampling from 4 European languages:
- Total: 8,139,164 samples
- Format: `input_ids`, `labels`, `attention_mask`
- Sequence Length: Variable (pre-tokenized)
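In the `datasets` library, `interleave_datasets` handles this kind of multi-language mixing over Arrow datasets. The underlying round-robin idea, sketched in plain Python with dummy records standing in for the pre-tokenized samples:

```python
def round_robin(*iterables):
    """Yield one item from each iterable in turn until all are exhausted."""
    iterators = [iter(it) for it in iterables]
    while iterators:
        remaining = []
        for it in iterators:
            try:
                yield next(it)
                remaining.append(it)
            except StopIteration:
                pass
        iterators = remaining

# Dummy per-language streams standing in for the pre-tokenized Arrow datasets.
streams = [
    [{"lang": code, "idx": i} for i in range(2)]
    for code in ("de", "es", "fr", "it")
]
order = [ex["lang"] for ex in round_robin(*streams)]
print(order)  # ['de', 'es', 'fr', 'it', 'de', 'es', 'fr', 'it']
```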
## Notes

- This is an FSDP-sharded checkpoint from an 8-GPU training run
- Includes RNG states for exact dataloader position resumption
- ~15 days of training remain to complete epoch 1
## Model Tree

ctauchmann/apertus-8b-eu-sft-ckpt-80k continues training from swiss-ai/Apertus-8B-Instruct-2509, which is itself a fine-tune of the base model swiss-ai/Apertus-8B-2509.