Distilled with the Distily library, using gpt2 as the teacher model, on the wikimedia/wikipedia dataset.
Architecture: GPT2LMHeadModel
GPT2LMHeadModel -> GPT2LMHeadModel
--- teacher model modules
+++ student model modules
@@ -4,7 +4,7 @@
(wpe): Embedding(1024, 768)
(drop): Dropout(p=0.1, inplace=False)
(h): ModuleList(
- (0-11): 12 x GPT2Block(
+ (0-5): 6 x GPT2Block(
(ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
(attn): GPT2FlashAttention2(
(c_attn): Conv1D()
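The diff above shows the student keeping the teacher's 768-wide blocks while halving their count from 12 to 6. As a rough illustration of what that means for model size, the sketch below counts parameters for both depths. The vocabulary size (50257), the MLP width (3072), and the tying of `lm_head` to the token embedding are assumptions taken from the standard GPT-2 small configuration; they are not shown in the excerpt.

```python
def gpt2_block_params(d_model: int, d_mlp: int) -> int:
    """Parameters in one GPT2Block: attention, MLP, and two LayerNorms."""
    c_attn = d_model * 3 * d_model + 3 * d_model  # fused Q, K, V projection
    attn_proj = d_model * d_model + d_model       # attention output projection
    mlp_fc = d_model * d_mlp + d_mlp              # MLP up-projection
    mlp_proj = d_mlp * d_model + d_model          # MLP down-projection
    layer_norms = 2 * (2 * d_model)               # ln_1 and ln_2, weight + bias each
    return c_attn + attn_proj + mlp_fc + mlp_proj + layer_norms


def gpt2_total_params(
    n_layer: int,
    d_model: int = 768,       # from the diff: LayerNorm((768,))
    n_positions: int = 1024,  # from the diff: wpe Embedding(1024, 768)
    vocab_size: int = 50257,  # assumption: standard GPT-2 vocabulary
    d_mlp: int = 3072,        # assumption: 4 * d_model, as in GPT-2 small
) -> int:
    """Total parameters, assuming lm_head shares its weights with wte."""
    embeddings = vocab_size * d_model + n_positions * d_model  # wte + wpe
    final_norm = 2 * d_model                                   # ln_f
    return embeddings + n_layer * gpt2_block_params(d_model, d_mlp) + final_norm


teacher_params = gpt2_total_params(n_layer=12)
student_params = gpt2_total_params(n_layer=6)
print(f"teacher: {teacher_params:,}  student: {student_params:,}")
```

Because the embeddings are unchanged, halving the block count removes well under half of the total parameters.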
Trained on 6,814,337 tokens from the wikimedia/wikipedia dataset.
Samples: 9,900
Subset: 20231101.en
Split: train

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2, projector=orthogonal))
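The objective above combines a KL-divergence loss on the logits (weight 1) with an MSE loss on attention outputs (weight 5). The following is a minimal pure-Python sketch of how such a weighted sum could be computed for a single token position. It is not Distily's implementation: the KL direction (teacher relative to student), the reading of `raw_mse` as a plain mean squared error, and all function names are assumptions, and the `layer_mapper` and `projector` steps are left out entirely.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) between the two softmax distributions (assumed direction)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(teacher_values, student_values):
    """Plain mean squared error between two equal-length feature vectors."""
    squared = [(t - s) ** 2 for t, s in zip(teacher_values, student_values)]
    return sum(squared) / len(squared)


def distillation_loss(teacher_logits, student_logits, teacher_attn, student_attn,
                      logits_weight=1.0, attn_weight=5.0):
    """Weighted sum of the logits loss and the attention loss (weights from the card)."""
    logits_loss = kl_divergence(teacher_logits, student_logits)
    attn_loss = mse(teacher_attn, student_attn)
    return logits_weight * logits_loss + attn_weight * attn_loss


# Hypothetical toy inputs: identical logits, slightly different attention features.
loss = distillation_loss([0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 0.0])
print(loss)
```

With the attention weight at 5, a mismatch in attention features contributes five times as much per unit of loss as a mismatch in the output distribution.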
The following hyperparameters were used during training:
0.00024842
Adam with betas=(0.9,0.999) and epsilon=1e-08
polynomial
1.0
DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=5, loss_fn=raw_mse, layer_mapper=layer-2, projector=orthogonal))
<torch.optim.lr_scheduler.LambdaLR object at 0x7fed757e56c0>
None
distilbert/distilgpt2
None
None
[('lm_head', False)]
False
gpt2
False
False
wikimedia/wikipedia
20231101.en
train
text
100000.0110.01.000
True

Base model: distilbert/distilgpt2
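The hyperparameters above name a polynomial learning-rate schedule held in a `LambdaLR` object, which scales the initial learning rate by a per-step multiplier. The sketch below shows one way such a multiplier could be computed. The function name, the default `power=1.0` and `lr_end=0.0`, and the example values for `lr_init` and `total_steps` are hypothetical and are not taken from this training run.

```python
def polynomial_decay_factor(step, total_steps, lr_init, lr_end=0.0, power=1.0):
    """Multiplier for lr_init at a given step, in the style expected by LambdaLR."""
    if step >= total_steps:
        # After the schedule ends, hold the learning rate at lr_end.
        return lr_end / lr_init
    remaining = 1.0 - step / total_steps
    decayed_lr = (lr_init - lr_end) * remaining ** power + lr_end
    return decayed_lr / lr_init


# Hypothetical values for illustration only.
lr_init = 1e-3
total_steps = 100
factors = [polynomial_decay_factor(s, total_steps, lr_init) for s in (0, 50, 100)]
print(factors)
```

With `power=1.0` the schedule reduces to a linear decay; larger powers drop the learning rate faster early in training.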