About rope theta
#20
by
spcmxxd
- opened
Dear author,
I noticed that in the `RotaryEmbedding` class, `base = base * self.rope_ratio`, where `base == 10000` and `rope_ratio == 10000`.
May I ask whether glm-4-9b-chat-1m was also trained with this value at the training stage, i.e., base = 10000 * 10000 = 10^8, so that the training and inference stages use the same hyperparameter values?
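For context, here is a minimal sketch of how such a `rope_ratio` multiplier would change the RoPE inverse frequencies (the function name and signature are my own illustration, not the repository's actual `RotaryEmbedding` code):

```python
# Sketch of RoPE inverse-frequency computation with a base multiplier,
# mirroring `base = base * self.rope_ratio` from the question.
def rope_inv_freq(dim: int, base: float = 10000.0, rope_ratio: float = 1.0):
    # Effective base after scaling; with rope_ratio == 10000 this
    # becomes 10000 * 10000 = 1e8.
    base = base * rope_ratio
    # Standard RoPE: inv_freq_i = base^(-2i/dim) for i in [0, dim/2)
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

# A larger effective base yields smaller rotation frequencies
# (longer periods), which is how the context window is extended.
default_freqs = rope_inv_freq(64, rope_ratio=1.0)
scaled_freqs = rope_inv_freq(64, rope_ratio=10000.0)
```

Every scaled frequency (beyond index 0) is strictly smaller than its unscaled counterpart, so if training used `rope_ratio == 1` but inference uses `rope_ratio == 10000`, the position encodings would genuinely differ between the two stages.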