I'm confused about the learning rate scheduler used to train the main model. Is it the WSD scheduler mentioned in the paper? Is the scheduler applied to each stage separately or to the entire pretraining process?
Thanks
It applies to the entire pretraining run, so in total there are 2000 steps for warmup, 48000 steps for the stable stage, and 200 steps for decay. If you want to evaluate an intermediate checkpoint, say at 10k steps (which falls in the stable stage), it is recommended to run an additional decay stage from that checkpoint first, since it will give you better performance.
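To make the schedule concrete, here is a minimal sketch of a warmup-stable-decay learning rate with the step counts above. This is not the repository's actual implementation: the function names, the peak learning rate of `1e-3`, the linear warmup and decay shapes, and the final learning rate of `0.0` are all assumptions for illustration.

```python
def wsd_lr(step, peak_lr=1e-3, warmup_steps=2000, stable_steps=48000,
           decay_steps=200, min_lr=0.0):
    """Hypothetical WSD schedule spanning the whole pretraining run.

    peak_lr, min_lr, and the linear shapes are assumed, not taken
    from the paper or the repository.
    """
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    decay_start = warmup_steps + stable_steps
    if step < decay_start:
        # Stable: hold the learning rate constant at peak_lr.
        return peak_lr
    # Decay: ramp linearly from peak_lr down to min_lr, then stay there.
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr + (min_lr - peak_lr) * progress


def branched_decay_lr(step, branch_step, peak_lr=1e-3, decay_steps=200,
                      min_lr=0.0):
    """Hypothetical extra decay stage started from a stable-stage checkpoint.

    branch_step is the checkpoint step (for example 10000); the decay
    length is assumed to match the main schedule's decay stage.
    """
    progress = min(1.0, max(0.0, (step - branch_step) / decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress


if __name__ == "__main__":
    for s in (0, 1000, 2000, 10000, 49999, 50100, 50200):
        print(s, wsd_lr(s))
    # Decay branch for evaluating the 10k-step checkpoint.
    for s in (10000, 10100, 10200):
        print(s, branched_decay_lr(s, branch_step=10000))
```

Under these assumptions, the main run keeps the peak learning rate from step 2000 through step 49999, and the 10k-step evaluation uses a separate short branch that decays from the checkpoint rather than changing the main schedule.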