
Learning rate scheduler of the main model #2

Open
ZigeW opened this issue Sep 23, 2024 · 1 comment

Comments


ZigeW commented Sep 23, 2024

Hi,

I'm confused about the learning rate scheduler used to train the main model. Is it the WSD scheduler mentioned in the paper? And is the scheduler applied to each stage individually, or to the entire pretraining process?

Thanks

yuzc19 (Contributor) commented Sep 26, 2024

Hi! Yes, we use the WSD scheduler mentioned in the paper.

def get_wsd_lr(it: int) -> float:

It applies to the entire pretraining process, so in total we have 2000 steps for warmup, 48000 steps for the stable stage, and 200 steps for decay. But say you want to evaluate performance at the 10k-step checkpoint (which falls in the stable stage); in that case, it is recommended to run an additional decay stage, since that will give you better performance.
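For reference, a minimal sketch of what a warmup-stable-decay schedule with these step counts could look like is below. The step counts come from the reply above; the peak/final learning rates (`max_lr`, `min_lr`) and the linear decay shape are illustrative assumptions, not values taken from this repo:

```python
def get_wsd_lr(it: int) -> float:
    """Warmup-Stable-Decay (WSD) learning rate schedule (sketch).

    Step counts follow the reply above (2000 warmup / 48000 stable /
    200 decay). max_lr, min_lr, and the linear decay shape are
    hypothetical placeholders, not the repo's actual values.
    """
    warmup_steps, stable_steps, decay_steps = 2000, 48000, 200
    max_lr, min_lr = 3e-4, 3e-5  # assumed peak / final learning rates

    if it < warmup_steps:
        # Warmup: ramp linearly from 0 up to max_lr.
        return max_lr * (it + 1) / warmup_steps
    if it < warmup_steps + stable_steps:
        # Stable: hold the peak learning rate.
        return max_lr
    # Decay: anneal linearly from max_lr down to min_lr over decay_steps.
    progress = min((it - warmup_steps - stable_steps) / decay_steps, 1.0)
    return max_lr + progress * (min_lr - max_lr)
```

Calling `get_wsd_lr(step)` at each optimizer step would then reproduce the three stages described above, and the final branch is what an "additional decay stage" run from a mid-training checkpoint would look like.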
