Learning rate scheduler of the main model #2

ZigeW · 2024-09-23T03:10:30Z

Hi,

I'm confused about the learning rate scheduler used in training the main model. Is it the WSD scheduler mentioned in the paper? Is the learning rate scheduler applied to each stage separately or to the entire pretraining process?

Thanks

yuzc19 · 2024-09-26T02:29:50Z

Hi! Yes, we use the WSD scheduler described in the paper.

MATES/src/pretrain/pretrain.py, line 319 in e1c4fe1:

def get_wsd_lr(it: int) -> float:

It applies to the entire pretraining run, so in total we have 2000 steps for warmup, 48000 steps for the stable stage, and 200 steps for decay. If you want to evaluate an intermediate checkpoint, say at step 10k (which is in the stable stage), we recommend running an additional decay stage from that checkpoint first, since it will give you better performance.
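For illustration, a minimal sketch of a warmup-stable-decay schedule with these step counts might look like the following. It is not the repository's `get_wsd_lr`: the name `wsd_lr`, the peak and final learning rates, the linear warmup and linear decay shapes, and the `decay_start` parameter are all assumptions made for this example.

```python
def wsd_lr(
    step: int,
    decay_start: int = 50_000,   # 2000 warmup + 48000 stable steps (from the thread)
    warmup_steps: int = 2_000,   # from the thread
    decay_steps: int = 200,      # from the thread
    max_lr: float = 1e-3,        # assumed peak learning rate (hypothetical)
    min_lr: float = 1e-4,        # assumed final learning rate (hypothetical)
) -> float:
    """Hypothetical warmup-stable-decay (WSD) learning rate for a given step."""
    # Warmup: ramp linearly up to max_lr (linear shape is an assumption).
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # Stable: hold the peak learning rate until the decay stage begins.
    if step < decay_start:
        return max_lr
    # Decay: go linearly from max_lr to min_lr over decay_steps,
    # then stay at min_lr (linear shape is an assumption).
    progress = min(1.0, (step - decay_start) / decay_steps)
    return max_lr + (min_lr - max_lr) * progress


# Full run: decay begins after the 48000-step stable stage.
full_run_lr = wsd_lr(50_100)

# Evaluating the 10k-step checkpoint: branch off an extra decay stage there.
branched_lr = wsd_lr(10_100, decay_start=10_000)
```

Passing `decay_start=10_000` mimics the recommendation above: the 10k checkpoint comes from the stable stage, and a short decay stage is appended to it before evaluation instead of waiting for the decay at the end of the full run.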
