Hi,
I have two questions regarding your code:

1. What is the batch size for pretraining? Table 5 in the original paper lists 512, but in the code the batch size is set to 64 and the micro batch size to 16.
2. How many pretraining steps are used in Table 1, 10k or 50k? If more pretraining steps are used, e.g. 50k, how many of them are taken in the first pretraining stage on randomly selected data, and how many in the later pretraining stages on selected data?
Hope you can help me figure them out. Thank you very much.
Thanks for your questions. For 1, we use 8 GPUs in our main experiments, as stated here. The Lightning package we use syncs gradients across all GPUs under the default DDP strategy, so the 64 you see in the code is the per-GPU batch size, and the total global batch size is 8 × 64 = 512. If you are using fewer GPUs, you may need to change the 64 to match your configuration.
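As a minimal sketch (the variable names below are illustrative, not the repo's actual ones; it assumes the standard Lightning DDP setup described above), the effective batch size is the per-device loader batch size multiplied by the number of devices:

```python
# Illustrative only: how a per-GPU batch size of 64 becomes a global batch
# size of 512 under Lightning's default DDP strategy (one process per GPU,
# gradients averaged across processes after each step).
NUM_DEVICES = 8             # GPUs used in the main experiments
PER_DEVICE_BATCH_SIZE = 64  # the "64" set in the code, applied per GPU

global_batch_size = NUM_DEVICES * PER_DEVICE_BATCH_SIZE
assert global_batch_size == 512  # matches Table 5 in the paper

# With fewer GPUs you would raise the per-device value to keep 512, e.g.
# 4 GPUs -> DataLoader(..., batch_size=128) with Trainer(devices=4, strategy="ddp").
```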
For 2, in Table 1 we train for 50k steps in total for all methods: 10k steps in the first pretraining stage and 10k steps in each subsequent model-aware pretraining stage, so there are 4 model-aware stages in total.
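For concreteness, a small sketch of that step budget (assuming only the stage lengths stated above; the names are illustrative):

```python
# Illustrative breakdown of the 50k-step budget used for Table 1.
TOTAL_STEPS = 50_000
STEPS_PER_STAGE = 10_000

first_stage_steps = STEPS_PER_STAGE                                   # randomly selected data
model_aware_stages = (TOTAL_STEPS - first_stage_steps) // STEPS_PER_STAGE
assert model_aware_stages == 4                                        # 10k + 4 * 10k = 50k

schedule = ["random"] + ["model_aware"] * model_aware_stages
print(schedule)  # ['random', 'model_aware', 'model_aware', 'model_aware', 'model_aware']
```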