Hi,
I have two questions regarding your code:

1. What is the batch size for pretraining? Table 5 in the original paper lists 512, but in the code the batch size is set to 64 and the micro batch size to 16.
2. How many pretraining steps are used in Table 1, 10k or 50k? If more pretraining steps are used, e.g. 50k, how many of them are taken in the first pretraining stage on randomly selected data, and how many in the later pretraining stages on selected data?
Hope you can help me figure them out. Thank you very much.
Thanks for your questions. For 1, we use 8 GPUs in our main experiments, as stated here. The Lightning package we use syncs gradients across all GPUs under the default DDP strategy, so the 64 you see in the code is the per-GPU batch size, and the total global batch size is 8 × 64 = 512. If you are using fewer GPUs, you may need to change the 64 to match your configuration.
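As a minimal sketch (the variable names below are illustrative, not the repo's actual ones; it assumes the standard Lightning DDP setup described above), the effective batch size is the per-device loader batch size multiplied by the number of devices:

```python
# Illustrative only: how a per-GPU batch size of 64 becomes a global batch
# size of 512 under Lightning's default DDP strategy (one process per GPU,
# gradients averaged across processes after each step).
NUM_DEVICES = 8             # GPUs used in the main experiments
PER_DEVICE_BATCH_SIZE = 64  # the "64" set in the code, applied per GPU

global_batch_size = NUM_DEVICES * PER_DEVICE_BATCH_SIZE
assert global_batch_size == 512  # matches Table 5 in the paper

# With fewer GPUs you would raise the per-device value to keep 512, e.g.
# 4 GPUs -> DataLoader(..., batch_size=128) with Trainer(devices=4, strategy="ddp").
```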
For 2, in Table 1 we train for 50k steps in total for all methods: 10k steps in the first pretraining stage and 10k steps in each subsequent model-aware pretraining stage, so there are 4 model-aware stages in total.
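For concreteness, a small sketch of that step budget (assuming only the stage lengths stated above; the names are illustrative):

```python
# Illustrative breakdown of the 50k-step budget used for Table 1.
TOTAL_STEPS = 50_000
STEPS_PER_STAGE = 10_000

first_stage_steps = STEPS_PER_STAGE                                   # randomly selected data
model_aware_stages = (TOTAL_STEPS - first_stage_steps) // STEPS_PER_STAGE
assert model_aware_stages == 4                                        # 10k + 4 * 10k = 50k

schedule = ["random"] + ["model_aware"] * model_aware_stages
print(schedule)  # ['random', 'model_aware', 'model_aware', 'model_aware', 'model_aware']
```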