A repository containing a beta implementation of SLTrain: a sparse plus low-rank approach for parameter and memory efficient pretraining, accepted at NeurIPS 2024. The preprint is available at http://arxiv.org/abs/2406.02214.
The main idea is to re-parameterize each linear layer with low-rank and sparse factors for improved parameter and memory efficiency:
W = BA + S,
where B and A model the low-rank component and S models the sparse component. The sparsity pattern of S is chosen at random.
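For illustration, a minimal PyTorch sketch of such a re-parameterized layer might look as follows. This is a toy version that materializes S densely for readability; it is not the repository's implementation (see the extension under ./sparse-lora below), and the module and argument names here are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseLowRankLinear(nn.Module):
    """Toy layer computing W = B @ A + S with a fixed random support for S."""

    def __init__(self, in_features, out_features, rank=128, sp_ratio=0.03):
        super().__init__()
        # Low-rank factors: B is (out x rank), A is (rank x in), so B @ A is (out x in).
        # Random scaled initialization here is for illustration only.
        self.B = nn.Parameter(torch.randn(out_features, rank) / rank**0.5)
        self.A = nn.Parameter(torch.randn(rank, in_features) / in_features**0.5)
        # Random, fixed sparsity pattern: only a sp_ratio fraction of S is trainable.
        n_total = out_features * in_features
        n_nonzero = int(sp_ratio * n_total)
        self.register_buffer("support", torch.randperm(n_total)[:n_nonzero])
        self.sparse_values = nn.Parameter(torch.zeros(n_nonzero))
        self.out_features, self.in_features = out_features, in_features

    def forward(self, x):
        # Scatter the trainable non-zeros of S onto its fixed random support.
        S = torch.zeros(self.out_features * self.in_features,
                        device=x.device, dtype=x.dtype)
        S[self.support] = self.sparse_values.to(x.dtype)
        # W = B @ A + S, then a standard linear transform.
        W = self.B @ self.A + S.view(self.out_features, self.in_features)
        return F.linear(x, W)

The dense materialization of S above is only for clarity and does not reflect the memory savings of the actual implementation.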
Below, we show how the learned weights BA + S enlarge the singular value spectrum. In particular, the low-rank component BA primarily learns the head of the singular value spectrum, while the sparse component S primarily learns the tail.
Build the C++ extensions via:
cd ./sparse-lora
pip install .
Run the scripts provided in scripts/llm_pretrain/. Typical usage:
torchrun --standalone --nproc_per_node 1 torchrun_main.py \
--model_config configs/llama_60m.json \
--lr 0.003 \
--peft_model sltrain \
--optimizer adamw \
--rank 128 \
--sp_ratio 0.03 \
--batch_size 256 \
--total_batch_size 512 \
--num_training_steps 11000 \
--warmup_steps 1100 \
--weight_decay 0 \
--dtype bfloat16 \
--eval_every 1000 \
--lora_alpha 32
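In the command above, --rank sets the rank of the low-rank factors B and A, and --sp_ratio sets the sparsity ratio (the fraction of trainable entries in S, the "sparsity delta"). As a rough, hypothetical back-of-the-envelope check of the parameter savings for a single m x n layer (illustrative sizes, not figures from the paper):

# Rough parameter count for one m x n linear layer under W = BA + S.
def param_counts(m, n, rank, sp_ratio):
    dense = m * n                       # full weight matrix
    low_rank = (m + n) * rank           # B: m x rank, A: rank x n
    sparse = int(sp_ratio * m * n)      # trainable non-zeros of S
    return dense, low_rank + sparse

dense, sl = param_counts(m=2048, n=2048, rank=128, sp_ratio=0.03)
print(f"dense: {dense:,}  sparse + low-rank: {sl:,}  fraction: {sl / dense:.2f}")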
@inproceedings{han2024sltrain,
  title={{SLTrain}: a sparse plus low-rank approach for parameter and memory efficient pretraining},
  author={Han, Andi and Li, Jiaxiang and Huang, Wei and Hong, Mingyi and Takeda, Akiko and Jawanpuria, Pratik and Mishra, Bamdev},
  booktitle={Advances in Neural Information Processing Systems},
  volume={37},
  year={2024}
}