Welcome to MoM! This repository provides the implementation of MoM: Linear Sequence Modeling with Mixture-of-Memories, built on the Hugging Face ecosystem. MoM is compatible with various linear sequence modeling methods, such as linear attention, SSMs, and linear RNNs. An introductory article about MoM (in Chinese) is available on Zhihu.
The following requirements should be satisfied:
- PyTorch >= 2.5
- Triton >= 3.0
- einops
- transformers >= 4.45.0
- datasets >= 3.3.0
- causal-conv1d >= 1.4.0
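If any of these are missing, they can be installed up front; a minimal sketch (the version pins simply mirror the lower bounds listed above):
# install the runtime dependencies; adjust pins to your CUDA/toolchain setup
pip install "torch>=2.5" "triton>=3.0" einops "transformers>=4.45.0" "datasets>=3.3.0" "causal-conv1d>=1.4.0"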
Install the package from source:
pip install -e .
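To quickly confirm that your environment matches the requirements above, a simple sanity check (not part of the official setup) is:
# print the installed versions of the key dependencies
python -c "import torch, triton, transformers, datasets, einops; print(torch.__version__, triton.__version__, transformers.__version__, datasets.__version__)"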
Before training, make sure to preprocess your data by following the steps outlined in training/README.md.
To start training with the default setup, simply run:
cd training
bash train.sh \
nodes=4 \
gpus=8 \
type=mom \
lr=3e-4 \
steps=30720 \
batch=8 \
update=1 \
warmup=1024 \
context=2048 \
path=SlimPajama/mom-15B \
project=SlimPajama \
model=configs/mom_340M.json \
tokenizer=fla-hub/gla-1.3B-100B \
data=SlimPajama-627B \
cache=data/chunk1/train
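Before launching the full run, you may want a quick single-node smoke test. A sketch with scaled-down values (illustrative only; it assumes train.sh accepts the same arguments, and the output path name is just an example):
bash train.sh \
nodes=1 \
gpus=1 \
type=mom \
lr=3e-4 \
steps=1000 \
batch=8 \
update=1 \
warmup=100 \
context=2048 \
path=SlimPajama/mom-smoke-test \
project=SlimPajama \
model=configs/mom_340M.json \
tokenizer=fla-hub/gla-1.3B-100B \
data=SlimPajama-627B \
cache=data/chunk1/train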
You can also:
- Modify the script to adjust the modeling and training settings; e.g., edit examples/configs/mom_340M.json to change the MoM model structure (see the example below).
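For example, to experiment with a different model structure without touching the default config (the copy's filename is just an example):
# keep the default config intact and edit a copy instead
cp examples/configs/mom_340M.json examples/configs/mom_340M_custom.json
# then point the model= argument of train.sh at the new file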
To evaluate model checkpoints on commonsense reasoning benchmarks, we recommend running:
MODEL_PATH=training/SlimPajama/mom-15B/checkpoint-30720
accelerate launch --multi_gpu evals/harness.py --model hf \
--model_args pretrained=$MODEL_PATH,dtype=bfloat16 \
--tasks arc_easy,arc_challenge,hellaswag,lambada_standard,piqa,winogrande,wikitext \
--output_path eval_results \
--batch_size 32 \
--device cuda
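On a single GPU, the same evaluation can usually be launched without the --multi_gpu flag; a sketch (assuming evals/harness.py accepts the same arguments as above, with a smaller task subset for a faster run):
MODEL_PATH=training/SlimPajama/mom-15B/checkpoint-30720
accelerate launch evals/harness.py --model hf \
--model_args pretrained=$MODEL_PATH,dtype=bfloat16 \
--tasks piqa,hellaswag \
--output_path eval_results \
--batch_size 32 \
--device cuda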
To evaluate model checkpoints on recall-intensive tasks, we recommend the following:
- Install lm_eval:
cd lm-eval-harness
pip install -e .
- Run the script:
MODEL_PATH=../training/SlimPajama/mom-15B/checkpoint-30720
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python launch_local.py \
--batch-size 32 \
-t based_squad \
-t based_swde \
-t based_fda \
-t based_drop \
-t based_triviaqa \
-t based_nq_2048 \
-m $MODEL_PATH \
--context_length 2048 \
--answer_length 48 \
--cutting_context \
--limit -1 \
-p
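To run only a subset of the recall tasks, keep just the -t flags you need; for example, with the same arguments as above and a single task:
MODEL_PATH=../training/SlimPajama/mom-15B/checkpoint-30720
CUDA_VISIBLE_DEVICES=0 python launch_local.py \
--batch-size 32 \
-t based_squad \
-m $MODEL_PATH \
--context_length 2048 \
--answer_length 48 \
--cutting_context \
--limit -1 \
-p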
This repo builds on the open-source flash-linear-attention, and the evaluation code is based on prefix-linear-attention. Happy experimenting! 🔥🚀🔥
If you find this repo useful, please consider citing our paper:
@article{du2025mom,
  title={MoM: Linear Sequence Modeling with Mixture-of-Memories},
  author={Du, Jusen and Sun, Weigao and Lan, Disen and Hu, Jiaxi and Cheng, Yu},
  journal={arXiv preprint arXiv:2502.13685},
  year={2025}
}