Tutorial: Inference-Time Alignment in Diffusion Models for Protein Design

This code accompanies the tutorial paper on inference-time alignment in diffusion models. The objective is to optimize multiple reward functions within a protein inverse folding model $p(x|c)$, where $x$ represents a sequence and $c$ denotes a backbone structure. For related settings, refer to the corresponding code for small molecules or images.

We employ an inverse folding model (mapping backbone structure to sequence) based on a discrete diffusion model as the foundational model. In this repository, we detail the process of optimizing various downstream reward functions in this diffusion model using inference-time techniques.

How to Run

Go to the ./fmif folder. The inference-time technique can then be run as follows.

CUDA_VISIBLE_DEVICES=1 python eval_finetune.py --decoding 'SVDD' --reward_name 'LDDT'  --repeatnum 10 --batchsize 5
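When several runs are needed, it can help to assemble the command line programmatically. The following is a minimal sketch that only reuses the script name and flags from the example command above; the helper names `build_command` and `as_shell_line` and the sweep values are hypothetical and not part of the repository.

```python
import shlex


def build_command(decoding, reward_name, repeatnum, batchsize, gpu="1"):
    """Return (env, argv) for one run, mirroring the example command above."""
    argv = [
        "python", "eval_finetune.py",
        "--decoding", decoding,
        "--reward_name", reward_name,
        "--repeatnum", str(repeatnum),
        "--batchsize", str(batchsize),
    ]
    env = {"CUDA_VISIBLE_DEVICES": gpu}
    return env, argv


def as_shell_line(env, argv):
    """Format the environment prefix and arguments as a single shell line."""
    prefix = " ".join(f"{key}={shlex.quote(val)}" for key, val in env.items())
    return f"{prefix} {shlex.join(argv)}"


if __name__ == "__main__":
    # Hypothetical sweep over the duplication hyperparameter.
    for n in (5, 10, 20):
        env, argv = build_command("SVDD", "LDDT", repeatnum=n, batchsize=5)
        print(as_shell_line(env, argv))
```

The printed lines can be pasted into a shell from the ./fmif folder, or `argv` and `env` can be passed to `subprocess.run` after merging `env` into `os.environ`.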

--decoding:
- SMC: Refer to Sec. 3.1 or the related papers.
- SVDD (a.k.a. value-based sampling): Refer to Sec. 3.2 or the SVDD paper.
- NestedIS: Refer to Sec. 3.3.
- Classifier guidance: Refer to Sec. 5.2 or the related papers.
--rewards:
- stability: This is a reward function trained in Wang and Uehara et al., 2024 on the Megascale dataset; it predicts the Gibbs free energy from a sequence and a structure. For details, refer to the code.
- pLDDT: A common metric that characterizes the confidence of a structure prediction. It has been used as a proxy for stability.
- scRMSD: $\| c - f(\hat{x}) \|$, where $f$ is a forward folding model (ESMFold). While the pre-trained model is already a conditional diffusion model, this reward is considered useful for making the generated proteins more robust.
- ... (more will be added)
--repeat_num: When using SMC, SVDD, and Nested IS, we need to choose the duplication hyperparameter.
--batchsize: Batch size
--alpha: We set this as $0.5$ in SMC and classifier guidance by default. For SVDD, we choose $0.0$ by default.
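To make the roles of the duplication hyperparameter and of alpha concrete, here is a minimal toy sketch of value-based selection in the spirit of SVDD: at each step, several candidates are proposed, each is scored, and one is kept. It assumes that alpha = 0 means greedy (argmax) selection and that alpha > 0 means sampling with weights proportional to exp(value / alpha). The functions `toy_propose` and `toy_value` are hypothetical stand-ins for the pre-trained diffusion model and the reward; this is not the repository's implementation.

```python
import math
import random

ALPHABET = "ACDE"  # toy alphabet, not the real amino-acid vocabulary


def select_candidate(values, alpha, rng):
    """Pick one index: argmax if alpha == 0, else sample with weights exp(value / alpha)."""
    if alpha == 0:
        return max(range(len(values)), key=lambda i: values[i])
    top = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - top) / alpha) for v in values]
    return rng.choices(range(len(values)), weights=weights, k=1)[0]


def value_based_decoding(propose, value, x0, num_steps, repeatnum, alpha, seed=0):
    """Run num_steps steps, keeping one of repeatnum proposals at each step."""
    rng = random.Random(seed)
    x = x0
    for t in range(num_steps):
        candidates = [propose(x, t, rng) for _ in range(repeatnum)]
        scores = [value(c) for c in candidates]
        x = candidates[select_candidate(scores, alpha, rng)]
    return x


def toy_propose(x, t, rng):
    """Hypothetical proposal: unmask position t with a uniformly random letter."""
    return x[:t] + rng.choice(ALPHABET) + x[t + 1:]


def toy_value(x):
    """Hypothetical reward: number of 'A' characters in the sequence."""
    return x.count("A")


if __name__ == "__main__":
    for n in (1, 10):
        seq = value_based_decoding(toy_propose, toy_value, "____", 4, n, alpha=0.0)
        print(n, seq, toy_value(seq))
```

With a single candidate per step the procedure reduces to plain sampling from the proposal; larger candidate counts give the selection step more options to choose from.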

Outputs

We condition on several wild-type backbone structures from the validation protein datasets. Each generated protein is saved as a pdb file in the folder ./sc_tmp/. We also record several important statistics in a pandas format in the folder ./log.

Results

Each blue point corresponds to the median RMSD of the generated samples for one backbone structure. For example, when optimizing scRMSD, naive inference procedures yield inconsistent results for some proteins, whereas the inference-time technique makes the generated results much more consistent with the forward folding model.
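The per-backbone aggregation described above can be sketched as follows. The input format, a list of (backbone ID, RMSD) pairs, and the function name `median_rmsd_per_backbone` are assumptions for illustration; the actual log files in ./log may be organized differently.

```python
from collections import defaultdict
from statistics import median


def median_rmsd_per_backbone(records):
    """Group RMSD values by backbone ID and return the median for each group."""
    grouped = defaultdict(list)
    for backbone_id, rmsd in records:
        grouped[backbone_id].append(rmsd)
    return {backbone_id: median(vals) for backbone_id, vals in grouped.items()}


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    records = [("bb1", 1.0), ("bb1", 3.0), ("bb1", 2.0), ("bb2", 0.5), ("bb2", 1.5)]
    print(median_rmsd_per_backbone(records))
```

Each value in the returned dictionary would correspond to one point in such a plot.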

Inference-Time Scaling Law

The performance improves as the computational budget increases. The following illustrates a case where the beam width increases when running value-based beam search (SVDD). While this increases computational time, it leads to a significant improvement in performance.
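The intuition behind this trend can be illustrated with a toy simulation, assuming that candidate rewards are independent standard Gaussian draws and that the best of `width` candidates is kept. This is a simplified stand-in, not the repository's experiment; the names `best_of_width` and `average_best_reward` are hypothetical.

```python
import random
from statistics import mean


def best_of_width(width, rng):
    """Return the best reward among `width` independent toy candidates."""
    return max(rng.gauss(0.0, 1.0) for _ in range(width))


def average_best_reward(width, trials=2000, seed=0):
    """Average the best-of-width reward over many independent trials."""
    rng = random.Random(seed)
    return mean(best_of_width(width, rng) for _ in range(trials))


if __name__ == "__main__":
    for width in (1, 5, 10, 20):
        print(width, round(average_best_reward(width), 3))
```

In this toy setting the average selected reward grows with the width, while the number of reward evaluations grows linearly, which mirrors the trade-off between compute and performance described above.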

Installation

The pre-trained model is based on the multiflow code (Campbell & Yim et al., 2024).
To download the weights of the pre-trained models, run

python download_model_data.py

The dataset will then be placed in the folder ./datasets.

To calculate the energy, PyRosetta needs to be installed.
Note that our code also builds on ESMFold, OpenFold, and ProteinMPNN.

Citation

If you use this codebase, then please cite

@misc{uehara2025rewardguidedcontrolledgenerationinferencetime,
      title={Reward-Guided Controlled Generation for Inference-Time Alignment in Diffusion Models: Tutorial and Review}, 
      author={Masatoshi Uehara and Yulai Zhao and Chenyu Wang and Xiner Li and Aviv Regev and Sergey Levine and Tommaso Biancalani},
      year={2025},
      eprint={2501.09685},
      url={https://arxiv.org/abs/2501.09685}, 
}
