ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way


Shanghai Jiao Tong University, University of Science and Technology of China,
The Chinese University of Hong Kong, Shanghai Artificial Intelligence Laboratory
(*Equal Contribution) (Corresponding Author)

CVPR 2025

arXiv


ByTheWay provides a training-free and plug-and-play option to enhance the overall quality of current T2V backbones.


📖 Click for the full abstract of ByTheWay

Text-to-video (T2V) generation models, which offer convenient visual creation, have recently garnered increasing attention. Despite their substantial potential, the generated videos may present artifacts such as structural implausibility, temporal inconsistency, and a lack of motion, often resulting in near-static video. In this work, we identify a correlation between the disparity of temporal attention maps across different blocks and the occurrence of temporal inconsistencies. Additionally, we observe that the energy contained within the temporal attention maps is directly related to the magnitude of motion in the generated videos. Based on these observations, we present ByTheWay, a training-free method to improve the quality of text-to-video generation without introducing additional parameters, increasing memory usage, or extending sampling time. Specifically, ByTheWay is composed of two principal components: 1) Temporal Self-Guidance improves the structural plausibility and temporal consistency of generated videos by reducing the disparity between the temporal attention maps across various decoder blocks. 2) Fourier-based Motion Enhancement enhances the magnitude and richness of motion by amplifying the energy of the temporal attention map. Extensive experiments demonstrate that ByTheWay significantly improves the quality of text-to-video generation with negligible additional cost.

🎈Demo Video

bytheway.mp4

🖋 News

  • ByTheWay accepted to CVPR 2025! (2025.2.27)
  • Code released! (2024.10.14)
  • Paper (v1) is available on arXiv! (2024.10.8)

🏗️ Todo

  • 🚀 Release the ByTheWay code
  • 🚀 Release paper

💻 ByTheWay Code

import torch

# ---------- compute the energy of temporal attention map ----------
def compute_energy(attn_prob):
    # attn_prob: temporal attention map with shape [2 * B * H * W * heads, F, F];
    # the leading factor 2 comes from classifier-free guidance, and `heads` (the
    # number of attention heads of the module) is assumed to be defined in the
    # enclosing scope.
    num_frames = attn_prob.shape[-1]
    attn_prob_copy = attn_prob.reshape(2, -1, heads, num_frames, num_frames)
    energy = (attn_prob_copy[-1] ** 2).mean(dim=1).mean(dim=1).sum(dim=-1).mean(dim=0) # formula (4) in paper
    return energy

# ---------- split high-frequency and low-frequency energy ----------
def split_freq(attn_prob, tau):
    num_frames = attn_prob.shape[-1]
    seq_index = num_frames // 2
    attn_prob_dft = torch.fft.fft(attn_prob, dim=-1)
    # the 2 * tau + 1 DFT bins centered at num_frames // 2 are the highest-frequency components
    high_freq_indices = [idx for idx in range(num_frames) if seq_index - tau <= idx <= seq_index + tau]
    low_freq_indices = [idx for idx in range(num_frames) if idx not in high_freq_indices]
    assert len(high_freq_indices) == 2 * tau + 1

    high_freq = attn_prob_dft[..., high_freq_indices]
    low_freq = attn_prob_dft[..., low_freq_indices]

    high_freq_abs = torch.abs(high_freq)
    low_freq_abs = torch.abs(low_freq)

    high_freq_abs = high_freq_abs.reshape(2, -1, num_frames, len(high_freq_indices))
    low_freq_abs = low_freq_abs.reshape(2, -1, num_frames, len(low_freq_indices))

    # high- and low-frequency energy of the conditional branch (division by num_frames follows Parseval's theorem)
    Eh = (high_freq_abs[-1] ** 2).sum(dim=-1).mean(dim=0).mean(dim=0) / num_frames
    El = (low_freq_abs[-1] ** 2).sum(dim=-1).mean(dim=0).mean(dim=0) / num_frames

    return Eh, El

# ---------- frequency component manipulation ----------
def motion_enhance(attn_prob, tau, beta):
    num_frames = attn_prob.shape[-1]
    seq_index = num_frames // 2
    attn_prob_dft = torch.fft.fft(attn_prob, dim=-1)
    # scale the 2 * tau + 1 highest-frequency bins by beta
    high_freq_indices = [idx for idx in range(num_frames) if seq_index - tau <= idx <= seq_index + tau]
    assert len(high_freq_indices) == 2 * tau + 1

    high_freq = attn_prob_dft[..., high_freq_indices]
    high_freq_scaled = high_freq * beta
    attn_prob_dft[..., high_freq_indices] = high_freq_scaled

    attn_prob_scaled = torch.fft.ifft(attn_prob_dft, dim=-1).real

    # re-normalize each row so the attention weights still sum to 1
    sum_dim = attn_prob_scaled.sum(dim=-1, keepdim=True)
    attn_prob_scaled /= sum_dim
    attn_prob_scaled = attn_prob_scaled.reshape(-1, num_frames, num_frames)

    return attn_prob_scaled


# attention_probs (shape [B * H * W, F, F]) is the softmax attention matrix of the current temporal self-attention module
# attention_probs_up1 is the temporal attention map obtained from the corresponding module in up_blocks.1
# alpha, beta and tau are the ByTheWay hyper-parameters

# ---------- Temporal Self Guidance ----------
E1 = compute_energy(attention_probs)
attention_probs_up1 = interpolate(attention_probs_up1) # spatially upsample the up_blocks.1 map to match the current block (see the sketch below)
attention_probs = attention_probs + alpha * (attention_probs_up1 - attention_probs) # formula (3) in paper
E2 = compute_energy(attention_probs)

# ---------- Fourier-based Motion Enhancement ----------
E2_h, E2_l = split_freq(attention_probs, tau=tau)
beta_c = torch.sqrt((E1 - E2_l) / E2_h)
beta = max(beta, beta_c) # formula (7) in paper
attention_probs = motion_enhance(attention_probs, tau=tau, beta=beta) # formula (6) in paper
E3 = compute_energy(attention_probs)

# ByTheWay operations are performed in the first 20% of denoising steps, in every temporal attention module of up_blocks.1/2/3
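The snippet above leaves two pieces to the host pipeline: the spatial `interpolate` used by Temporal Self-Guidance and the logic that restricts ByTheWay to the first 20% of denoising steps in up_blocks.1/2/3. The rationale behind formula (7) can be read off the code: scaling the high-frequency band by $\beta$ scales its energy $E_2^h$ by $\beta^2$, so $\beta_c = \sqrt{(E_1 - E_2^l)/E_2^h}$ is the smallest scale for which the enhanced energy $E_2^l + \beta^2 E_2^h$ reaches the pre-guidance energy $E_1$, and taking $\max(\beta, \beta_c)$ guarantees the motion energy does not fall below it. Below is a minimal sketch of the two missing helpers, not part of the official code: it assumes a diffusers-style AnimateDiff UNet whose temporal attention flattens tokens in (batch, height, width) order with classifier-free guidance already folded into the batch, and the names interpolate_attn_map and bytheway_active are our own.

import torch
import torch.nn.functional as F

def interpolate_attn_map(attn_up1, batch, h_src, w_src, h_dst, w_dst):
    # attn_up1: [batch * h_src * w_src, F, F] temporal attention map saved from up_blocks.1.
    # Upsample it over the spatial grid so it matches the [batch * h_dst * w_dst, F, F]
    # map of the current (higher-resolution) decoder block.
    f = attn_up1.shape[-1]
    x = attn_up1.reshape(batch, h_src, w_src, f * f).permute(0, 3, 1, 2)  # [batch, F*F, h_src, w_src]
    x = F.interpolate(x, size=(h_dst, w_dst), mode="nearest")             # nearest copies rows, so they remain valid distributions
    return x.permute(0, 2, 3, 1).reshape(batch * h_dst * w_dst, f, f)

def bytheway_active(step_index, num_inference_steps, module_name):
    # Apply ByTheWay only during the first 20% of denoising steps and only in the
    # temporal attention modules of up_blocks.1/2/3, as stated above.
    early = step_index < 0.2 * num_inference_steps
    in_up_block = any(tag in module_name for tag in ("up_blocks.1", "up_blocks.2", "up_blocks.3"))
    return early and in_up_block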

🔧 ByTheWay Parameters

Here we provide some reference configurations for the ByTheWay parameters; you can adjust them based on your own model and task requirements:

AnimateDiff:

$\tau = 7, \alpha = 0.6, \beta = 1.5$

VideoCrafter2:

$\tau = 7, \alpha = 0.1, \beta = 10$
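For convenience, these reference settings can be kept in a small lookup table. The sketch below is just one way to organize them; the BYTHEWAY_CONFIGS name and its keys are our own and not part of the official code.

# Hypothetical helper: reference ByTheWay hyper-parameters from the table above.
BYTHEWAY_CONFIGS = {
    "animatediff":   {"tau": 7, "alpha": 0.6, "beta": 1.5},
    "videocrafter2": {"tau": 7, "alpha": 0.1, "beta": 10.0},
}

cfg = BYTHEWAY_CONFIGS["animatediff"]
tau, alpha, beta = cfg["tau"], cfg["alpha"], cfg["beta"]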

📎 Citation

If you find this work helpful, please cite the following paper:

@article{bu2024broadway,
  title={Broadway: Boost your text-to-video generation model in a training-free way},
  author={Bu, Jiazi and Ling, Pengyang and Zhang, Pan and Wu, Tong and Dong, Xiaoyi and Zang, Yuhang and Cao, Yuhang and Lin, Dahua and Wang, Jiaqi},
  journal={arXiv preprint arXiv:2410.06241},
  year={2024}
}

📣 Disclaimer

This is the official code of ByTheWay. The copyrights of the demo images and audio belong to their community creators. Feel free to contact us if you would like them removed.

💞 Acknowledgements

Our code is built upon the repositories below; we thank all the contributors for open-sourcing their work.
