Is DeepGEMM directly applicable to backward in training? #10

Open
YouJiacheng opened this issue Feb 26, 2025 · 4 comments

Comments

@YouJiacheng

The backward of a GEMM is two GEMMs, but I wonder whether I need to take special care of the range of the gradients?
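For reference, the two GEMMs are the data gradient (DGRAD) and the weight gradient (WGRAD). A minimal PyTorch sketch, assuming a plain `Y = X @ W` layer (names here are illustrative, not DeepGEMM's API):

```python
import torch

# Forward: Y = X @ W, with X of shape (m, k) and W of shape (k, n).
m, k, n = 256, 512, 384
X = torch.randn(m, k, requires_grad=True)
W = torch.randn(k, n, requires_grad=True)
Y = X @ W

# Backward: given the incoming gradient dY, the two GEMMs are:
dY = torch.randn(m, n)
dX = dY @ W.T   # DGRAD: gradient w.r.t. the activations, shape (m, k)
dW = X.T @ dY   # WGRAD: gradient w.r.t. the weights, shape (k, n)

# Sanity check against autograd.
Y.backward(dY)
assert torch.allclose(dX, X.grad, atol=1e-4)
assert torch.allclose(dW, W.grad, atol=1e-4)
```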

@YouJiacheng
Author

Oh, so we must write a quantization kernel that produces the correct lhs_scales and rhs_scales.

@zhipeng93

We need an fp8 GEMM with 128x1 LHS scaling and 1x128 RHS scaling?
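If I read the scale layout right, 1x128 means one scale per contiguous group of 128 elements along the inner (K) dimension, and 128x1 is the same thing on the transposed operand. A rough PyTorch sketch of such groupwise quantization (my own helper, not DeepGEMM's quantization kernel):

```python
import torch

FP8_MAX = 448.0  # largest finite value of float8_e4m3fn

def quantize_1x128(x: torch.Tensor):
    """Groupwise FP8 quantization: one scale per 1x128 tile along the last dim.

    Assumes x.shape[-1] is a multiple of 128; returns the FP8 tensor plus a
    scale tensor with one entry per group.
    """
    *lead, k = x.shape
    groups = x.reshape(*lead, k // 128, 128)
    scales = groups.abs().amax(dim=-1).clamp(min=1e-4) / FP8_MAX
    q = (groups / scales.unsqueeze(-1)).to(torch.float8_e4m3fn)
    return q.reshape(*lead, k), scales

# 128x1 scaling on the other operand is the same idea applied to its transpose.
x = torch.randn(64, 512)
x_fp8, x_scales = quantize_1x128(x)   # x_scales has shape (64, 4)
```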

@LyricZhao
Collaborator

We provide this library mainly for inference, so it only supports DGRAD, not WGRAD.

In my understanding, WGRAD support needs more than a GEMM kernel: it also needs some utility fused kernels (e.g. transposing fused with casting, fused with SwiGLU, fused with the MoE layout). We want this library to be clean, so we didn't open-source them.

We may release the WGRAD kernel later; we will discuss it internally :)
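(For context, a rough sketch in plain PyTorch of the unfused WGRAD path that such fused utility kernels would replace; `wgrad_reference` is an illustrative name, not part of DeepGEMM:)

```python
import torch

def wgrad_reference(x: torch.Tensor, dy: torch.Tensor) -> torch.Tensor:
    """Unfused WGRAD reference: dW = X^T @ dY.

    X is laid out (m, k) for the forward pass, so WGRAD needs an explicit
    transpose to (k, m); in an FP8 pipeline that transpose also forces a fresh
    cast/quantization along the new inner dimension, which is why a fused
    transpose-and-cast kernel is wanted on top of the GEMM itself.
    """
    xt = x.T.contiguous()   # extra global-memory pass a fused kernel would hide
    # (A real pipeline would quantize xt and dy to FP8 with groupwise scales
    #  here and call the FP8 GEMM; this reference stays in float32 for clarity.)
    return xt @ dy          # (k, m) @ (m, n) -> (k, n)
```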

@YouJiacheng
Author

Thank you Chenggang!
Yup, I forgot that WGRAD needs to transpose matrices.
