
[Clarification] Regarding KV Cache quantization and FP8 Scales #1104

nelaturuharsha opened this issue Jan 27, 2025 · 1 comment

@nelaturuharsha

Hello, thanks for the great library, which has been extremely useful.

I had a few questions regarding KV Cache quantization:

(1) Do the k/v scales ever get updated if kv_cache_dtype = 'fp8' is simply passed to vLLM, without calibration via a recipe? (See the snippet below for the kind of usage I mean.)
(2) Is the logic for applying the FP8 scales during quantization/dequantization of the KV cache handled in compressed-tensors, or is there a modification in vLLM that ensures they are applied? [Any reference would be great; I believe scales loaded via JSON are what the vLLM docs discuss.]
(3) Do you have any advice on the effect/impact of needing calibrated scales for the KV cache in the first place? Have you noticed any impact when KV cache quantization is used in combination with static FP8 activation quantization or dynamic activation quantization?
(4) Would the right work-stream for implementing more fine-grained scales be compressed-tensors + HF (considering that vLLM only supports per-tensor scales)?
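
For context, by (1) I mean usage roughly along these lines; this is a minimal sketch, and the model name, prompt, and sampling settings are purely illustrative:

```python
from vllm import LLM, SamplingParams

# Quantize the KV cache to FP8 at runtime, with no calibrated checkpoint involved.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    kv_cache_dtype="fp8",
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What does KV cache quantization change?"], params)
print(outputs[0].outputs[0].text)
```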

Would appreciate your thoughts on this, thanks once again!

@dsikka self-assigned this on Jan 28, 2025
@dsikka (Collaborator) commented Jan 28, 2025

Hi @nelaturuharsha

(1) Can you expand on what you're trying to run in this example, i.e., share a sample command?
(2) compressed-tensors does not support running inference with the quantized KV cache. It only supports running calibration so that the cache can be quantized, i.e., so the k/v scales can be determined. These optimized scales are then saved to disk and loaded into vLLM during inference (rough sketch below).
(3) Yes, more fine-grained scales would be implemented through compressed-tensors/HF, but these would only be optimized with the goal of eventually running them in vLLM.
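
For reference, a rough sketch of that calibration flow, assuming the llm-compressor `oneshot` entry point and its YAML recipe format (argument names can differ slightly across versions); the model, dataset, and calibration settings below are illustrative rather than a specific recommendation:

```python
from transformers import AutoModelForCausalLM
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative

# Recipe that calibrates static, per-tensor FP8 scales for the KV cache only.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

# Run calibration; the resulting k/v scales are serialized alongside the
# compressed-tensors config and read by vLLM at inference time.
oneshot(
    model=model,
    dataset="open_platypus",          # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Llama-3.1-8B-Instruct-FP8-KV",
)
```

The saved model directory can then be passed to vLLM like any other checkpoint, which loads the stored k/v scales at startup.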

@dsikka added the question (Further information is requested) label on Jan 28, 2025