Hello, thanks for the great library which has been extremely useful.
I had a few questions regarding KV Cache quantization:
(1) Do the scales ever get updated if kv_cache_dtype = 'fp8' is simply passed to vLLM without calibration via a recipe? (See the minimal example below for the scenario I mean.)
(2) Is the logic for applying the FP8 scales during quantization/dequantization of the KV cache handled in compressed-tensors, and is there a modification in vLLM to ensure these scales are applied? [Any reference would be great; I believe scales supplied via JSON are what the vLLM docs discuss.]
(3) Do you have any advice on the impact of needing calibrated scales for the KV cache in the first place? Have you noticed any effect when it is used in combination with static FP8 activation quantization or dynamic activation quantization?
(4) Would the right workstream for implementing more fine-grained scales be compressed-tensors + HF (considering vLLM only supports per-tensor scales)?
Would appreciate your thoughts on this, thanks once again!
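For concreteness, question (1) refers to something like the following minimal sketch (the model name is just a placeholder); no calibrated scales are provided anywhere, and whether vLLM derives or updates any scales in this case is exactly what I'm asking:

```python
# Scenario from question (1): enable an FP8 KV cache in vLLM without any
# calibration recipe or pre-computed k/v scales. Model name is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    kv_cache_dtype="fp8",  # no calibrated scales supplied
)

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```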
(1) Can you expand on what you're trying to run in this example, i.e., share a sample command?
(2) compressed-tensors does not support running inference with the quantized KV cache. It only supports running calibration so that the cache can be quantized, i.e. so the k/v scales can be determined. These optimized scales are then saved to disk and loaded by vLLM during inference (see the sketch below).
(3) Yes, more fine-grained scales would be implemented through compressed-tensors/HF, but they would only be optimized there, with the goal of eventually running in vLLM.
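To make (2) concrete, here is a rough sketch of the calibrate-then-load flow. It assumes the llm-compressor `oneshot` entrypoint with a `QuantizationModifier` `kv_cache_scheme`; the model name, dataset, output directory, and exact argument names/import paths are placeholders that may differ across versions:

```python
# Rough sketch: calibrate per-tensor FP8 k/v scales with llm-compressor,
# save them in the checkpoint, then run inference in vLLM.
# Model, dataset, and paths below are placeholders.
from llmcompressor.transformers import oneshot

# Recipe attaching a static, per-tensor FP8 quantization scheme to the KV cache.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

# One-shot calibration: the calibration data is run through the model so the
# observers can determine the k/v scales; the quantized checkpoint (including
# the scales) is written to the output directory.
oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    dataset="open_platypus",                      # placeholder calibration set
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-FP8-KV", # placeholder path
    max_seq_length=2048,
    num_calibration_samples=512,
)

# Inference happens in vLLM, which reads the saved scales from the checkpoint.
from vllm import LLM

llm = LLM(model="Meta-Llama-3-8B-Instruct-FP8-KV", kv_cache_dtype="fp8")
```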