
[Clarification] Regarding KV Cache quantization and FP8 Scales #1104

nelaturuharsha opened this issue Jan 27, 2025 · 1 comment

@nelaturuharsha

Hello, thanks for the great library, which has been extremely useful.

I had a few questions regarding KV Cache quantization:

(1) Do the k/v scales ever get updated if kv_cache_dtype = 'fp8' is simply passed to vLLM, without calibration via a recipe? (See the snippet below for the kind of usage I mean.)
(2) Is the logic for applying the FP8 scales during quantization/dequantization of the KV cache handled in compressed-tensors, or is there a modification in vLLM that ensures they are applied? [Any reference would be great; I believe scales loaded via JSON are what the vLLM docs discuss.]
(3) Do you have any advice on the effect/impact of needing calibrated scales for the KV cache in the first place? Have you noticed any impact when KV cache quantization is used in combination with static FP8 activation quantization or dynamic activation quantization?
(4) Would the right work-stream for implementing more fine-grained scales be compressed-tensors + HF (considering that vLLM only supports per-tensor scales)?
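
For context, by (1) I mean usage roughly along these lines; this is a minimal sketch, and the model name, prompt, and sampling settings are purely illustrative:

```python
from vllm import LLM, SamplingParams

# Quantize the KV cache to FP8 at runtime, with no calibrated checkpoint involved.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
    kv_cache_dtype="fp8",
)

params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["What does KV cache quantization change?"], params)
print(outputs[0].outputs[0].text)
```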

Would appreciate your thoughts on this, thanks once again!

@dsikka self-assigned this on Jan 28, 2025
@dsikka (Collaborator) commented Jan 28, 2025

Hi @nelaturuharsha

(1) Can you expand on what you're trying to run in this example, i.e., share a sample command?
(2) compressed-tensors does not support running inference with the quantized KV cache. It only supports running calibration so that the cache can be quantized, i.e., so the k/v scales can be determined. These optimized scales are then saved to disk and loaded into vLLM during inference (rough sketch below).
(3) Yes, more fine-grained scales would be implemented through compressed-tensors/HF, but these would only be optimized with the goal of eventually running them in vLLM.
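
For reference, a rough sketch of that calibration flow, assuming the llm-compressor `oneshot` entry point and its YAML recipe format (argument names can differ slightly across versions); the model, dataset, and calibration settings below are illustrative rather than a specific recommendation:

```python
from transformers import AutoModelForCausalLM
from llmcompressor.transformers import oneshot

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative

# Recipe that calibrates static, per-tensor FP8 scales for the KV cache only.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", device_map="auto"
)

# Run calibration; the resulting k/v scales are serialized alongside the
# compressed-tensors config and read by vLLM at inference time.
oneshot(
    model=model,
    dataset="open_platypus",          # illustrative calibration dataset
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
    output_dir="Llama-3.1-8B-Instruct-FP8-KV",
)
```

The saved model directory can then be passed to vLLM like any other checkpoint, which loads the stored k/v scales at startup.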

@dsikka added the question (Further information is requested) label on Jan 28, 2025