
k_scale is on the meta device, we need a value to put in on cpu. #1192

Open
ZisIsNotZis opened this issue Feb 26, 2025 · 1 comment · May be fixed by neuralmagic/compressed-tensors#261
Labels: bug (Something isn't working)

Comments


ZisIsNotZis commented Feb 26, 2025

Describe the bug
model.save_pretrained raised "ValueError: k_scale is on the meta device, we need a `value` to put in on cpu." when quantizing Qwen2.5-14B-Instruct to w8a8k8 on a 4090. Because the model is loaded with device_map='auto' and does not fit on the GPU, some of its weights are offloaded to CPU.
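
For context, a quick diagnostic (my own sketch, not part of the failing call) lists every tensor that is still on the meta device right before saving; in this report the offending tensor is k_scale:

# Diagnostic sketch: enumerate parameters and buffers left on the meta device
# before save_pretrained is called.
for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
    if tensor.is_meta:
        print(name, tuple(tensor.shape))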

Expected behavior
Save successfully

Environment
Include all relevant environment information:

  1. OS: Ubuntu 25.04
  2. Python version: 3.12
  3. LLM Compressor version or commit hash: 0.4.1
  4. ML framework version(s): torch 2.5.1
  5. Other Python package versions: vllm=0.7.3 compressed-tensors=0.9.2 numpy=1.26.4 onnx=1.17.0
  6. Other relevant environment information: 4090 Driver Version: 560.35.03 CUDA Version: 12.6

To Reproduce

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor.transformers import oneshot
MODEL_ID = 'Qwen/Qwen2.5-14B-Instruct'
# device_map='auto' offloads part of the 14B model to CPU on a single 4090.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map='auto',
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""
# Run one-shot calibration and quantization with the recipe above.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
SAVE_DIR = MODEL_ID.split("/")[1] + "-w8a8k8"
model.save_pretrained(SAVE_DIR, save_compressed=True)  # raises the ValueError below
tokenizer.save_pretrained(SAVE_DIR)

Errors

     50 oneshot(
     51     model=model,
     52     dataset=ds,
   (...)
     55     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
     56 )
     57 SAVE_DIR = MODEL_ID.split("/")[1] + "-w8a8k8"
---> 58 model.save_pretrained(SAVE_DIR, save_compressed=True)
     59 tokenizer.save_pretrained(SAVE_DIR)

File ~/.local/lib/python3.12/site-packages/llmcompressor/transformers/sparsification/compressed_tensors_utils.py:167, in modify_save_pretrained.<locals>.save_pretrained_compressed.<locals>.save_pretrained_wrapper(save_directory, sparsity_config, quantization_format, save_compressed, skip_compression_stats, disable_sparse_compression, **kwargs)
    165 state_dict = kwargs.pop("state_dict", None)
    166 if state_dict is None:
--> 167     state_dict = get_state_dict_offloaded_model(model)
    169 compressor = get_model_compressor(
    170     model=model,
    171     sparsity_config=sparsity_config,
   (...)
    176     disable_sparse_compression=disable_sparse_compression,
    177 )
    179 if compressor is None:
    180     # model is not compressed or quantized, save as normal

File ~/.local/lib/python3.12/site-packages/accelerate/utils/modeling.py:1693, in get_state_dict_offloaded_model(model)
   1690     continue
   1692 try:
-> 1693     with align_module_device(module, "cpu"):
   1694         module_state_dict = module.state_dict()
   1695 except MemoryError:

File /usr/lib/python3.12/contextlib.py:137, in _GeneratorContextManager.__enter__(self)
    135 del self.args, self.kwds, self.func
    136 try:
--> 137     return next(self.gen)
    138 except StopIteration:
    139     raise RuntimeError("generator didn't yield") from None

File ~/.local/lib/python3.12/site-packages/accelerate/utils/modeling.py:2094, in align_module_device(module, execution_device)
   2092 try:
   2093     for name in devices:
-> 2094         set_module_tensor_to_device(module, name, execution_device)
   2095     yield
   2096 finally:

File ~/.local/lib/python3.12/site-packages/accelerate/utils/modeling.py:278, in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
    275     return
    277 if old_value.device == torch.device("meta") and device not in ["meta", torch.device("meta")] and value is None:
--> 278     raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.")
    280 param = module._parameters[tensor_name] if tensor_name in module._parameters else None
    281 param_cls = type(param)

ValueError: k_scale is on the meta device, we need a `value` to put in on cpu.
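
The check that raises is accelerate's set_module_tensor_to_device guard for meta tensors. A minimal sketch that reproduces the same ValueError outside llm-compressor (my own illustration; the dummy k_scale parameter below just mimics an uninitialized kv-cache scale):

import torch
from accelerate.utils import set_module_tensor_to_device

linear = torch.nn.Linear(4, 4)
# Register a placeholder parameter on the meta device, mimicking a k_scale that
# was never materialized on an offloaded attention module.
linear.register_parameter(
    "k_scale",
    torch.nn.Parameter(torch.empty(1, device="meta"), requires_grad=False),
)

# Moving it to cpu without passing value= hits the same guard and raises:
#   ValueError: k_scale is on the meta device, we need a `value` to put in on cpu.
set_module_tensor_to_device(linear, "k_scale", "cpu")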

@kylesayrs (Collaborator) commented

Hi @ZisIsNotZis!

Thanks for reporting this bug, I'm glad we caught it. I've linked a PR to compressed-tensors above (neuralmagic/compressed-tensors#261) which should fix the issue; please let me know if it does not.
