
k_scale is on the meta device, we need a value to put in on cpu. #1192

Open
ZisIsNotZis opened this issue Feb 26, 2025 · 1 comment · May be fixed by neuralmagic/compressed-tensors#261
Labels: bug (Something isn't working)

Comments


ZisIsNotZis commented Feb 26, 2025

Describe the bug
model.save_pretrained raised "ValueError: k_scale is on the meta device, we need a `value` to put in on cpu." when quantizing Qwen2.5-14B-Instruct to w8a8k8 on a 4090. Because the model is loaded with device_map='auto' and does not fit on the GPU, some of its weights are offloaded to CPU.
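
For context, a quick diagnostic (my own sketch, not part of the failing call) lists every tensor that is still on the meta device right before saving; in this report the offending tensor is k_scale:

# Diagnostic sketch: enumerate parameters and buffers left on the meta device
# before save_pretrained is called.
for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
    if tensor.is_meta:
        print(name, tuple(tensor.shape))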

Expected behavior
Save successfully

Environment
Include all relevant environment information:

  1. OS: Ubuntu 25.04
  2. Python version: 3.12
  3. LLM Compressor version or commit hash: 0.4.1
  4. ML framework version(s): torch 2.5.1
  5. Other Python package versions: vllm=0.7.3 compressed-tensors=0.9.2 numpy=1.26.4 onnx=1.17.0
  6. Other relevant environment information: 4090 Driver Version: 560.35.03 CUDA Version: 12.6

To Reproduce

from transformers import AutoModelForCausalLM, AutoTokenizer
from datasets import load_dataset
from llmcompressor.transformers import oneshot
MODEL_ID = 'Qwen/Qwen2.5-14B-Instruct'
# device_map='auto' offloads part of the 14B model to CPU on a single 4090.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map='auto',
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
def process_and_tokenize(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(text, padding=False, max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False)
ds = ds.map(process_and_tokenize, remove_columns=ds.column_names)
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""
# Run one-shot calibration and quantization with the recipe above.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
SAVE_DIR = MODEL_ID.split("/")[1] + "-w8a8k8"
model.save_pretrained(SAVE_DIR, save_compressed=True)  # raises the ValueError below
tokenizer.save_pretrained(SAVE_DIR)

Errors

     50 oneshot(
     51     model=model,
     52     dataset=ds,
   (...)
     55     num_calibration_samples=NUM_CALIBRATION_SAMPLES,
     56 )
     57 SAVE_DIR = MODEL_ID.split("/")[1] + "-w8a8k8"
---> 58 model.save_pretrained(SAVE_DIR, save_compressed=True)
     59 tokenizer.save_pretrained(SAVE_DIR)

File ~/.local/lib/python3.12/site-packages/llmcompressor/transformers/sparsification/compressed_tensors_utils.py:167, in modify_save_pretrained.<locals>.save_pretrained_compressed.<locals>.save_pretrained_wrapper(save_directory, sparsity_config, quantization_format, save_compressed, skip_compression_stats, disable_sparse_compression, **kwargs)
    165 state_dict = kwargs.pop("state_dict", None)
    166 if state_dict is None:
--> 167     state_dict = get_state_dict_offloaded_model(model)
    169 compressor = get_model_compressor(
    170     model=model,
    171     sparsity_config=sparsity_config,
   (...)
    176     disable_sparse_compression=disable_sparse_compression,
    177 )
    179 if compressor is None:
    180     # model is not compressed or quantized, save as normal

File ~/.local/lib/python3.12/site-packages/accelerate/utils/modeling.py:1693, in get_state_dict_offloaded_model(model)
   1690     continue
   1692 try:
-> 1693     with align_module_device(module, "cpu"):
   1694         module_state_dict = module.state_dict()
   1695 except MemoryError:

File /usr/lib/python3.12/contextlib.py:137, in _GeneratorContextManager.__enter__(self)
    135 del self.args, self.kwds, self.func
    136 try:
--> 137     return next(self.gen)
    138 except StopIteration:
    139     raise RuntimeError("generator didn't yield") from None

File ~/.local/lib/python3.12/site-packages/accelerate/utils/modeling.py:2094, in align_module_device(module, execution_device)
   2092 try:
   2093     for name in devices:
-> 2094         set_module_tensor_to_device(module, name, execution_device)
   2095     yield
   2096 finally:

File ~/.local/lib/python3.12/site-packages/accelerate/utils/modeling.py:278, in set_module_tensor_to_device(module, tensor_name, device, value, dtype, fp16_statistics, tied_params_map)
    275     return
    277 if old_value.device == torch.device("meta") and device not in ["meta", torch.device("meta")] and value is None:
--> 278     raise ValueError(f"{tensor_name} is on the meta device, we need a `value` to put in on {device}.")
    280 param = module._parameters[tensor_name] if tensor_name in module._parameters else None
    281 param_cls = type(param)

ValueError: k_scale is on the meta device, we need a `value` to put in on cpu.
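
The check that raises is accelerate's set_module_tensor_to_device guard for meta tensors. A minimal sketch that reproduces the same ValueError outside llm-compressor (my own illustration; the dummy k_scale parameter below just mimics an uninitialized kv-cache scale):

import torch
from accelerate.utils import set_module_tensor_to_device

linear = torch.nn.Linear(4, 4)
# Register a placeholder parameter on the meta device, mimicking a k_scale that
# was never materialized on an offloaded attention module.
linear.register_parameter(
    "k_scale",
    torch.nn.Parameter(torch.empty(1, device="meta"), requires_grad=False),
)

# Moving it to cpu without passing value= hits the same guard and raises:
#   ValueError: k_scale is on the meta device, we need a `value` to put in on cpu.
set_module_tensor_to_device(linear, "k_scale", "cpu")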

@kylesayrs (Collaborator) commented

Hi @ZisIsNotZis!

Thanks for reporting this bug, I'm glad we caught it. I've linked a PR to compressed-tensors above (neuralmagic/compressed-tensors#261) which should fix the issue; please let me know if it does not.
