Whole model gets offloaded to the CPU #1122

Open
SzymonOzog opened this issue Feb 4, 2025 · 5 comments
@SzymonOzog

I'm running the following code to calculate an offload device map:

    from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

    MODEL_ID = "deepseek-ai/DeepSeek-R1"
    device_map = calculate_offload_device_map(MODEL_ID, num_gpus=8, reserve_for_hessians=True, trust_remote_code=True)

    print("calculated device map", device_map)

After it finishes, it offloads the whole model to the CPU (and disk), which results in very slow compression.

calculated device map OrderedDict({'model.embed_tokens': 'cpu', 'model.layers.0': 'cpu', 'model.layers.1': 'cpu', 'model.layers.2': 'cpu', 'model.layers.3': 'cpu', 'model.layers.4': 'cpu', 'model.layers.5': 'cpu', 'model.layers.6': 'cpu', 'model.layers.7': 'cpu', 'model.layers.8': 'cpu', 'model.layers.9': 'cpu', 'model.layers.10': 'cpu', 'model.layers.11': 'cpu', 'model.layers.12': 'cpu', 'model.layers.13': 'cpu', 'model.layers.14': 'cpu', 'model.layers.15': 'cpu', 'model.layers.16': 'cpu', 'model.layers.17': 'cpu', 'model.layers.18': 'cpu', 'model.layers.19': 'cpu', 'model.layers.20': 'cpu', 'model.layers.21': 'cpu', 'model.layers.22': 'cpu', 'model.layers.23': 'cpu', 'model.layers.24': 'cpu', 'model.layers.25': 'cpu', 'model.layers.26': 'cpu', 'model.layers.27': 'cpu', 'model.layers.28': 'disk', 'model.layers.29': 'disk', 'model.layers.30': 'disk', 'model.layers.31': 'disk', 'model.layers.32': 'disk', 'model.layers.33': 'disk', 'model.layers.34': 'disk', 'model.layers.35': 'disk', 'model.layers.36': 'disk', 'model.layers.37': 'disk', 'model.layers.38': 'disk', 'model.layers.39': 'disk', 'model.layers.40': 'disk', 'model.layers.41': 'disk', 'model.layers.42': 'disk', 'model.layers.43': 'disk', 'model.layers.44': 'disk', 'model.layers.45': 'disk', 'model.layers.46': 'disk', 'model.layers.47': 'disk', 'model.layers.48': 'disk', 'model.layers.49': 'disk', 'model.layers.50': 'disk', 'model.layers.51': 'disk', 'model.layers.52': 'disk', 'model.layers.53': 'disk', 'model.layers.54': 'disk', 'model.layers.55': 'disk', 'model.layers.56': 'disk', 'model.layers.57': 'disk', 'model.layers.58': 'disk', 'model.layers.59': 'disk', 'model.layers.60': 'disk', 'model.norm': 'disk', 'lm_head': 'disk'})

vllm 0.7.1
transformers 4.48.2
accelerate 1.0.1

Running on a node with 8xH100
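
For anyone who wants to see what accelerate itself would do with the available memory, here is a minimal sketch that builds a device map directly with infer_auto_device_map. The per-GPU budget and the no-split class name below are assumptions, not values taken from llmcompressor:

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_ID = "deepseek-ai/DeepSeek-R1"

# Instantiate the model on the meta device so no weights are actually loaded.
config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Assumption: leave ~10 GiB of headroom per 80 GiB H100 for hessians and
# activations; tune this to your setup.
max_memory = {i: "70GiB" for i in range(torch.cuda.device_count())}
max_memory["cpu"] = "1000GiB"  # allow spill-over to CPU rather than disk

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    # Assumption: the decoder-layer class name for this checkpoint; check the
    # model's remote code if it differs.
    no_split_module_classes=["DeepseekV3DecoderLayer"],
)
print(device_map)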

@endic-sam928281

Hello, we tried to solve the issue.

This is what we did:

We modified the calculate_offload_device_map function to make better use of the available GPU memory. The changes include:

  1. Increased the reserved memory for quantization and hessians.
  2. Added a safety margin to prevent GPU out of memory errors.
  3. Adjusted the memory calculation to account for the model's total size.
  4. Implemented a more balanced distribution of layers across available GPUs.

You can review the changes in this commit: endic-sam928281@757adce.
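
Roughly, the kind of adjustment described above would look like the following hand-written sketch (not the actual commit; the safety margin and hessian reserve are assumed values):

import torch

GPU_SAFETY_MARGIN = 0.90     # keep ~10% of each GPU free (assumed)
HESSIAN_RESERVE_GIB = 16     # per-GPU reserve for GPTQ hessians (assumed)

max_memory = {}
for i in range(torch.cuda.device_count()):
    total_gib = torch.cuda.get_device_properties(i).total_memory / 2**30
    usable_gib = int(total_gib * GPU_SAFETY_MARGIN - HESSIAN_RESERVE_GIB)
    max_memory[i] = f"{usable_gib}GiB"

# This max_memory dict can then be handed to accelerate's infer_auto_device_map
# so layers are spread across the GPUs before anything falls back to CPU/disk.
print(max_memory)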

Caution

Disclaimer: This solution was generated by AI. Do not copy-paste the code without verifying its correctness; it may be incomplete and should be used as inspiration only.


Latta AI seeks to solve problems in open source projects as part of its mission to support developers around the world. Learn more about our mission at https://latta.ai/ourmission. If you no longer want Latta AI to attempt to solve issues on your repository, you can block this account.

@kylesayrs kylesayrs self-assigned this Feb 4, 2025
@yunkchen

same problem~

@AmbroseX

AmbroseX commented Feb 18, 2025

Hi @SzymonOzog and @endic-sam928281,

I'm experiencing the same issue: the entire deepseek-r1 bf16 model gets offloaded to the CPU even though I have 8 GPUs available. Only about 501 MB of memory is used on the first GPU, while the rest of the GPUs remain idle, resulting in very slow performance.

I noticed that the calculate_offload_device_map function is not effectively utilizing the available GPU memory. Has there been any progress or updates on resolving this issue? Any guidance or workarounds would be greatly appreciated.

My setup is:
vllm 0.7.2
transformers 4.48.2
accelerate 1.0.1

a node with 8xH100

int8 W8A8 quantization

Thanks!
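
One quick sanity check before digging into calculate_offload_device_map itself is to confirm that all eight GPUs are actually visible to the Python process; a trivial sketch, nothing here is specific to llmcompressor:

import os
import torch

# If CUDA_VISIBLE_DEVICES is restricted to a single index, only that GPU is
# considered when the offload device map is built.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 2**30:.0f} GiB")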

@twoapples1

twoapples1 commented Feb 26, 2025

You can try this; I solved the problem this way:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# NOTE: transformers 4.48.0 has an import error with DeepSeek.
# Please consider either downgrading your transformers version to a
# previous version or upgrading to a version where this bug is fixed

# select a Mixture of Experts model for quantization
MODEL_ID = "opensourcerelease/DeepSeek-V3-bf16"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, 
    #device_map=device_map,
    device_map="auto",
    torch_dtype=torch.bfloat16, 
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"
NUM_CALIBRATION_SAMPLES = 64
MAX_SEQUENCE_LENGTH = 1024

print(model.hf_device_map)

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }
ds = ds.map(preprocess)

# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head", "re:.*mlp.gate$"],
        offload_hessians=True,
        #ignore=["lm_head","re:.*mlp.gate$","re:.*mlp.experts*"],
    ),
]

SAVE_DIR = "./DeepSeek-V3-w8a8"

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    save_compressed=True,
    output_dir=SAVE_DIR,
)
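
For what it's worth, the two changes that seem to matter relative to the calculate_offload_device_map approach in the original report are loading the model with device_map="auto" and passing offload_hessians=True to GPTQModifier, which keeps the GPTQ hessians out of GPU memory during calibration.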

@halexan

halexan commented Feb 27, 2025

The int8 W8A8 quantized model is still very large; it needs 2 × 8 A100s to deploy.

Has anyone tried int4 W4A16 quantization?
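
For reference, a W4A16 variant of the recipe above would only swap the scheme string; this is an untested sketch for this model:

from llmcompressor.modifiers.quantization import GPTQModifier

# Same structure as the W8A8 recipe above, switched to 4-bit weight-only
# quantization; untested on DeepSeek-V3/R1.
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        ignore=["lm_head", "re:.*mlp.gate$"],
        offload_hessians=True,
    ),
]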
