Whole model gets offloaded to the CPU #1122

Open
SzymonOzog opened this issue Feb 4, 2025 · 5 comments
@SzymonOzog

I'm running the following code to calculate an offload device map:

    from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

    MODEL_ID = "deepseek-ai/DeepSeek-R1"
    device_map = calculate_offload_device_map(MODEL_ID, num_gpus=8, reserve_for_hessians=True, trust_remote_code=True)

    print("calculated device map", device_map)

After it finishes, it offloads the whole model to the CPU (and disk), which results in very slow compression.

calculated device map OrderedDict({'model.embed_tokens': 'cpu', 'model.layers.0': 'cpu', 'model.layers.1': 'cpu', 'model.layers.2': 'cpu', 'model.layers.3': 'cpu', 'model.layers.4': 'cpu', 'model.layers.5': 'cpu', 'model.layers.6': 'cpu', 'model.layers.7': 'cpu', 'model.layers.8': 'cpu', 'model.layers.9': 'cpu', 'model.layers.10': 'cpu', 'model.layers.11': 'cpu', 'model.layers.12': 'cpu', 'model.layers.13': 'cpu', 'model.layers.14': 'cpu', 'model.layers.15': 'cpu', 'model.layers.16': 'cpu', 'model.layers.17': 'cpu', 'model.layers.18': 'cpu', 'model.layers.19': 'cpu', 'model.layers.20': 'cpu', 'model.layers.21': 'cpu', 'model.layers.22': 'cpu', 'model.layers.23': 'cpu', 'model.layers.24': 'cpu', 'model.layers.25': 'cpu', 'model.layers.26': 'cpu', 'model.layers.27': 'cpu', 'model.layers.28': 'disk', 'model.layers.29': 'disk', 'model.layers.30': 'disk', 'model.layers.31': 'disk', 'model.layers.32': 'disk', 'model.layers.33': 'disk', 'model.layers.34': 'disk', 'model.layers.35': 'disk', 'model.layers.36': 'disk', 'model.layers.37': 'disk', 'model.layers.38': 'disk', 'model.layers.39': 'disk', 'model.layers.40': 'disk', 'model.layers.41': 'disk', 'model.layers.42': 'disk', 'model.layers.43': 'disk', 'model.layers.44': 'disk', 'model.layers.45': 'disk', 'model.layers.46': 'disk', 'model.layers.47': 'disk', 'model.layers.48': 'disk', 'model.layers.49': 'disk', 'model.layers.50': 'disk', 'model.layers.51': 'disk', 'model.layers.52': 'disk', 'model.layers.53': 'disk', 'model.layers.54': 'disk', 'model.layers.55': 'disk', 'model.layers.56': 'disk', 'model.layers.57': 'disk', 'model.layers.58': 'disk', 'model.layers.59': 'disk', 'model.layers.60': 'disk', 'model.norm': 'disk', 'lm_head': 'disk'})

vllm 0.7.1
transformers 4.48.2
accelerate 1.0.1

Running on a node with 8xH100
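
For anyone who wants to see what accelerate itself would do with the available memory, here is a minimal sketch that builds a device map directly with infer_auto_device_map. The per-GPU budget and the no-split class name below are assumptions, not values taken from llmcompressor:

import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

MODEL_ID = "deepseek-ai/DeepSeek-R1"

# Instantiate the model on the meta device so no weights are actually loaded.
config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# Assumption: leave ~10 GiB of headroom per 80 GiB H100 for hessians and
# activations; tune this to your setup.
max_memory = {i: "70GiB" for i in range(torch.cuda.device_count())}
max_memory["cpu"] = "1000GiB"  # allow spill-over to CPU rather than disk

device_map = infer_auto_device_map(
    model,
    max_memory=max_memory,
    # Assumption: the decoder-layer class name for this checkpoint; check the
    # model's remote code if it differs.
    no_split_module_classes=["DeepseekV3DecoderLayer"],
)
print(device_map)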

@endic-sam928281

Hello, we tried to solve the issue.

This is what we did:

We modified the calculate_offload_device_map function to make better use of the available GPU memory. The changes include:

  1. Increased the reserved memory for quantization and hessians.
  2. Added a safety margin to prevent GPU out of memory errors.
  3. Adjusted the memory calculation to account for the model's total size.
  4. Implemented a more balanced distribution of layers across available GPUs.

You can review the changes in this commit: endic-sam928281@757adce.
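
Roughly, the kind of adjustment described above would look like the following hand-written sketch (not the actual commit; the safety margin and hessian reserve are assumed values):

import torch

GPU_SAFETY_MARGIN = 0.90     # keep ~10% of each GPU free (assumed)
HESSIAN_RESERVE_GIB = 16     # per-GPU reserve for GPTQ hessians (assumed)

max_memory = {}
for i in range(torch.cuda.device_count()):
    total_gib = torch.cuda.get_device_properties(i).total_memory / 2**30
    usable_gib = int(total_gib * GPU_SAFETY_MARGIN - HESSIAN_RESERVE_GIB)
    max_memory[i] = f"{usable_gib}GiB"

# This max_memory dict can then be handed to accelerate's infer_auto_device_map
# so layers are spread across the GPUs before anything falls back to CPU/disk.
print(max_memory)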

Caution

Disclaimer: This solution was generated by AI. Do not copy-paste the code without verifying its correctness; it may be incomplete and should be used as inspiration only.


Latta AI seeks to solve problems in open source projects as part of its mission to support developers around the world. Learn more about our mission at https://latta.ai/ourmission. If you no longer want Latta AI to attempt to solve issues on your repository, you can block this account.

@kylesayrs kylesayrs self-assigned this Feb 4, 2025
@yunkchen

same problem~

@AmbroseX

AmbroseX commented Feb 18, 2025

Hi @SzymonOzog and @endic-sam928281,

I'm experiencing the same issue: the entire deepseek-r1 bf16 model gets offloaded to the CPU even though I have 8 GPUs available. Only about 501 MB of memory is used on the first GPU, while the rest of the GPUs remain idle, resulting in very slow performance.

I noticed that the calculate_offload_device_map function is not effectively utilizing the available GPU memory. Has there been any progress or updates on resolving this issue? Any guidance or workarounds would be greatly appreciated.

My setup is:
vllm 0.7.2
transformers 4.48.2
accelerate 1.0.1

a node with 8xH100

int8 W8A8 quantization

Thanks!
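
One quick sanity check before digging into calculate_offload_device_map itself is to confirm that all eight GPUs are actually visible to the Python process; a trivial sketch, nothing here is specific to llmcompressor:

import os
import torch

# If CUDA_VISIBLE_DEVICES is restricted to a single index, only that GPU is
# considered when the offload device map is built.
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("visible GPUs:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(i, props.name, f"{props.total_memory / 2**30:.0f} GiB")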

@twoapples1

twoapples1 commented Feb 26, 2025

You can try this; I solved the problem this way:

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# NOTE: transformers 4.48.0 has an import error with DeepSeek.
# Please consider either downgrading your transformers version to a
# previous version or upgrading to a version where this bug is fixed

# select a Mixture of Experts model for quantization
MODEL_ID = "opensourcerelease/DeepSeek-V3-bf16"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, 
    #device_map=device_map,
    device_map="auto",
    torch_dtype=torch.bfloat16, 
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

DATASET_ID = "neuralmagic/LLM_compression_calibration"
DATASET_SPLIT = "train"
NUM_CALIBRATION_SAMPLES = 64
MAX_SEQUENCE_LENGTH = 1024

print(model.hf_device_map)

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }
ds = ds.map(preprocess)

# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head", "re:.*mlp.gate$"],
        offload_hessians=True,
        #ignore=["lm_head","re:.*mlp.gate$","re:.*mlp.experts*"],
    ),
]

SAVE_DIR = "./DeepSeek-V3-w8a8"

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    save_compressed=True,
    output_dir=SAVE_DIR,
)
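
For what it's worth, the two changes that seem to matter relative to the calculate_offload_device_map approach in the original report are loading the model with device_map="auto" and passing offload_hessians=True to GPTQModifier, which keeps the GPTQ hessians out of GPU memory during calibration.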

@halexan

halexan commented Feb 27, 2025

The int8 W8A8 quantized model is still very large; it needs 2 × 8 A100s to deploy.

Has anyone tried int4 W4A16 quantization?
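
For reference, a W4A16 variant of the recipe above would only swap the scheme string; this is an untested sketch for this model:

from llmcompressor.modifiers.quantization import GPTQModifier

# Same structure as the W8A8 recipe above, switched to 4-bit weight-only
# quantization; untested on DeepSeek-V3/R1.
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        ignore=["lm_head", "re:.*mlp.gate$"],
        offload_hessians=True,
    ),
]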
