grad_norm becomes nan when finetune 9b models #12

Open
zero90169 opened this issue Nov 18, 2024 · 2 comments

Comments

@zero90169

zero90169 commented Nov 18, 2024

First, thanks for your great work. I've tried to fine-tune the Yi-Coder-9B-Chat model on my own dataset, but I ran into the following problem.

Problem

'grad_norm' becomes nan when I try to fine-tune the Yi-Coder-9B-Chat model.

Detailed Description

At the first logged step, grad_norm becomes nan, and the loss then drops to zero because of the nan gradients.

{'loss': 9.6782, 'grad_norm': nan, 'learning_rate': 0.0008461538461538462, 'epoch': 0.15}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0006923076923076923, 'epoch': 0.31}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0005384615384615384, 'epoch': 0.46}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.00038461538461538467, 'epoch': 0.62}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0002307692307692308, 'epoch': 0.77}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 7.692307692307693e-05, 'epoch': 0.92}

But when I use the same code and only change the model to CodeLlama-13b-Instruct-hf, everything works as expected.
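
As a debugging aid, here is a minimal sketch (not part of the original run) for finding which parameter first produces a non-finite gradient. It assumes a standard PyTorch model such as trainer.model from the reproduction script below; the function name is just illustrative.

import torch

def register_nan_grad_hooks(model):
    # Report any parameter whose gradient contains NaN/Inf during backward.
    def make_hook(name):
        def hook(grad):
            if not torch.isfinite(grad).all():
                print(f"non-finite gradient in: {name}")
            return grad
        return hook
    for name, param in model.named_parameters():
        if param.requires_grad:
            param.register_hook(make_hook(name))

# usage, before trainer.train():
# register_nan_grad_hooks(trainer.model)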

Reproduction Code

To make this reproducible, I've switched from my own dataset to the public dataset Genshin_Character_instruction/Genshin_Character_instruction.json, which can be found on Hugging Face.
link: https://huggingface.co/datasets/YanFu0320/Genshin_Character_instruction

from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM, SFTConfig
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "INFO"
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"


base_model_name = "MY_PATH_TO_MODEL/Yi-Coder-9B-Chat"

dataset = load_dataset(
    "json",
    data_files="MY_PATH_TO_DATASET/Genshin_Character_instruction/Genshin_Character_instruction.json",
    split="train",
)

print(dataset)


def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example["instruction"])):
        text = f"<|startoftext|>user {example['instruction'][i]} <|im_end|> \n <|startoftext|>assistant \n ### Answer: {example['output'][i]} <|im_end|>"
        output_texts.append(text)
    return output_texts


result_dir = "save_model"

training_args = SFTConfig(
    report_to="none",
    output_dir=result_dir,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-3,
    logging_steps=8,
    num_train_epochs=1,
    save_steps=200,
    bf16=True,
    gradient_checkpointing=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)


def find_all_linear_names(model):
    # Collect the names of all nn.Linear modules to use as LoRA target_modules.
    cls = torch.nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)


models = find_all_linear_names(base_model)

print(models)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=models,
)


tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)

response_template = "<|startoftext|>assistant"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template, tokenizer=tokenizer, mlm=False
)

max_seq_length = 256
trainer = SFTTrainer(
    model=base_model,
    formatting_func=formatting_prompts_func,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    data_collator=collator,
    args=training_args,
)

trainer.train()


OUTPUT_DIR = "save_genshin"
output_dir = os.path.join(result_dir, OUTPUT_DIR)

trainer.model.save_pretrained(output_dir)
trainer.model.config.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

System and Environment Settings

System

Platform: AzureML
GPU: A100

Related package versions

accelerate==1.0.1
bitsandbytes==0.42.0
deepspeed==0.15.2
openai==1.40.0
tokenizers==0.20.3
torch==2.4.0
torchvision==0.19.0
vllm==0.6.3.post1
transformers==4.45.2
trl==0.11.0
peft==0.11.0
flash-attn==2.6.2
@nuoma
Collaborator

nuoma commented Nov 18, 2024

Sorry, but I don't have a clear answer to this.

As we discussed on Discord, there are many details in this process that could go wrong, especially since you are using your own training framework. Here are the usual suspects we checked:

  1. You mentioned your framework works well with other models such as deepseek-coder, llama3, and mistral.
  2. For the grad_norm nan, check that the training precision is set to bf16.
  3. Quality-check your dataset.
  4. Consider updating bitsandbytes==0.42.0 to a newer version; developers have reported similar issues. Also check other dependencies such as deepspeed.
  5. A very large loss at the beginning might indicate the LR is too large; you have already tried a smaller value of 1e-5 and gradient clipping (see the sketch below).
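
As a minimal sketch of points 2 and 5 (assuming the same trl SFTConfig used in the reproduction script; the values are illustrative assumptions, not tested recommendations):

from trl import SFTConfig

training_args = SFTConfig(
    report_to="none",
    output_dir="save_model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,       # point 5: much smaller LR than the original 1e-3
    max_grad_norm=1.0,        # point 5: gradient clipping (the Trainer default)
    bf16=True,                # point 2: confirm bf16 mixed precision is enabled
    logging_steps=8,
    num_train_epochs=1,
    gradient_checkpointing=True,
)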

I hope smarter minds can answer this.

@zero90169
Author

Thank you for your effort. I will spend more time looking into this issue, and I'll post an update if I find anything.
