grad_norm becomes nan when finetune 9b models #12

Open
zero90169 opened this issue Nov 18, 2024 · 2 comments

Comments

@zero90169

zero90169 commented Nov 18, 2024

First, thanks for your great work. I've tried to fine-tune the Yi-Coder-9B-Chat model on my own dataset, but I ran into the following problem.

Problem

'grad_norm' becomes nan when I try to fine-tune the Yi-Coder-9B-Chat model.

Detailed Description

At the first logged step, grad_norm becomes nan, and the loss then drops to zero because of the nan gradients.

{'loss': 9.6782, 'grad_norm': nan, 'learning_rate': 0.0008461538461538462, 'epoch': 0.15}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0006923076923076923, 'epoch': 0.31}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0005384615384615384, 'epoch': 0.46}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.00038461538461538467, 'epoch': 0.62}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 0.0002307692307692308, 'epoch': 0.77}
{'loss': 0.0, 'grad_norm': nan, 'learning_rate': 7.692307692307693e-05, 'epoch': 0.92}

But when I use the same code and only change the model to CodeLlama-13b-Instruct-hf, everything works as expected.
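
As a debugging aid, here is a minimal sketch (not part of the original run) for finding which parameter first produces a non-finite gradient. It assumes a standard PyTorch model such as trainer.model from the reproduction script below; the function name is just illustrative.

import torch

def register_nan_grad_hooks(model):
    # Report any parameter whose gradient contains NaN/Inf during backward.
    def make_hook(name):
        def hook(grad):
            if not torch.isfinite(grad).all():
                print(f"non-finite gradient in: {name}")
            return grad
        return hook
    for name, param in model.named_parameters():
        if param.requires_grad:
            param.register_hook(make_hook(name))

# usage, before trainer.train():
# register_nan_grad_hooks(trainer.model)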

Reproduction Code

To make this reproducible, I've switched from my own dataset to the public dataset Genshin_Character_instruction/Genshin_Character_instruction.json, which can be found on Hugging Face.
link: https://huggingface.co/datasets/YanFu0320/Genshin_Character_instruction

from datasets import load_dataset
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import SFTTrainer, DataCollatorForCompletionOnlyLM, SFTConfig
import os

os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "INFO"
os.environ["TORCH_SHOW_CPP_STACKTRACES"] = "1"


base_model_name = "MY_PATH_TO_MODEL/Yi-Coder-9B-Chat"

dataset = load_dataset(
    "json",
    data_files="MY_PATH_TO_DATASET/Genshin_Character_instruction/Genshin_Character_instruction.json",
    split="train",
)

print(dataset)


def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example["instruction"])):
        text = f"<|startoftext|>user {example['instruction'][i]} <|im_end|> \n <|startoftext|>assistant \n ### Answer: {example['output'][i]} <|im_end|>"
        output_texts.append(text)
    return output_texts


result_dir = "save_model"

training_args = SFTConfig(
    report_to="none",
    output_dir=result_dir,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-3,
    logging_steps=8,
    num_train_epochs=1,
    save_steps=200,
    bf16=True,
    gradient_checkpointing=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)


def find_all_linear_names(model):
    # Collect the names of all nn.Linear modules to use as LoRA target_modules.
    cls = torch.nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split(".")
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if "lm_head" in lora_module_names:  # needed for 16-bit
        lora_module_names.remove("lm_head")
    return list(lora_module_names)


models = find_all_linear_names(base_model)

print(models)

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=models,
)


tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)

response_template = "<|startoftext|>assistant"
collator = DataCollatorForCompletionOnlyLM(
    response_template=response_template, tokenizer=tokenizer, mlm=False
)

max_seq_length = 256
trainer = SFTTrainer(
    model=base_model,
    formatting_func=formatting_prompts_func,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    data_collator=collator,
    args=training_args,
)

trainer.train()


OUTPUT_DIR = "save_genshin"
output_dir = os.path.join(result_dir, OUTPUT_DIR)

trainer.model.save_pretrained(output_dir)
trainer.model.config.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

System and Environment Settings

System

Platform: AzureML
GPU: A100

Related package versions

accelerate==1.0.1
bitsandbytes==0.42.0
deepspeed==0.15.2
openai==1.40.0
tokenizers==0.20.3
torch==2.4.0
torchvision==0.19.0
vllm==0.6.3.post1
transformers==4.45.2
trl==0.11.0
peft==0.11.0
flash-attn==2.6.2
@nuoma
Collaborator

nuoma commented Nov 18, 2024

Sorry, but I don't have a clear answer to this.

As we discussed on Discord, there are many details in this process that could go wrong, especially since you are using your own training framework. Here are the usual suspects we checked:

  1. You mentioned your framework works well with other models such as deepseek-coder, llama3, and mistral.
  2. For the grad_norm nan, check that the training precision is set to bf16.
  3. Quality-check your dataset.
  4. Consider updating bitsandbytes==0.42.0 to a newer version; developers have reported similar issues. Also check other dependencies such as deepspeed.
  5. A very large loss at the beginning might indicate the LR is too large; you have already tried a smaller value of 1e-5 and gradient clipping (see the sketch below).
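
As a minimal sketch of points 2 and 5 (assuming the same trl SFTConfig used in the reproduction script; the values are illustrative assumptions, not tested recommendations):

from trl import SFTConfig

training_args = SFTConfig(
    report_to="none",
    output_dir="save_model",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-5,       # point 5: much smaller LR than the original 1e-3
    max_grad_norm=1.0,        # point 5: gradient clipping (the Trainer default)
    bf16=True,                # point 2: confirm bf16 mixed precision is enabled
    logging_steps=8,
    num_train_epochs=1,
    gradient_checkpointing=True,
)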

I hope smarter minds can answer this.

@zero90169
Author

Thank you for your effort. I will spend more time looking into this issue, and I'll post an update if I find anything.
