about evaluation #179

Closed
haoyuwangwhy opened this issue Feb 20, 2024 · 26 comments

Labels
question Further information is requested

Comments
@haoyuwangwhy

I train the edited model on the benchmark_wiki_recent_recent_train.json data. Is FT_results.json, the file saved after training, the evaluation result? I compute the average of the post rewrite_acc values in that file, but the result is much higher than the one reported for the implemented FT model in your survey paper. I want to know if I am misunderstanding something here. Thanks.
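
For reference, this is roughly how I compute the average (a minimal sketch; I am assuming FT_results.json is a JSON list of per-sample dicts whose "post" entry holds a rewrite_acc value, stored either as a float or as a list of floats):

import json
from statistics import mean

# Minimal sketch: average the post-edit rewrite accuracy over all samples.
# Assumes FT_results.json is a list of per-sample dicts whose "post" entry
# holds "rewrite_acc" as a float or a list of floats.
with open("FT_results.json") as f:
    results = json.load(f)

def as_scalar(value):
    # rewrite_acc may be a single float or a per-token list of floats
    return mean(value) if isinstance(value, list) else float(value)

rewrite_accs = [as_scalar(r["post"]["rewrite_acc"]) for r in results]
print("average post rewrite_acc:", mean(rewrite_accs))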

@pengzju pengzju added the question Further information is requested label Feb 20, 2024
@haoyuwangwhy
Author

Is there anyone who can solve this issue?

@zxlzr
Contributor

zxlzr commented Feb 21, 2024

Hi, sorry for the late reply. We will check this issue ASAP.

@XeeKee
Collaborator

XeeKee commented Feb 22, 2024

FT_results.json is the evaluation result. It might be due to the fact that your hyperparameters are different from ours. Besides rewrite_acc, are there significant differences in your other metric results compared to our survey paper?

@haoyuwangwhy
Author

> FT_results.json is the evaluation result. It might be due to the fact that your hyperparameters are different from ours. Besides rewrite_acc, are there significant differences in your other metric results compared to our survey paper?

Yes, the others are different as well. Actually, I downloaded and ran the code without changing any hyperparameters. I ran this script:

python run_knowedit_llama2.py \
    --editing_method=LoRA \
    --hparams_dir=../hparams/LoRA/llama-7b \
    --data_dir=./KnowEdit/benchmark_wiki_recent_recent_train.json \
    --datatype='recent'

And I obtain FT_results.json. Then I load the JSON file and compute the average post rewrite_acc. So I wonder whether I need to use the benchmark_wiki_recent_recent_test.json file for evaluation or not.

@XeeKee
Collaborator

XeeKee commented Feb 22, 2024

Our experiments were conducted on benchmark_wiki_recent_recent_test. If you are using the default parameters, note that our FT defaults only fine-tune the 21st layer, so you should compare your results with those of FT-L.

@haoyuwangwhy
Author

> Our experiments were conducted on benchmark_wiki_recent_recent_test. If you are using the default parameters, note that our FT defaults only fine-tune the 21st layer, so you should compare your results with those of FT-L.

Do you mean I should use the following script to train the model:

python run_knowedit_llama2.py \
    --editing_method=FT \
    --hparams_dir=../hparams/FT/llama-7b \
    --data_dir=./KnowEdit/benchmark_wiki_recent_recent_train.json \
    --datatype='recent'

and then run

python run_knowedit_llama2.py \
    --editing_method=FT \
    --hparams_dir=../hparams/FT/llama-7b \
    --data_dir=./KnowEdit/benchmark_wiki_recent_recent_test.json \
    --datatype='recent'

to do the evaluation directly?

@XeeKee
Collaborator

XeeKee commented Feb 22, 2024

python run_knowedit_llama2.py \
    --editing_method=FT \
    --hparams_dir=../hparams/FT/llama-7b \
    --data_dir=./KnowEdit/benchmark_wiki_recent_recent_test.json \
    --datatype='recent'

Just use the script provided above.

@haoyuwangwhy
Author

> python run_knowedit_llama2.py \
>     --editing_method=FT \
>     --hparams_dir=../hparams/FT/llama-7b \
>     --data_dir=./KnowEdit/benchmark_wiki_recent_recent_test.json \
>     --datatype='recent'
>
> Just use the script provided above.

Thank you so much!

@haoyuwangwhy
Author

> python run_knowedit_llama2.py \
>     --editing_method=FT \
>     --hparams_dir=../hparams/FT/llama-7b \
>     --data_dir=./KnowEdit/benchmark_wiki_recent_recent_test.json \
>     --datatype='recent'
>
> Just use the script provided above.

I have tried the script. However, I got the following results for the "post" key: rewrite_acc=1.0, locality=0.75, portability=0.51. It seems rewrite_acc differs significantly from your survey paper. I am using llama2-7b-hf from Hugging Face.

@XeeKee
Collaborator

XeeKee commented Feb 23, 2024

We conducted experiments using llama-2-chat, and it is very strange that the average rewrite_acc in the dataset is 1.0. I randomly sampled 10 examples and found that rewrite_acc=0.0 occurred in all of them. I will conduct a thorough investigation to determine where the problem lies.

@haoyuwangwhy
Author

> We conducted experiments using llama-2-chat, and it is very strange that the average rewrite_acc in the dataset is 1.0. I randomly sampled 10 examples and found that rewrite_acc=0.0 occurred in all of them. I will conduct a thorough investigation to determine where the problem lies.

FT_results.json
benchmark_wiki_recent_recent_test.json
Attached are the FT_results file and the dataset I used.

@XeeKee
Collaborator

XeeKee commented Feb 23, 2024

We ran our experiments on an A800 using the llama2-7b-chat model. The results we obtained are similar to those reported in the original paper. It is worth noting that our evaluation time is approximately 20 seconds per sample, whereas yours seems to be much shorter. As for why your rewrite_acc is so high, we are also not sure.

@haoyuwangwhy
Author

> We ran our experiments on an A800 using the llama2-7b-chat model. The results we obtained are similar to those reported in the original paper. It is worth noting that our evaluation time is approximately 20 seconds per sample, whereas yours seems to be much shorter. As for why your rewrite_acc is so high, we are also not sure.

Is the dataset the same as yours? I am using the following hyperparameters:
alg_name: "FT"
model_name: "./hugging_cache/llama-2-7b-chat"
device: 0

layers: [21]
num_steps: 25
batch_size: 1
max_length: 40
lr: 5e-4
weight_decay: 0
kl_factor: 0
norm_constraint: false
rewrite_module_tmp: "model.layers.{}.mlp.down_proj.weight"
layer_module_tmp: "model.layers.{}"
mlp_module_tmp: "model.layers.{}.mlp"
attn_module_tmp: "model.layers.{}.self_attn"
ln_f_module: "model.norm"
lm_head_module: "lm_head"
model_parallel: false
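
In case it helps to pin down any difference, this is a quick way I can diff my hparams file against the repo default (a minimal sketch using PyYAML; both file paths below are placeholders):

import yaml

# Minimal sketch: compare two hparams YAML files key by key and print any
# values that differ. Both paths are placeholders for local copies.
with open("my_llama-7b.yaml") as f:
    mine = yaml.safe_load(f)
with open("repo_default_llama-7b.yaml") as f:
    default = yaml.safe_load(f)

for key in sorted(set(mine) | set(default)):
    if mine.get(key) != default.get(key):
        print(f"{key}: mine={mine.get(key)!r}, default={default.get(key)!r}")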

@XeeKee
Collaborator

XeeKee commented Feb 23, 2024

The dataset and hyperparameters are the same as mine.

@haoyuwangwhy
Author

> The dataset and hyperparameters are the same as mine.

Interesting. I am not sure whether the problem is the GPU we use; I use an A6000. I can try some other methods.

@drd13

drd13 commented Feb 23, 2024

Could this have to do with the bug in issue #173 now being fixed? Unless the survey paper was updated, I imagine that the results for FT will be noticeably different.

@haoyuwangwhy
Author

> Could this have to do with the bug in issue #173 now being fixed? Unless the survey paper was updated, I imagine that the results for FT will be noticeably different.

I think you are correct! I will run ROME to double-check it. Thanks!

@haoyuwangwhy
Author

It seems that it is not because of the bug in issue #173. I ran the ROME experiment with this script:

python run_knowedit_llama2.py \
    --editing_method=ROME \
    --hparams_dir=../hparams/ROME/llama-7b-chat \
    --data_dir=./KnowEdit/wiki_recent/benchmark_wiki_recent_recent_test.json \
    --datatype='recent'

I obtained rewrite_acc=0.97, locality=0.22, portability=0.20.

Result file:
ROME_results.json
Partial running log:
run.log

@haoyuwangwhy
Author

> We ran our experiments on an A800 using the llama2-7b-chat model. The results we obtained are similar to those reported in the original paper. It is worth noting that our evaluation time is approximately 20 seconds per sample, whereas yours seems to be much shorter. As for why your rewrite_acc is so high, we are also not sure.

Did you use the latest version of the repository for the testing? I am not sure whether we are using the same code version.

@XeeKee
Collaborator

XeeKee commented Feb 24, 2024

Yes, we are using different versions of the code. I have been testing with the latest version of the code recently and will notify you as soon as I have the results.

@haoyuwangwhy
Author

> Yes, we are using different versions of the code. I have been testing with the latest version of the code recently and will notify you as soon as I have the results.

Hi, have you obtained the new results?

@XeeKee
Collaborator

XeeKee commented Feb 26, 2024

Recently, computing resources have been tight, so I have only rerun the "recent" subset. Indeed, there are differences:
Edit succ: 1.0, Portability: 0.6380427946836714, Locality: 0.7432813769149831.
I will gradually rerun the other datasets soon.

@haoyuwangwhy
Author

> Recently, computing resources have been tight, so I have only rerun the "recent" subset. Indeed, there are differences: Edit succ: 1.0, Portability: 0.6380427946836714, Locality: 0.7432813769149831. I will gradually rerun the other datasets soon.

Cool, we have reached a consensus! So is this the correct result, or are the results in the survey paper correct?

@XeeKee
Collaborator

XeeKee commented Feb 26, 2024

Could you add me on WeChat or any other social media platform convenient for you? We can have a detailed discussion on it.
My WeChat ID is xi1786594371

@XeeKee
Collaborator

XeeKee commented Feb 27, 2024

In "A Comprehensive Study of Knowledge Editing for Large Language Models," we used "prompt_last" as the optimization objective for both FT and FT-L. Please update to the latest version of the code and set the hyperparameter "objective_optimization" to "prompt_last" to obtain results consistent with the paper.

@XeeKee XeeKee closed this as completed Feb 27, 2024
@drd13

drd13 commented Feb 27, 2024

> Recently, computing resources have been tight, so I have only rerun the "recent" subset. Indeed, there are differences:
> Edit succ: 1.0, Portability: 0.6380427946836714, Locality: 0.7432813769149831.
> I will gradually rerun the other datasets soon.

@XeeKee As an interested third party, can I just confirm what this means for the paper? Is the number in your message the performance of FT when objective_optimization is set to target_new on Wikidata recent? And are these numbers directly comparable to those of the other methods in Table 4 of the survey paper?
