about evaluation #179

Closed
haoyuwangwhy opened this issue Feb 20, 2024 · 26 comments

Labels
question Further information is requested

Comments
@haoyuwangwhy

I train the edited model on the benchmark_wiki_recent_recent_train.json data. Is FT_results.json, the file saved after training, the evaluation result? I compute the average of the post rewrite_acc values in that file, but the result is much higher than the one reported for the implemented FT model in your survey paper. I want to know if I am misunderstanding something here. Thanks.
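
For reference, this is roughly how I compute the average (a minimal sketch; I am assuming FT_results.json is a JSON list of per-sample dicts whose "post" entry holds a rewrite_acc value, stored either as a float or as a list of floats):

import json
from statistics import mean

# Minimal sketch: average the post-edit rewrite accuracy over all samples.
# Assumes FT_results.json is a list of per-sample dicts whose "post" entry
# holds "rewrite_acc" as a float or a list of floats.
with open("FT_results.json") as f:
    results = json.load(f)

def as_scalar(value):
    # rewrite_acc may be a single float or a per-token list of floats
    return mean(value) if isinstance(value, list) else float(value)

rewrite_accs = [as_scalar(r["post"]["rewrite_acc"]) for r in results]
print("average post rewrite_acc:", mean(rewrite_accs))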

@pengzju pengzju added the question Further information is requested label Feb 20, 2024
@haoyuwangwhy
Author

Is there anyone who can solve this issue?

@zxlzr
Contributor

zxlzr commented Feb 21, 2024

Hi, sorry for the late reply. We will check this issue ASAP.

@XeeKee
Collaborator

XeeKee commented Feb 22, 2024

FT_results.json is the evaluation result. It might be due to the fact that your hyperparameters are different from ours. Besides rewrite_acc, are there significant differences in your other metric results compared to our survey paper?

@haoyuwangwhy
Author

> FT_results.json is the evaluation result. It might be due to the fact that your hyperparameters are different from ours. Besides rewrite_acc, are there significant differences in your other metric results compared to our survey paper?

Yes, the others are different as well. Actually, I downloaded and ran the code without changing any hyperparameters. I ran this script:

python run_knowedit_llama2.py \
    --editing_method=LoRA \
    --hparams_dir=../hparams/LoRA/llama-7b \
    --data_dir=./KnowEdit/benchmark_wiki_recent_recent_train.json \
    --datatype='recent'

And I obtain FT_results.json. Then I load the JSON file and compute the average post rewrite_acc. So I wonder whether I need to use the benchmark_wiki_recent_recent_test.json file for evaluation or not.

@XeeKee
Collaborator

XeeKee commented Feb 22, 2024

Our experiments were conducted on benchmark_wiki_recent_recent_test. If you are using the default parameters, note that our FT defaults only fine-tune the 21st layer, so you should compare your results with those of FT-L.

@haoyuwangwhy
Author

> Our experiments were conducted on benchmark_wiki_recent_recent_test. If you are using the default parameters, note that our FT defaults only fine-tune the 21st layer, so you should compare your results with those of FT-L.

Do you mean I should use the following script to train the model:

python run_knowedit_llama2.py \
    --editing_method=FT \
    --hparams_dir=../hparams/FT/llama-7b \
    --data_dir=./KnowEdit/benchmark_wiki_recent_recent_train.json \
    --datatype='recent'

and then run

python run_knowedit_llama2.py \
    --editing_method=FT \
    --hparams_dir=../hparams/FT/llama-7b \
    --data_dir=./KnowEdit/benchmark_wiki_recent_recent_test.json \
    --datatype='recent'

to do the evaluation directly?

@XeeKee
Collaborator

XeeKee commented Feb 22, 2024

python run_knowedit_llama2.py \
    --editing_method=FT \
    --hparams_dir=../hparams/FT/llama-7b \
    --data_dir=./KnowEdit/benchmark_wiki_recent_recent_test.json \
    --datatype='recent'

Just use the script provided above.

@haoyuwangwhy
Author

> python run_knowedit_llama2.py \
>     --editing_method=FT \
>     --hparams_dir=../hparams/FT/llama-7b \
>     --data_dir=./KnowEdit/benchmark_wiki_recent_recent_test.json \
>     --datatype='recent'
>
> Just use the script provided above.

Thank you so much!

@haoyuwangwhy
Author

> python run_knowedit_llama2.py \
>     --editing_method=FT \
>     --hparams_dir=../hparams/FT/llama-7b \
>     --data_dir=./KnowEdit/benchmark_wiki_recent_recent_test.json \
>     --datatype='recent'
>
> Just use the script provided above.

I have tried the script. However, I got the following results for the "post" key: rewrite_acc=1.0, locality=0.75, portability=0.51. It seems rewrite_acc differs significantly from your survey paper. I am using llama2-7b-hf from Hugging Face.

@XeeKee
Collaborator

XeeKee commented Feb 23, 2024

We conducted experiments using llama-2-chat, and it is very strange that the average rewrite_acc in the dataset is 1.0. I randomly sampled 10 examples and found that rewrite_acc=0.0 occurred in all of them. I will conduct a thorough investigation to determine where the problem lies.

@haoyuwangwhy
Author

> We conducted experiments using llama-2-chat, and it is very strange that the average rewrite_acc in the dataset is 1.0. I randomly sampled 10 examples and found that rewrite_acc=0.0 occurred in all of them. I will conduct a thorough investigation to determine where the problem lies.

FT_results.json
benchmark_wiki_recent_recent_test.json
Attached are the FT_results file and the dataset I used.

@XeeKee
Collaborator

XeeKee commented Feb 23, 2024

We ran our experiments on an A800 using the llama2-7b-chat model. The results we obtained are similar to those reported in the original paper. It is worth noting that our evaluation time is approximately 20 seconds per sample, whereas yours seems to be much shorter. As for why your rewrite_acc is so high, we are also not sure.

@haoyuwangwhy
Author

> We ran our experiments on an A800 using the llama2-7b-chat model. The results we obtained are similar to those reported in the original paper. It is worth noting that our evaluation time is approximately 20 seconds per sample, whereas yours seems to be much shorter. As for why your rewrite_acc is so high, we are also not sure.

Is the dataset the same as yours? I am using the following hyperparameters:
alg_name: "FT"
model_name: "./hugging_cache/llama-2-7b-chat"
device: 0

layers: [21]
num_steps: 25
batch_size: 1
max_length: 40
lr: 5e-4
weight_decay: 0
kl_factor: 0
norm_constraint: false
rewrite_module_tmp: "model.layers.{}.mlp.down_proj.weight"
layer_module_tmp: "model.layers.{}"
mlp_module_tmp: "model.layers.{}.mlp"
attn_module_tmp: "model.layers.{}.self_attn"
ln_f_module: "model.norm"
lm_head_module: "lm_head"
model_parallel: false
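
In case it helps to pin down any difference, this is a quick way I can diff my hparams file against the repo default (a minimal sketch using PyYAML; both file paths below are placeholders):

import yaml

# Minimal sketch: compare two hparams YAML files key by key and print any
# values that differ. Both paths are placeholders for local copies.
with open("my_llama-7b.yaml") as f:
    mine = yaml.safe_load(f)
with open("repo_default_llama-7b.yaml") as f:
    default = yaml.safe_load(f)

for key in sorted(set(mine) | set(default)):
    if mine.get(key) != default.get(key):
        print(f"{key}: mine={mine.get(key)!r}, default={default.get(key)!r}")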

@XeeKee
Collaborator

XeeKee commented Feb 23, 2024

The dataset and hyperparameters are the same as mine.

@haoyuwangwhy
Author

> The dataset and hyperparameters are the same as mine.

Interesting. I am not sure whether the problem is the GPU we use; I use an A6000. I can try some other methods.

@drd13

drd13 commented Feb 23, 2024

Could this have to do with the bug in issue #173 now being fixed? Unless the survey paper was updated, I imagine that the results for FT will be noticeably different.

@haoyuwangwhy
Author

> Could this have to do with the bug in issue #173 now being fixed? Unless the survey paper was updated, I imagine that the results for FT will be noticeably different.

I think you are correct! I will run ROME to double-check it. Thanks!

@haoyuwangwhy
Author

It seems that it is not because of the bug in issue #173. I ran the ROME experiment with this script:

python run_knowedit_llama2.py \
    --editing_method=ROME \
    --hparams_dir=../hparams/ROME/llama-7b-chat \
    --data_dir=./KnowEdit/wiki_recent/benchmark_wiki_recent_recent_test.json \
    --datatype='recent'

I obtained rewrite_acc=0.97, locality=0.22, portability=0.20.

Result file:
ROME_results.json
Partial running log:
run.log

@haoyuwangwhy
Author

> We ran our experiments on an A800 using the llama2-7b-chat model. The results we obtained are similar to those reported in the original paper. It is worth noting that our evaluation time is approximately 20 seconds per sample, whereas yours seems to be much shorter. As for why your rewrite_acc is so high, we are also not sure.

Did you use the latest version of the repository for the testing? I am not sure whether we are using the same code version.

@XeeKee
Collaborator

XeeKee commented Feb 24, 2024

Yes, we are using different versions of the code. I have been testing with the latest version of the code recently and will notify you as soon as I have the results.

@haoyuwangwhy
Author

> Yes, we are using different versions of the code. I have been testing with the latest version of the code recently and will notify you as soon as I have the results.

Hi, have you obtained the new results?

@XeeKee
Collaborator

XeeKee commented Feb 26, 2024

Recently, computing resources have been tight, so I have only rerun the "recent" subset. Indeed, there are differences:
Edit succ: 1.0, Portability: 0.6380427946836714, Locality: 0.7432813769149831.
I will gradually rerun the other datasets soon.

@haoyuwangwhy
Author

> Recently, computing resources have been tight, so I have only rerun the "recent" subset. Indeed, there are differences: Edit succ: 1.0, Portability: 0.6380427946836714, Locality: 0.7432813769149831. I will gradually rerun the other datasets soon.

Cool, we have reached a consensus! So is this the correct result, or are the results in the survey paper correct?

@XeeKee
Collaborator

XeeKee commented Feb 26, 2024

Could you add me on WeChat or any other social media platform convenient for you? We can have a detailed discussion on it.
My WeChat ID is xi1786594371

@XeeKee
Collaborator

XeeKee commented Feb 27, 2024

In "A Comprehensive Study of Knowledge Editing for Large Language Models," we used "prompt_last" as the optimization objective for both FT and FT-L. Please update to the latest version of the code and set the hyperparameter "objective_optimization" to "prompt_last" to obtain results consistent with the paper.

@XeeKee XeeKee closed this as completed Feb 27, 2024
@drd13

drd13 commented Feb 27, 2024

> Recently, computing resources have been tight, so I have only rerun the "recent" subset. Indeed, there are differences:
> Edit succ: 1.0, Portability: 0.6380427946836714, Locality: 0.7432813769149831.
> I will gradually rerun the other datasets soon.

@XeeKee As an interested third party, can I just confirm what this means for the paper? Is the number in your message the performance of FT when objective_optimization is set to target_new on Wikidata recent? And are these numbers directly comparable to those of the other methods in Table 4 of the survey paper?
