about evaluation #179
Comments
Is there anyone who can help with this issue?
Hi, sorry for the late reply; we will check this issue ASAP.
FT_results.json is the evaluation result. The discrepancy might be because your hyperparameters differ from ours. Besides rewrite_acc, are there significant differences between your other metric results and those in our survey paper?
Yes, the others are different as well. Actually, I downloaded and ran the code without changing any hyperparameters. I ran the script:
Our experiments were conducted on benchmark_wiki_recent_recent_test. With the default parameters, our FT fine-tunes only the 21st layer, so you should compare your results with those of FT-L.
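For reference, an FT hyperparameter file matching the setup described in this thread would look roughly like the sketch below. The field names and values other than `layers: [21]` are assumptions for illustration; check the repository's actual hparams files rather than copying this verbatim.

```yaml
# Hypothetical FT hparams sketch -- only layers: [21] is confirmed in the thread;
# every other field name/value here is an assumption for illustration.
alg_name: FT
model_name: llama-2-7b-chat   # the maintainers report using llama2-7b-chat
layers: [21]                  # default FT tunes only layer 21, hence "compare with FT-L"
```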
Do you mean I should use the following script to train the model:
Just use the script provided above.
Thank you so much!
I have tried the script. However, I got the following results under the "post" key: rewrite_acc=1.0, locality=0.75, portability=0.51. The rewrite_acc differs significantly from your survey paper. I used llama2-7b-hf from Hugging Face.
We conducted our experiments using llama-2-chat, and it is very strange that the average rewrite_acc over the dataset is 1.0. I randomly sampled 10 examples and found that rewrite_acc=0.0 occurred in all of them. I will conduct a thorough investigation to determine where the problem lies.
FT_results.json |
We ran experiments on an A800 using the llama2-7b-chat model. The results we obtained are similar to those reported in the original paper. It is worth noting that our evaluation time for each sample is approximately 20 seconds, whereas yours seems to be much smaller. As for why your rewrite_acc is so high, we are also not sure.
Is the dataset the same as yours? And I use the following hyperparameters: layers: [21]
The dataset and hyperparameters are the same as mine.
Interesting. I am not sure if the problem is the GPU we use; I use an A6000. I can try some other methods.
Could this have to do with the bug in issue #173 now being fixed? Unless the survey paper was updated, I imagine the results for FT will be noticeably different.
I think you are correct! I will run ROME to double-check it. Thanks!
It seems that it is not because of the bug in issue #173. I ran the ROME experiment with the script: Result file:
Did you use the latest version of the repository for your testing? I am not sure whether we are using the same code version.
Yes, we are using different versions of the code. I have been testing with the latest version recently and will notify you as soon as I have the results.
Hi, have you obtained the new result?
Recently, computing resources have been tight, so I have only run the "recent" portion. Indeed, there are differences.
Cool, we've reached a consensus! So is this the correct result, or are the results in the survey paper correct?
Could you add me on WeChat or another social media platform that is convenient for you? We can discuss it in detail there.
In "A Comprehensive Study of Knowledge Editing for Large Language Models," we used "prompt_last" as the optimization objective for both FT and FT-L. Please update to the latest version of the code and set the hyperparameter "objective_optimization" to "prompt_last" to obtain results consistent with the paper.
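Concretely, that change amounts to one line in the FT hparams file. The surrounding file layout is an assumption; only the field name and value come from the comment above.

```yaml
# Set in the FT hyperparameter file (file layout is an assumption):
objective_optimization: prompt_last   # matches the survey paper's FT / FT-L setup
```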
@XeeKee: As an interested third party, can I just confirm what this means for the paper? Is the number in your message the performance of FT when
I trained the edited model on the benchmark_wiki_recent_recent_train.json data. Is FT_results.json, saved after training, the evaluation result? I computed the average of the post rewrite_acc values in the file, but the result is much higher than that in your survey paper (the implemented FT model). I want to know if I am misunderstanding something here. Thanks.
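For anyone repeating this check, a minimal sketch of averaging a "post" metric over FT_results.json is below. The file schema (a JSON list of records, each with a "post" dict whose metric values are scalars or lists) is an assumption inferred from the thread; adjust the accessors to match your actual file.

```python
import json

def average_post_metric(path, metric="rewrite_acc"):
    """Average a metric under the "post" key across all edit records.

    Assumes the results file is a JSON list shaped like
    [{"post": {"rewrite_acc": [1.0], ...}, ...}, ...].
    This schema is an assumption, not confirmed by the repository.
    """
    with open(path) as f:
        records = json.load(f)
    vals = []
    for rec in records:
        v = rec["post"][metric]
        # A metric may be stored as a list of per-token/per-prompt scores
        # or as a single scalar; average lists, cast scalars.
        vals.append(sum(v) / len(v) if isinstance(v, list) else float(v))
    return sum(vals) / len(vals)
```

Comparing this per-file average against the survey paper's table (rather than eyeballing individual records) makes it easier to spot whether the gap comes from the metric aggregation or from the edit itself.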