Is the tot_counter saved twice in this code snippet? #9

Open
haiqiang2017 opened this issue May 7, 2024 · 4 comments

Comments

@haiqiang2017

tot_counter = Counter()
for counter in tqdm(all_counters):
    tot_counter.update(counter)

with open("/scratch/tot_image_urls_in_web_document_dataset_filtered.pickle", "wb") as f:
    pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)

command_sync_s3 = (
    "aws s3 cp /scratch/tot_image_urls_in_web_document_dataset_filtered.pickle"
    " s3://m4-datasets/webdocs/tot_image_urls_in_web_document_dataset_filtered.pickle"
)
os.system(command_sync_s3)
os.system(command_sync_s3)
os.system(command_sync_s3)

tot_image_urls_in_web_document_dataset_filtered_too_duplicated = [
    k for k, v in tot_counter.items() if v > THRESHOLD_TOO_DUPLICATED
]

with open("/scratch/tot_image_urls_in_web_document_dataset_filtered_too_duplicated.pickle", "wb") as f:
    pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)
   
   
Is tot_counter saved twice in this code snippet? The second pickle.dump writes tot_counter again, and tot_image_urls_in_web_document_dataset_filtered_too_duplicated is never used.

@HugoLaurencon
Contributor

From which file did you get this?

@haiqiang2017
Author

@HugoLaurencon The code is from [OBELICS]main/build_obelics/06_02_merge_sets_image_urls_in_webdocs.py

@HugoLaurencon
Contributor

Yes, you should probably replace tot_counter with tot_image_urls_in_web_document_dataset_filtered_too_duplicated in the second occurrence (the second pickle.dump call).
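
For reference, here is a minimal sketch of what that change could look like. It is not the actual OBELICS script: the merge_and_save helper, the temporary output directory (instead of /scratch), the toy counters, and the THRESHOLD_TOO_DUPLICATED value are all assumptions made so the example runs on its own, and the S3 upload is left out.

```python
import pickle
import tempfile
from collections import Counter
from pathlib import Path

# Hypothetical value for illustration; the real threshold is set in the OBELICS script.
THRESHOLD_TOO_DUPLICATED = 2


def merge_and_save(all_counters, out_dir):
    """Merge the per-shard counters, then save the full counter and the
    list of URLs that occur more than THRESHOLD_TOO_DUPLICATED times."""
    tot_counter = Counter()
    for counter in all_counters:
        tot_counter.update(counter)

    too_duplicated = [k for k, v in tot_counter.items() if v > THRESHOLD_TOO_DUPLICATED]

    out_dir = Path(out_dir)
    counter_path = out_dir / "tot_image_urls_in_web_document_dataset_filtered.pickle"
    dup_path = out_dir / "tot_image_urls_in_web_document_dataset_filtered_too_duplicated.pickle"

    # First file: the merged counter itself.
    with open(counter_path, "wb") as f:
        pickle.dump(tot_counter, f, pickle.HIGHEST_PROTOCOL)

    # Second file: the filtered list, not tot_counter again (this is the fix).
    with open(dup_path, "wb") as f:
        pickle.dump(too_duplicated, f, pickle.HIGHEST_PROTOCOL)

    return counter_path, dup_path


# Toy usage with made-up URLs, written to a temporary directory.
toy_counters = [Counter({"url_a": 2, "url_b": 1}), Counter({"url_a": 1, "url_c": 1})]
with tempfile.TemporaryDirectory() as tmp_dir:
    counter_path, dup_path = merge_and_save(toy_counters, tmp_dir)
    with open(counter_path, "rb") as f:
        saved_counter = pickle.load(f)
    with open(dup_path, "rb") as f:
        saved_too_duplicated = pickle.load(f)
```

With this change the two pickle files hold different objects: the first the merged Counter, the second only the URLs above the threshold.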

@haiqiang2017
Author

Thanks, that change solves the problem. @HugoLaurencon
