Crawler indexes all pages on a particular domain rather than pages under a path #6

Closed
Amolith opened this issue Feb 18, 2022 · 1 comment

Comments

@Amolith

Amolith commented Feb 18, 2022

When running Lieu over all the sites in the fediring, we've found that the crawler is only bound by domain rather than by domain+path. This causes quirks with static site hosts like cronut.cafe; the only cronut.cafe user who's also a member of the ring is ~sfr, but multiple other users who aren't members have been indexed as well: https://search.fediring.net/?q=cronut

I think a good solution might be keeping track of not only the domain that's being crawled but also the original URL and ignoring links to parent directories.
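
A rough sketch of what that check could look like, in Go since that's what Lieu is written in. This is not Lieu's actual crawler code: `inScope` and `mustParse` are hypothetical names, `~other` is a made-up user, and the sketch assumes the seed URL points at a directory (e.g. `https://cronut.cafe/~sfr/`) rather than at a single file. Each link is resolved against the page it was found on, then kept only if it stays on the seed's host and under the seed's path.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// mustParse is a small helper for the example; it panics on a malformed URL.
func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

// inScope reports whether href, found on page, stays on the seed's host
// and under the seed's path. Assumption: seed.Path names a directory.
func inScope(seed, page *url.URL, href string) bool {
	ref, err := url.Parse(href)
	if err != nil {
		return false
	}
	// ResolveReference makes the link absolute and collapses "../" segments.
	target := page.ResolveReference(ref)
	if !strings.EqualFold(target.Host, seed.Host) {
		return false
	}
	// An empty prefix (seed is the site root) keeps the whole domain in scope.
	prefix := strings.TrimSuffix(seed.Path, "/")
	return target.Path == prefix || strings.HasPrefix(target.Path, prefix+"/")
}

func main() {
	seed := mustParse("https://cronut.cafe/~sfr/")
	page := mustParse("https://cronut.cafe/~sfr/posts/")

	links := []string{
		"hello.html",                 // child of the current page
		"../about.html",              // sibling, still under the seed path
		"../../~other/",              // climbs above the seed path
		"/",                          // site root
		"https://example.org/~sfr/",  // different host
	}
	for _, href := range links {
		fmt.Printf("%-28s in scope: %v\n", href, inScope(seed, page, href))
	}
}
```

With a seed at the site root the prefix is empty, so every path on the domain still passes; only ring members listed with a sub-path would be restricted.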

@cblgh
Owner

cblgh commented Feb 20, 2022

@Amolith thanks for the issue! the thing i went with initially was the notion of filtered sites, filtering out webring domains which appeared to crowd out overall useful results. I'll look into changing things to move away from strictly using domains :)
