Crawler indexes all pages on a particular domain rather than pages under a path #6

Amolith · 2022-02-18T16:34:38Z

When running Lieu over all the sites in the fediring, we've found that the crawler is bound only by domain rather than by domain+path. This causes quirks with static site hosts like cronut.cafe; the only cronut.cafe user who's also a member of the ring is ~sfr, but multiple other users who aren't members have been indexed as well: https://search.fediring.net/?q=cronut

I think a good solution might be keeping track of not only the domain that's being crawled but also the original URL and ignoring links to parent directories.
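
To make the idea concrete, here is a rough sketch of what such a scope check could look like. It is not Lieu's code; the function names `in_scope` and `_as_dir` are made up for illustration, and the URL shape `https://cronut.cafe/~sfr/` is an assumption about how that host lays out user sites. It also assumes links have already been resolved to absolute URLs and that the seed URL points at a directory rather than a single file.

```python
import posixpath
from urllib.parse import urlsplit


def _as_dir(path: str) -> str:
    """Normalise a URL path (collapsing '..' and '.') and end it with '/'."""
    norm = posixpath.normpath(path or "/")
    return norm if norm.endswith("/") else norm + "/"


def in_scope(seed_url: str, link: str) -> bool:
    """Return True if `link` is on the seed's host and at or below the seed's path.

    Hypothetical helper: the seed URL is treated as a directory, so links to
    parent directories or sibling users on the same host are rejected.
    """
    seed = urlsplit(seed_url)
    target = urlsplit(link)
    # Different host: out of scope regardless of path.
    if seed.hostname != target.hostname:
        return False
    # Same host: the link's normalised path must start with the seed's path.
    return _as_dir(target.path).startswith(_as_dir(seed.path))


seed = "https://cronut.cafe/~sfr/"  # assumed URL shape for the ring member
print(in_scope(seed, "https://cronut.cafe/~sfr/posts/a.html"))
print(in_scope(seed, "https://cronut.cafe/~other/"))  # "~other" is a made-up non-member
```

Comparing against a prefix that ends in `/` keeps a sibling such as `/~sfrx/` from matching `/~sfr/`, and normalising the path first means a link like `/~sfr/../~other/` can't slip through.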

cblgh · 2022-02-20T14:45:45Z

@Amolith thanks for the issue! the thing i went with initially was the notion of filtered sites, filtering out webring domains which appeared to crowd out overall useful results. I'll look into changing things to move away from strictly using domains :)

cblgh added a commit that referenced this issue Mar 7, 2022

improve crawling rules wrt path-suffixed sites, close #6

acb6814

cblgh closed this as completed in 9f912b8 Mar 7, 2022
