fix: escape bs4 parsing error #271

Merged: 1 commit, Jul 15, 2023

Conversation

@cachho (Contributor) commented Jul 14, 2023

I have previously argued against escaping (catching and suppressing) errors. Bulk jobs like the sitemap are different, though: you don't want the whole sitemap to stop processing because one page failed.

This PR escapes Beautiful Soup's ParserRejectedMarkup error. I found that it occurs when a file attachment URL makes it into the sitemap, so the response is the attachment itself rather than an HTML file.

Example output:

2023-07-15 00:50:30,466 [root] [ERROR] Failed to parse https://example.com/verify.png: The markup you provided was rejected by the parser. Trying a different parser or a different encoding may help.

Original exception(s) from parser:
 unknown status keyword 'h' in marked section
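
The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `load_sitemap_pages`, `load_page`, and `skip_on` are hypothetical names, and the page loader is injected so the sketch runs without network access or third-party packages.

```python
import logging
from typing import Callable, Iterable, List, Tuple, Type

logger = logging.getLogger(__name__)


def load_sitemap_pages(
    urls: Iterable[str],
    load_page: Callable[[str], str],
    skip_on: Tuple[Type[Exception], ...],
) -> List[str]:
    """Load every URL, logging and skipping pages that fail to parse.

    All names here are hypothetical; only the pattern matters.
    """
    results: List[str] = []
    for url in urls:
        try:
            results.append(load_page(url))
        except skip_on as exc:
            # Only the listed exception types are swallowed. Anything else
            # still propagates and stops the job, so real bugs stay visible.
            logger.error("Failed to parse %s: %s", url, exc)
    return results
```

With Beautiful Soup, `skip_on` would presumably be `(ParserRejectedMarkup,)`, imported from `bs4.builder`, and `load_page` would fetch the URL and pass the response body to `BeautifulSoup`. Keeping the caught types narrow is the point: one rejected attachment is logged and skipped, while unrelated failures are still raised.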

@taranjeet taranjeet merged commit cd0c7bc into mem0ai:main Jul 15, 2023
@taranjeet (Member)

Good catch @cachho
cc @aaishikdutta

@aaishikdutta (Contributor)

Good catch @cachho

@cachho cachho deleted the fix/SitemapParsingError branch July 15, 2023 08:33
cachho added a commit to cachho/embedchain that referenced this pull request Jul 26, 2023
