Improve extraction for non-defanged URLs #61

Closed
battleoverflow opened this issue Jan 25, 2023 · 6 comments

@battleoverflow
Contributor

"while it seems like the bug originally referenced in this issue is fixed in the new version, the one I commented above still exists. Defanged IPs still get extracted by extract_urls while their non-defanged counterparts don't"

Issue comment: #34 (comment)

@battleoverflow battleoverflow self-assigned this Jan 25, 2023
@battleoverflow battleoverflow added the bug Something isn't working label Jan 25, 2023
@luis261

luis261 commented Jan 25, 2023

Thanks for taking my comment into account! Hopefully this can be fixed (:

@battleoverflow
Contributor Author

Hi, @luis261!

I finally got a second to look over the issue. Your comment was absolutely valuable, but time is unfortunately limited, so I wasn't able to really look into it until now. A solution is currently in testing and will be available in the next release. I've included a few examples with comments below.

You may notice a new parameter: defang_data. It lets you defang a URL or IP address that isn't already defanged directly during extraction. I still have some things to prepare before this release is ready, but I'm planning for this week. I'll make another comment on this thread once it's available for download!

```python
import iocextract

data = [
    "1.1.1.1",
    "1[.]1[.]1[.]1",
    "domain.com",
    "domain[.]com"
]

for d in data:
    # Everything should be refanged
    print(list(iocextract.extract_urls(d, refang=True, no_scheme=True)))

    # Half should be defanged, half should be normal (defang_data defaults to false)
    print(list(iocextract.extract_urls(d, refang=False, no_scheme=True)))

    # Everything should be defanged
    print(list(iocextract.extract_urls(d, refang=False, no_scheme=True, defang_data=True)))
```
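
For anyone following along who is unsure what "defanged" and "refanged" mean in the comments above, here is a minimal standalone sketch. The `defang` and `refang` helpers are hypothetical and written only for illustration; they are not iocextract's implementation, and the library's own output format may differ (for example in which dots it brackets).

```python
def refang(value: str) -> str:
    """Undo one common defang convention: [.] -> . and hxxp -> http."""
    value = value.replace("[.]", ".")
    if value.startswith("hxxp"):
        value = "http" + value[4:]
    return value


def defang(value: str) -> str:
    """Bracket every dot and rewrite http as hxxp.

    The input is refanged first so that already-defanged values
    are not bracketed a second time.
    """
    value = refang(value)
    if value.startswith("http"):
        value = "hxxp" + value[4:]
    return value.replace(".", "[.]")


if __name__ == "__main__":
    for item in ["1.1.1.1", "1[.]1[.]1[.]1", "domain.com", "domain[.]com"]:
        print(item, "->", defang(item), "->", refang(defang(item)))
```

Refanging before defanging keeps the helper idempotent, which mirrors the "everything should be defanged" case regardless of how the input arrived.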

@luis261

luis261 commented Feb 22, 2023

@azazelm3dj3d Alright, thanks for keeping me updated! Once the new release is out I will check out the new behavior of extract_urls

@battleoverflow
Contributor Author

The new version is now available: https://pypi.org/project/iocextract/1.14.1/

@luis261

luis261 commented Feb 27, 2023

Alright, I verified the behavior you described in your comment. However, the fundamental issue of extract_urls pulling in IPs still exists; it now even seems to be the universal behavior (as opposed to occurring only in certain edge cases). That is not what I'd expect after reading the documentation, considering that extract_ips exists as well, and extract_urls is described in the documentation as extracting URLs (IPs are not mentioned).
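
Until the behavior or the documentation changes, one possible caller-side workaround is to separate bare IP addresses from the rest of the extracted values. The sketch below uses only the standard library's `ipaddress` module; `is_bare_ip` and `split_ips_from_urls` are hypothetical helper names, not part of iocextract, and the bracket handling assumes the `[.]` defang style seen in this thread.

```python
import ipaddress


def is_bare_ip(candidate: str) -> bool:
    """Return True if candidate is an IP address with no scheme, port, or path."""
    # Assumption: defanged values use the "[.]" convention shown in this thread.
    refanged = candidate.replace("[.]", ".")
    try:
        ipaddress.ip_address(refanged)
    except ValueError:
        return False
    return True


def split_ips_from_urls(extracted):
    """Partition extracted strings into (bare IPs, everything else)."""
    ips, urls = [], []
    for item in extracted:
        (ips if is_bare_ip(item) else urls).append(item)
    return ips, urls


if __name__ == "__main__":
    sample = ["1.1.1.1", "1[.]1[.]1[.]1", "domain.com", "http://1.1.1.1/path"]
    print(split_ips_from_urls(sample))
```

In practice the input list would come from something like `list(iocextract.extract_urls(text, no_scheme=True))`. A URL whose host happens to be an IP, such as `http://1.1.1.1/path`, is deliberately left in the URL bucket because it carries a scheme and path.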

@battleoverflow
Contributor Author

Definitely a good note for the future. Due to the repository not having too many outstanding issues relative to other open-source initiatives, I haven't taken much time to review the actual documentation and how thorough (or accurate) it is. I do have it on my backlog, but no issue assignment, so I just took care of that. Thank you for bringing that to my attention.

Issue: #65
