fix: improve email regex #180
Looks cool |
Thanks for your contribution! Can you explain with examples what your changes to the regex do and why they are necessary? |
As soon as I get a response from you, I will review the changes and if everything fits, I will merge it. |
@fabian-hiller
You had put FIXME comments in, and I wanted to remove them, so I fixed the regex.
The email regex of valibot (
And I modified the email regex (
Please let me know if I can give you any other information 🙏 |
Thank you for the details. I will try to review the PR over the weekend.
Why is this necessary? |
@fabian-hiller
This is the regex valibot has now, and I don't think it's necessary. |
Zod recently changed its email regex; we should take a look at that. Maybe the PR at Zod is similar to this one. Is there any reason not to use the regex from Zod? |
My humble opinion: colinhacks/zod#2157 introduced the vulnerable regex. The zod regex has not been battle-tested by anyone other than zod users. I realize zod is popular, but email regexes have been around for ages. I get the motivation behind writing a simpler regex yourself, but why reinvent the wheel?
And last but not least, there is Stack Overflow, where people will show up to comment on why the regex in the answer does not fit the spec: https://stackoverflow.com/questions/201323/how-can-i-validate-an-email-address-using-a-regular-expression. I want to highlight the HTML5 spec regex that browsers have implemented:
So if you validate a |
Thank you a lot for this hint! I will think about it and investigate the official HTML email regex. |
I have now investigated it and decided against the official HTML regex, because with this one the following email addresses are valid:
I don't think this makes sense in practice. I like the regex of this PR. I would only remove the extensions you added as they are not needed 99% of the time. |
@fabian-hiller you introduced an exponential regex (which is why the 'extensions' are needed), which can be abused for a DoS: https://devina.io/redos-checker I want to emphasize again why it does not make sense to use a handwritten regex. It might be shorter, saving a few bytes, but at what expense?
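To make the DoS concern concrete, here is a minimal sketch of catastrophic backtracking. It uses the textbook pattern `^(a+)+$`, which is an assumption chosen for illustration and is NOT the regex from this PR; `timeMatch` is a hypothetical helper. In a backtracking regex engine, the time to reject a near-miss input roughly doubles with each extra character.

```typescript
// Textbook nested-quantifier pattern (illustration only, not valibot's regex).
const NESTED_QUANTIFIER: RegExp = /^(a+)+$/;

// Hypothetical helper: run the match and report how long it took.
function timeMatch(
  pattern: RegExp,
  input: string
): { matched: boolean; ms: number } {
  const start = Date.now();
  const matched = pattern.test(input);
  return { matched, ms: Date.now() - start };
}

// Inputs are kept short so the demo finishes quickly; the trailing "!"
// forces the engine to try every way of splitting the run of "a"s.
for (const n of [8, 12, 16, 20]) {
  const input = "a".repeat(n) + "!";
  const { matched, ms } = timeMatch(NESTED_QUANTIFIER, input);
  console.log(`n=${n} matched=${matched} ms=${ms}`);
}
```

The absolute timings depend on the runtime, so none are quoted here; the point is the growth rate, which is what tools like the linked ReDoS checker look for.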
I really want to emphasize my main point again: why hand-roll a regex when two or even three trustworthy sources have already crafted a relatively short one: O'Reilly, Regular-Expressions.Info and even the spec authors?
When checking emails, please always think about: regexes don't send emails.
However, I do think this is a teaching issue, not something validators should discriminate (e.g. teach that an email could contain |
A small additional thing: valibot does not seem like it's using any eslint plugins. I'd strongly recommend using https://ota-meshi.github.io/eslint-plugin-regexp/rules/ as a lot can go wrong with regexes; that plugin has rules catching such DoS vulnerable regexes. And maybe also https://github.com/eslint-community/eslint-plugin-security for general security stuff that should not make it into a validator lib. |
Thanks for the hint. I forgot to check that for this regex. I currently lean toward using the new regex from Zod. I understand your point, but I think that in 99% of cases this regex does what the user expects. In special cases, for example, the w3.org regex can be used with our const EmailSchema = string([regex(/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/)]); I will have a look at the ESLint plugins in the next few days. If you like, you can also create a PR. |
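For readers who want to see what the w3.org-style regex quoted above accepts, here is a small standalone sketch. It deliberately avoids valibot's API so it runs on its own; `isW3Email` is a hypothetical helper name, the `/` in the character class is escaped for safety, and the sample inputs are my own choices rather than the addresses discussed in this thread.

```typescript
// The pattern from the comment above, with a straight apostrophe and an
// escaped "/" inside the character class.
const W3_EMAIL_REGEX: RegExp =
  /^[a-zA-Z0-9.!#$%&'*+\/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/;

// Hypothetical helper wrapping the regex test.
function isW3Email(input: string): boolean {
  return W3_EMAIL_REGEX.test(input);
}

// Illustrative inputs (assumptions, not taken from the thread).
const samples: string[] = [
  "user@example.com", // ordinary address
  "user@localhost", // no dot in the domain part
  "user@-example-.com", // hyphens at the label boundaries
  "user name@example.com", // space in the local part
];

for (const sample of samples) {
  console.log(`${sample} -> ${isW3Email(sample)}`);
}
```

Because the dotted group is optional and the label class includes `-`, this pattern accepts dot-less domains and labels that start or end with a hyphen, which is the kind of permissiveness being weighed in this discussion.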
Co-authored-by: Macs Dickinson <[email protected]>
If we come up with a good name, I can also imagine adding another email validation function with an official regex. The question is whether to use w3.org or whatwg.org. Do you have an opinion? As a name we could call the function |
But what do users expect? The examples you've listed above are covered by the Regex book from O'Reilly, too. And the regex could be modified to prevent I really see literally no advantage in using that regex.
Sure. Do you have any other eslint plugins in mind you'd like to use or just the regexps one (and maybe the security one)?
specEmail + documenting it's the regex from |
Thanks again for this insight. I will review the O'Reilly regex in the next days and may switch to it. Would you still add a I am only familiar with the basics of linting plugins, so I don't have an opinion at the moment. You are welcome to make a suggestion with a PR. |
I can imagine using the regex from O'Reilly with a few changes.
I would limit the special characters
Also, I would prefer to prevent a domain name from starting and ending with a hyphen.
Then I would remove the restriction on the maximum length of the domain, because as far as I know there are now TLDs with more characters.
Feel free to share your thoughts on this. |
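The hyphen rule proposed above can be sketched as follows. This is only an illustration under my own assumptions: `LABEL` and `isValidDomain` are hypothetical names, the pattern is not the one that was eventually merged, and it applies no length cap, in line with the suggestion to drop the domain length restriction.

```typescript
// One domain label: must start and end with an alphanumeric character;
// hyphens are only allowed in the interior.
const LABEL: RegExp = /^[a-zA-Z0-9](?:[a-zA-Z0-9-]*[a-zA-Z0-9])?$/;

// Hypothetical helper: a domain is valid if every dot-separated label is.
function isValidDomain(domain: string): boolean {
  return domain.split(".").every((label: string) => LABEL.test(label));
}

for (const domain of ["example.com", "my-site.example", "-example.com", "example-.com"]) {
  console.log(`${domain} -> ${isValidDomain(domain)}`);
}
```

Splitting on dots first keeps each regex test simple; a single combined regex would work too, at the cost of readability.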
Fully agree, also a nice catch!
Great catch: https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Web_mechanics/What_is_a_domain_name -> 63 is max length:
by using regexp-tree optimizer, I got it to:
(I was curious if using the Here is the full list of TLDs: https://data.iana.org/TLD/tlds-alpha-by-domain.txt, looks like it also contains
this one is the w3c regex: I still see value in adding a regex that allows more emails, because some people might not insert emails into SQL statements, but rather forward it to a mail client. Imho it's up to a mail client to reject wrong emails if a browser also accepts them. However, we could also add the full W3C regex instead (if we find one that is working with JS). There might be downsides of it:
So... maybe a trade-off would be using O'Reilly before and the spec part after @ ? e.g.
That also captures By the way, I think we really got a good regex now! |
Thanks again for your research. I like your result! Do you want to create a PR or should I commit the regex directly with you as co-author?
Haha, I also compared that 😅 For |
Co-author is perfectly fine for me
will do |
Co-authored-by: Jacob Groß <[email protected]>
The regex will go live with the next update. For info, I decided not to add the domain length limit, since the specification is so far away from real TLDs that I don't see any advantage in it. It also does not help to limit the total length. You have to use |
@fabian-hiller @kurtextrem @fabian-hiller |
No problem! |
Fine to me, but the max length of the TLD could be checked by the regex (via
🤝 |
Improves the email regex so that all tests (including commented-out tests) in library/src/validations/email/email.test.ts pass.
To achieve this improvement, I changed the regex logic from a deny-list strategy to an allow-list strategy, because with a deny-list strategy, special characters (such as Hiragana, Chinese characters, Hangul characters and so on) are very difficult to validate. I think there are too many of them to list.
The original allow-list regex logic is copied from zod, and I modified it a little bit.
https://github.com/colinhacks/zod/blob/f59be093ec21430d9f32bbcb628d7e39116adf34/src/types.ts#L567-L568
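The difference between the two strategies can be sketched like this. Both patterns are simplified assumptions written for illustration; neither is the regex from valibot or zod, and `DENY_LIST` and `ALLOW_LIST` are hypothetical names.

```typescript
// Deny-list: any character is accepted unless it is whitespace or "@".
const DENY_LIST: RegExp = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Allow-list: only explicitly enumerated ASCII characters are accepted.
const ALLOW_LIST: RegExp =
  /^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)+$/;

const inputs: string[] = [
  "user@example.com", // plain ASCII
  "あいう@example.com", // Hiragana in the local part
  "user@exam ple.com", // whitespace in the domain
];

for (const input of inputs) {
  console.log(
    `${input} -> deny-list: ${DENY_LIST.test(input)}, allow-list: ${ALLOW_LIST.test(input)}`
  );
}
```

With a deny list, every unwanted script or symbol has to be named explicitly to be rejected; with an allow list, anything not enumerated is rejected by default, which is why the Hiragana address only fails the second pattern.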