fix: improve email regex #180

kazizi55 · 2023-09-23T10:14:52Z

Improves email regex so that all tests (including commented-out tests) in library/src/validations/email/email.test.ts pass.

To achieve this improvement, I changed the regex logic from deny list strategy to allow list strategy, because with deny list strategy, special characters (such as Hiragana, Chinese characters, Hangul characters and so on) are so difficult to validate. I think they are too many to list.

The original allow list strategy regex logic is copied from zod, and I modified a littile bit.
https://github.com/colinhacks/zod/blob/f59be093ec21430d9f32bbcb628d7e39116adf34/src/types.ts#L567-L568

netlify · 2023-09-23T10:14:56Z

✅ Deploy Preview for valibot ready!

Name	Link
🔨 Latest commit	`2bb7428`
🔍 Latest deploy log	https://app.netlify.com/sites/valibot/deploys/651f97337d84ce0008d3d8e4
😎 Deploy Preview	https://deploy-preview-180--valibot.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

FleetAdmiralJakob · 2023-09-23T10:31:08Z

Looks cool

fabian-hiller · 2023-09-23T17:47:38Z

Thanks for your contribution! Can you explain with examples what your changes to the regex do and why they are necessary?

fabian-hiller · 2023-09-25T16:08:02Z

As soon as I get a response from you, I will review the changes and if everything fits, I will merge it.

kazizi55 · 2023-09-25T17:19:57Z

@fabian-hiller
Thank you for your quick response, and I apologize for my lack of explanation 🙏

why they are necessary?

Just because you put FIXME comment in library/src/validations/email/email.test.ts.
https://github.com/fabian-hiller/valibot/pull/180/files#diff-5749fd906e956e45be3645c6d1f4a32b5872bd88160a7364b775c6feb0f2d61eL43-L47

I just wanted to remove these FIXME comments, so I fixed the regex.

Can you explain with examples what your changes to the regex

The email regex of valibot (^(([^<>()[\].,;:\s@"]+(\.[^<>()[\].,;:\s@"]+)*)|(".+"))@(([^<>()[\].,;:\s@"]+\.)+[^<>()[\].,;:\s@"]{2,})$/i) is now like this:

user name
- There are 2 patterns
  - one
    - The characters below are prohibited at the beginning of user name
      - <>()[\].,;:\s@"
    - The characters below are prohibited in user name
      - <>()[\],;:\s@"
        
        the difference between this and above characters is .. Only . can be used in user name.
  - the other
    - If quoted by double quotation marks, everything is valid
domain name
- The characters below are prohibited before .
  - <>()[\].,;:\s@"
- Except prohibited characters, characters + . pattern can be repeated
- The characters below are prohibited after .
  - <>()[\].,;:\s@"
- there must be more than 2 characters after .

And I modified the email regex (^(“|[A-Z0-9_+-]+\.?)*[A-Z0-9_+-]”?@(([A-Z0-9][A-Z0-9-]*\.)+[A-Z]{2,}|([0-9]{3}\.[0-9]{3}\.[0-9]{3}\.[0-9]{3}))) like this:
(Since I copied the logic from Zod, I quoted the sentences from colinhack's blog post)

user name
- can use these characters: [a-zA-Z0-9-._]
- can't have two dots in a row
- can't start or end with a dot
- can use left and right double quotation marks
  - I added this logic to Zod logic so that this test can be passed
domain name
- There are 2 patterns
  - one
    - can use these characters: [a-zA-Z0-9-.]
    - must have one or more dots in it
    - can't have two dots in a row
    - can't start or end with a dot
    - can't start or end with a hyphen
    - must end with a "TLD" that's 2+ letters, only [a-zA-Z] allowed
  - the other
    - can use 3 digits * 4 pattern
      - I added this log to Zod logic, so that this test can be passed

Please let me know if I can give you any other information 🙏

fabian-hiller · 2023-09-26T15:24:37Z

Thank you for the details. I will try to review the PR over the weekend.

If quoted by double quotation marks, everything is valid

Why is this necessary?

kazizi55 · 2023-09-28T00:36:12Z

@fabian-hiller
Thank you for checking 🙏

If quoted by double quotation marks, everything is valid
Why is this necessary?

This is the regex valibot has now, and I don't think it's necessary.
So I deleted this logic in this PR.

library/src/validations/email/email.ts

fabian-hiller · 2023-10-03T20:51:06Z

Zod has recently changed the email regex, we should take a look at that. Maybe the PR at Zod is similar to this one. Is there anything against using the regex from Zod?

kurtextrem · 2023-10-04T08:11:01Z

My humble opinion: colinhacks/zod#2157 introduced the vulnerable regex. The zod regex is not battle-tested apart from zod users. I realize zod is popular, but email regexes have been around since ages. I get the motivation behind self-writing the regex to be simpler, but why re-invent the wheel?

lists enough options https://www.oreilly.com/library/view/regular-expressions-cookbook/9781449327453/ch04s01.html, e.g.

^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$`

https://www.regular-expressions.info/email.html I really love that site for regex tips. That is probably the most advanced regex resource out there:

We get a more practical implementation of RFC 5322 if we omit IP addresses, domain-specific addresses, the syntax using double quotes and square brackets. It will still match 99.99% of all email addresses in actual use today.

\A[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\z

And last but not least StackOverflow where all the people will come to comment why the regex in the answer does not fit the spec: https://stackoverflow.com/questions/201323/how-can-i-validate-an-email-address-using-a-regular-expression. I want to highlight the HTML5 spec regex, browsers have implemented:

^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$

So if you validate a <input type=email field, only that regex will reflect the validation of that input (see https://rgxdb.com/r/5PUY334X). And I think, especially looking back at zod#2157, makes the most sense: valibot is probably used for validating such inputs. Someone in the comments of the Zod PR also mentioned the HTML5 spec regex, so I'm guessing it was an oversight to stick with the newly crafted one. The PR also mentions why the zod regex is bad: emails like O'[email protected] will validate to false in zod.

fabian-hiller · 2023-10-05T01:28:30Z

Thank you a lot for this hint! I will think about it and investigate the official HTML email regex.

fabian-hiller · 2023-10-06T05:01:10Z

I have now investigated it and decided against the official HTML regex, because with this one the following email addresses are valid:

I don't think this makes sense in practice. I like the regex of this PR. I would only remove the extensions you added as they are not needed 99% of the time.

kurtextrem · 2023-10-06T07:16:14Z

@fabian-hiller you introduced an exponential regex (which is why the 'extensions' are needed), which can be abused for a DoS: https://devina.io/redos-checker

I want to emphasize again why it does not make sense to use a handwritten regex. It might be shorter, saving a few bytes, but at what expense?

Valid emails such as O'[email protected] are not allowed
Valid inputs from browsers are rejected server side, leading to a confusing behaviour
- I get your point of invalid emails being valid, but Chromium and other big vendors see this as ok - why don't we?
- What about introducing 2 regexes, one that fits input[email] validation and one that is closer to the spec (disallowing some of the invalid emails)

I really want to emphasize my main point again: Why hand roll a Regex, when two or even 3 trustable resources crafted a relatively short regex already, O'Reilly, Regular-Expressions.Info and even the spec authors.

/^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$/ // O'Reilly
/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/ // web spec

When checking emails, please always think about: regexes don't send emails.
Regular-Expressions.info also lists regexes that don't capture ' emails:

An important trade-off in all these regexes is that they only allow English letters, digits, and the most commonly used special symbols. The main reason is that I don’t trust all my email software to be able to handle much else. Even though John.O'[email protected] is a syntactically valid email address, there’s a risk that some software will misinterpret the apostrophe as a delimiting quote. Blindly inserting this email address into an SQL query, for example, will at best cause it to fail when strings are delimited with single quotes and at worst open your site up to SQL injection attacks.

However, I do think this is a teaching issue, not something validators should discriminate (e.g. teach that an email could contain ').

kurtextrem · 2023-10-06T07:32:09Z

A small additional thing: valibot does not seem like it's using any eslint plugins. I'd strongly recommend using https://ota-meshi.github.io/eslint-plugin-regexp/rules/ as a lot can go wrong with regexes; that plugin has rules catching such DoS vulnerable regexes. And maybe also https://github.com/eslint-community/eslint-plugin-security for general security stuff that should not make it into a validator lib.

fabian-hiller · 2023-10-06T16:51:26Z

Thanks for the hint. I had forgotten to check that for this regex. Then I currently tend to use the new regex from Zod. I understand your point, but I think that in 99% of cases this regex does what the user expects. In special cases, for example, the w3.org regex can be used with our regex function:

const EmailSchema = string([regex(/^[a-zA-Z0-9.!#$%&’*+/=?^_`{|}~-]+@[a-zA-Z0-9-]+(?:\.[a-zA-Z0-9-]+)*$/)]);

I will have a look at the ESLint plugins in the next days. If you like you can also create a PR.

Co-authored-by: Macs Dickinson <[email protected]>

fabian-hiller · 2023-10-06T17:00:40Z

If we come up with a good name, I can also imagine adding another email validation function with an official regex. The question is whether to use w3.org or whatwg.org. Do you have an opinion? As a name we could call the function specEmail. Do you have ideas for other names?

kurtextrem · 2023-10-06T18:45:21Z

I think that in 99% of cases this regex does what the user expects

But what do users expect? The examples you've listed above are covered by the Regex book from O'Reilly, too. And the regex could be modified to prevent ', if that is your concern. The difference of both regexes is 5 bytes, and the O'Reilly regex has no negative lookaheads, which leads to almost 50% higher performance: https://esbench.com/bench/652054377ff73700a4deba38 (latest chrome dev v120, V8 11.9.146.1)

I really see literally no advantage in using that regex.

I will have a look at the ESLint plugins in the next days. If you like you can also create a PR.

Sure. Do you have any other eslint plugins in mind you'd like to use or just the regexps one (and maybe the security one)?

As a name we could call the function specEmail. Do you have ideas for other names?

specEmail + documenting it's the regex from input[type=email] sounds good to me!

fabian-hiller · 2023-10-06T19:00:56Z

Thanks again for this insight. I will review the O'Reilly regex in the next days and may switch to it. Would you still add a specEmail function then? And with input[type=email] you mean the shorter W3C regex or?

I am only familiar with the basics of linting plugins, so I don't have an opinion at the moment. You are welcome to make a suggestion with a PR.

fabian-hiller · 2023-10-07T04:39:27Z

I can imagine using the regex from O'Reilly with a few changes.

/^[\w!#$%&'*+/=?`{|}~^-]+(?:\.[\w!#$%&'*+/=?`{|}~^-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$/i

I would limit the special characters !#$%&'*+/=?`{|}~^- to meaningful characters +-.

/^[\w+-]+(?:\.[\w+-]+)*@(?:[A-Z0-9-]+\.)+[A-Z]{2,6}$/i

Also, I would prefer to prevent a domain name from starting and ending with a hyphen.

/^[\w+-]+(?:\.[\w+-]+)*@[A-Z0-9]+(?:(?:\.|-)[A-Z0-9]+)*\.[A-Z]{2,6}$/i

Then I would remove the restriction on the maximum length of the domain, because as far as I know there are now TLDs with more characters.

/^[\w+-]+(?:\.[\w+-]+)*@[A-Z0-9]+(?:(?:\.|-)[A-Z0-9]+)*\.[A-Z]{2,}$/i

Feel free to share your thoughts on this.

kurtextrem · 2023-10-07T07:10:53Z

Also, I would prefer to prevent a domain name from starting and ending with a hyphen.

Fully agree, also a nice catch!

Then I would remove the restriction on the maximum length of the domain, because as far as I know there are now TLDs with more characters.

Great catch: https://developer.mozilla.org/en-US/docs/Learn/Common_questions/Web_mechanics/What_is_a_domain_name -> 63 is max length:

/^[\w+-]+(?:\.[\w+-]+)*@[A-Z0-9]+(?:(?:\.|-)[A-Z0-9]+)*\.[A-Z]{2,63}$/i

by using regexp-tree optimizer, I got it to:

/^[\w+-]+(?:\.[\w+-]+)*@[\da-z]+(?:[.-][\da-z]+)*\.[a-z]{2,63}$/i

(I was curious if using the i mod makes the regex slower, faster or equal compared to typing out [a-zA-Z] and the result is equal, including regex parsing cost)

Here is the full list of TLDs: https://data.iana.org/TLD/tlds-alpha-by-domain.txt, looks like it also contains XN--11B4C3D (कॉम]), but I'm not sure if those domains are really spread (I'd say we can ignore it, but we should be aware that we do)

Would you still add a specEmail function then? And with input[type=email] you mean the shorter W3C regex or?

this one is the w3c regex:
/^[a-zA-Z0-9.!#$%&'*+/=?^_`{|}~-]+@[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?(?:\.[a-zA-Z0-9](?:[a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?)*$/

I still see value in adding a regex that allows more emails, because some people might not insert emails into SQL statements, but rather forward it to a mail client. Imho it's up to a mail client to reject wrong emails if a browser also accepts them. However, we could also add the full W3C regex instead (if we find one that is working with JS). There might be downsides of it:

The reason you shouldn’t use this regex is that it is overly broad. Your application may not be able to handle all email addresses this regex allows. Domain-specific routing addresses can contain non-printable ASCII control characters, which can cause trouble if your application needs to display addresses. Not all applications support the syntax for the local part using double quotes or square brackets. In fact, RFC 5322 itself marks the notation using square brackets as obsolete.

So... maybe a trade-off would be using O'Reilly before and the spec part after @ ? e.g.

/^[\w!#$%&'*+/=?^{|}~-]+(?:\.[\w!#$%&'*+/=?^{|}~-]+)*@[\da-z](?:[\da-z-]{0,61}[\da-z])?(?:\.[\da-z](?:[\da-z-]{0,61}[\da-z])?)*$/i

That also captures [email protected] and disallows emails starting with ., and domains starting with -. So we've successfully reduced false-positives coming in from browsers and also are closer to the spec as people can use O'[email protected] as email.

By the way, I think we really got a good regex now!

fabian-hiller · 2023-10-07T15:34:51Z

Thanks again for your research. I like your result! Do you want to create a PR or should I commit the regex directly with you as co-author?

(I was curious if using the i mod makes the regex slower, faster or equal compared to typing out [a-zA-Z] and the result is equal, including regex parsing cost)

Haha, I also compared that 😅

For specEmail, I recommend creating an issue or PR with all the information and waiting for others to respond to see if there is interest in it before we merge it.

kurtextrem · 2023-10-07T19:01:02Z

Thanks again for your research. I like your result! Do you want to create a PR or should I commit the regex directly with you as co-author?

Co-author is perfectly fine for me

For specEmail, I recommend creating an issue or PR with all the information and waiting for others to respond to see if there is interest in it before we merge it.

will do

Co-authored-by: Jacob Groß <[email protected]>

fabian-hiller · 2023-10-08T02:06:31Z

The regex will go live with the next update. For info, I decided not to add the domain length limit, since the specification is so far away from real TLDs that I don't see any advantage in it. It also does not help to limit the total length. You have to use maxLength for that.

kazizi55 · 2023-10-08T06:02:00Z

@fabian-hiller @kurtextrem
Thank you very much for improving my email regex!!

@fabian-hiller
And sorry that I couldn't reply to your comments quickly, because I was busy currently 🙏

fabian-hiller · 2023-10-08T06:03:17Z

No problem!

kurtextrem · 2023-10-08T08:08:47Z

For info, I decided not to add the domain length limit, since the specification is so far away from real TLDs that I don't see any advantage in it. It also does not help to limit the total length. You have to use maxLength for that.

Fine to me, but the max length of the TLD could be checked by the regex (via $ at the end). I think for max length of the email, that's probably another topic (and shouldn't be done by a regex)

Thank you very much for improving my email regex!!

🤝

kazizi55 added 3 commits September 23, 2023 18:45

fix: copy email regex from zod

d3d8665

fix: allow left & right double quotation mark in email regex

6d0a0c1

fix: allow 3 digits * 4 pattern in email regex

c01ae77

kazizi55 marked this pull request as ready for review September 23, 2023 10:15

fabian-hiller self-assigned this Sep 23, 2023

fabian-hiller added the enhancement New feature or request label Sep 23, 2023

fabian-hiller reviewed Sep 28, 2023

View reviewed changes

library/src/validations/email/email.ts Outdated Show resolved Hide resolved

Simplify email regex and improve tests

2bb7428

fabian-hiller merged commit 3ec0c7f into fabian-hiller:main Oct 6, 2023

fabian-hiller added a commit that referenced this pull request Oct 6, 2023

Change email regex to new one from Zod #180

8bf2e93

Co-authored-by: Macs Dickinson <[email protected]>

fabian-hiller added a commit that referenced this pull request Oct 7, 2023

Change email regex to a handcrafted one #180

8ad42e7

fabian-hiller mentioned this pull request Oct 7, 2023

[2609] fix ReDoS vulnerability in email regex colinhacks/zod#2824

Merged

kurtextrem mentioned this pull request Oct 7, 2023

email validation: provide spec validation #204

Closed

fabian-hiller added a commit that referenced this pull request Oct 8, 2023

Improve handcrafted email regex #180

d255ac0

Co-authored-by: Jacob Groß <[email protected]>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: improve email regex #180

fix: improve email regex #180

kazizi55 commented Sep 23, 2023

netlify bot commented Sep 23, 2023 •

edited

Loading

FleetAdmiralJakob commented Sep 23, 2023

fabian-hiller commented Sep 23, 2023

fabian-hiller commented Sep 25, 2023

kazizi55 commented Sep 25, 2023

fabian-hiller commented Sep 26, 2023

kazizi55 commented Sep 28, 2023

fabian-hiller commented Oct 3, 2023

kurtextrem commented Oct 4, 2023 •

edited

Loading

fabian-hiller commented Oct 5, 2023

fabian-hiller commented Oct 6, 2023

kurtextrem commented Oct 6, 2023 •

edited

Loading

kurtextrem commented Oct 6, 2023

fabian-hiller commented Oct 6, 2023

fabian-hiller commented Oct 6, 2023

kurtextrem commented Oct 6, 2023 •

edited

Loading

fabian-hiller commented Oct 6, 2023

fabian-hiller commented Oct 7, 2023

kurtextrem commented Oct 7, 2023 •

edited

Loading

fabian-hiller commented Oct 7, 2023

kurtextrem commented Oct 7, 2023

fabian-hiller commented Oct 8, 2023

kazizi55 commented Oct 8, 2023

fabian-hiller commented Oct 8, 2023

kurtextrem commented Oct 8, 2023 •

edited

Loading

fix: improve email regex #180

fix: improve email regex #180

Conversation

kazizi55 commented Sep 23, 2023

netlify bot commented Sep 23, 2023 • edited Loading

✅ Deploy Preview for valibot ready!

FleetAdmiralJakob commented Sep 23, 2023

fabian-hiller commented Sep 23, 2023

fabian-hiller commented Sep 25, 2023

kazizi55 commented Sep 25, 2023

fabian-hiller commented Sep 26, 2023

kazizi55 commented Sep 28, 2023

fabian-hiller commented Oct 3, 2023

kurtextrem commented Oct 4, 2023 • edited Loading

fabian-hiller commented Oct 5, 2023

fabian-hiller commented Oct 6, 2023

kurtextrem commented Oct 6, 2023 • edited Loading

kurtextrem commented Oct 6, 2023

fabian-hiller commented Oct 6, 2023

fabian-hiller commented Oct 6, 2023

kurtextrem commented Oct 6, 2023 • edited Loading

fabian-hiller commented Oct 6, 2023

fabian-hiller commented Oct 7, 2023

kurtextrem commented Oct 7, 2023 • edited Loading

fabian-hiller commented Oct 7, 2023

kurtextrem commented Oct 7, 2023

fabian-hiller commented Oct 8, 2023

kazizi55 commented Oct 8, 2023

fabian-hiller commented Oct 8, 2023

kurtextrem commented Oct 8, 2023 • edited Loading

netlify bot commented Sep 23, 2023 •

edited

Loading

kurtextrem commented Oct 4, 2023 •

edited

Loading

kurtextrem commented Oct 6, 2023 •

edited

Loading

kurtextrem commented Oct 6, 2023 •

edited

Loading

kurtextrem commented Oct 7, 2023 •

edited

Loading

kurtextrem commented Oct 8, 2023 •

edited

Loading