by wodow 4769 days ago
I understand the argument re validating email addresses passively (regex, no regex, etc.) vs actively (send an email by SMTP).

What I don't understand with this ever-repeating discussion is why the complexity has to be visible. e.g.

    > <LARGE REGEX>
    > Yeesh. Is something that complex really necessary?
Many functions are complex - we put those in libraries, pushing them under the hood, and move on.

What is so special about parsing email addresses that makes everyone invent their own solution - regex or otherwise?


Plus, a large regex for mail validation is not supposed to be heavily used. It's supposed to run once, at registration for example. So why would it matter if it's slow or heavy?
Maybe it is less resource-intensive to actually send an email than to use a heavy regex to validate the address?
I really don't think so: sending an email means hitting a mail server, while a regex is just some code that runs on a tiny string (an email address is never really long).

Also, it's a bother for the user. If you need mail confirmation then do it, but otherwise it should be a RULE OF THUMB to avoid annoying the user, and thus to avoid mail confirmation.

This article is actually really bad advice. I don't know why it's upvoted so much.

I'm not completely sure that I get annoyed when a web site sends me a confirmation email. It helps me know that the site indeed knows my correct email.
> What is so special about parsing email addresses that makes everyone invent their own solution - regex or otherwise?

A valid email address can contain almost anything; this makes validation via a standard parser mostly useless. As such, developers reach for stricter parsers out of a combination of not comprehending the standards, feeling vague discomfort about letting 'just anything' past data validation, and misplaced concern for users that they believe can't type their own e-mail address.

Add to that the occasional business complaint from the marketing arm about bogus e-mail addresses, and you have people repeatedly solving the problem in slightly different ways, justifying their own divergences from the standard on the grounds that nobody will use a 'weird' address anyway, and that they're actually being helpful.

> misplaced concern for users that they believe can't type their own e-mail address

How is this misplaced? People screw up even the most basic of computer tasks all the time.

1) Because the solutions actually prevent some users from typing their actual e-mail address.

2) There are so many ways to get the e-mail address wrong that it's almost not worth bothering validating the few things that you can validate.
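
As a rough illustration of point 1, here is a minimal sketch. The "strict" pattern below is hypothetical, written to resemble the hand-rolled validators people tend to paste into signup forms; it is not taken from any particular library, and the sample addresses are made up.

```python
import re

# Hypothetical "strict" pattern, typical of hand-rolled validators.
# Assumption: letters, digits, dot, underscore, hyphen in the local part,
# a single-label domain, and a TLD of two to four letters.
STRICT = re.compile(r"^[A-Za-z0-9._-]+@[A-Za-z0-9-]+\.[A-Za-z]{2,4}$")


def strict_accepts(address: str) -> bool:
    """Return True if the hypothetical strict pattern accepts the address."""
    return STRICT.match(address) is not None


# Addresses that are syntactically fine but trip the pattern above.
VALID_BUT_UNUSUAL = [
    "first+tag@example.com",    # plus sign in the local part
    "o'brien@example.com",      # apostrophe in the local part
    "user@mail.example.co.uk",  # more than one dot in the domain
    "user@example.museum",      # TLD longer than four letters
]

# Every one of these ends up rejected by the strict pattern.
rejected = [a for a in VALID_BUT_UNUSUAL if not strict_accepts(a)]
```

A plain address such as alice@example.com passes, which is why this kind of pattern tends to survive testing and only fails once real users show up.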

Now, here's what would be an interesting validation method that doesn't actually require sending an e-mail. It requires an RFC-compliant e-mail parser, not a regexp:

- Perform A/MX lookups on the domain part. The domain part can be an IP address, so those get a free pass.

- Connect to the returned MX, issue a MAIL FROM+RCPT TO:

  c> MAIL FROM: test@example.org
  s> 250 2.1.0 Ok
  c> RCPT TO: is_address_valid@example.com
  s> 554 5.7.1 <is_address_valid@example.com>: Relay access denied
  c> RSET [reset the transaction, no e-mail is sent]
- If you get back a permanent 5xx error, the address is invalid. If you get back a 250 Ok, the address is probably valid (it could still be a relay that allows backscatter, in which case it will allow any address on one of its configured domains). If you receive a 4xx, the address may or may not be valid -- graylisters will send 4xx, as will servers that can't currently accept e-mail, etc.

This gives you definitive failure (5xx) and almost-definitive success (250 Ok). It's a cheap DNS lookup + TCP connection that you can begin performing immediately and asynchronously when a user enters their address in a form.
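
The steps above could be sketched roughly as follows with Python's standard smtplib. This is only a sketch under assumptions: the function names are made up, mx_host is assumed to have been resolved already (the standard library has no MX resolver), and the sender address is the placeholder from the transcript. smtplib formats the MAIL FROM and RCPT TO commands itself.

```python
import smtplib


def classify_rcpt_reply(code: int) -> str:
    """Map the RCPT TO reply code to a verdict, per the rules above."""
    if code == 250:
        return "probably valid"   # could still be a catch-all or a relay
    if 500 <= code < 600:
        return "invalid"          # permanent failure
    return "unknown"              # 4xx (graylisting, temporary trouble) or anything else


def probe_address(mx_host: str, address: str,
                  sender: str = "test@example.org",
                  timeout: float = 10.0) -> str:
    """Ask mx_host whether it would accept mail for address, without sending any.

    Hypothetical helper: mx_host must already be the result of an MX (or A) lookup.
    """
    try:
        with smtplib.SMTP(mx_host, 25, timeout=timeout) as smtp:
            smtp.helo()
            code, _ = smtp.mail(sender)
            if code != 250:
                return "unknown"
            code, _ = smtp.rcpt(address)
            smtp.rset()  # reset the transaction; no DATA is ever issued
            return classify_rcpt_reply(code)
    except (OSError, smtplib.SMTPException):
        # Connection refused, timeout, dropped session: no verdict either way.
        return "unknown"
```

Something like probe_address("mx.example.com", "is_address_valid@example.com") would be run in the background once the form field loses focus; the host name here is a placeholder.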

... or just send the user an activation e-mail.

Hopefully that's not the SMTP syntax you're actually using.

    * There's no space between FROM: and the address in SMTP
    * Email addresses must come between angle brackets
I'd reject (give you a 5xx) that from my mail server for those reasons alone.
> Hopefully that's not the SMTP syntax you're actually using.

I typed it out live. I'm not an SMTP client and I don't have the RFCs memorized.

> I'd reject (give you a 5xx) that from my mail server for those reasons alone.

Postfix accepts it. I haven't checked the RFC to verify your concerns, but assuming they're correct, then my expectation is that Postfix is liberal in what it accepts because A) it's a good idea, and B) a real mail transfer agent probably ignored those two minimal rules at some point in the past.

Postfix (and the other big receivers) will ignore it, but will send using the proper RFC syntax. It's still a good sign of a badly written bulk mail engine, and worth rejecting for.
That won't really work with people who mistype domains, e.g. gmale.com, as that domain may have a catch-all enabled.
So in addition, 'spell check' for likely domains. People probably don't mean to type 'gmale.com' -- but don't prevent them from doing so, if that's what they really meant.
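
A minimal sketch of that kind of 'spell check', using difflib from the Python standard library. The domain list, the function name, and the 0.8 similarity cutoff are all assumptions picked for illustration; a real list would be much longer.

```python
import difflib

# Assumed short list of popular mail domains, for illustration only.
COMMON_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com"]


def suggest_domain(address, known=COMMON_DOMAINS, cutoff=0.8):
    """Return a likely intended domain, or None if there is nothing to suggest.

    This only suggests; it never rejects the address the user typed.
    """
    _, sep, domain = address.rpartition("@")
    if not sep:
        return None  # no "@" at all, nothing to compare
    domain = domain.lower()
    if domain in known:
        return None  # already a known domain, no suggestion needed
    matches = difflib.get_close_matches(domain, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For bob@gmale.com this suggests gmail.com, which the form can show as a "did you mean ...?" hint while still accepting gmale.com if the user insists.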