Hacker News
by anonyfuss 4769 days ago
> What is so special about parsing email addresses that makes everyone invent their own solution - regex or otherwise?

A valid email address can contain almost anything; this makes validation via a standard parser mostly useless. As such, developers reach for stricter parsers out of a combination of not comprehending the standards, feeling vague discomfort about letting 'just anything' past data validation, and misplaced concern for users that they believe can't type their own e-mail address.

Add to that the occasional complaint from the marketing arm about bogus e-mail addresses, and you have people repeatedly solving the problem in slightly different ways, justifying their divergences from the standard on the grounds that nobody will use a 'weird' address anyway, and that they're actually being helpful.


> misplaced concern for users that they believe can't type their own e-mail address

How is this misplaced? People screw up even the most basic of computer tasks all the time.

1) Because the solutions actually prevent some users from typing their actual e-mail address.

2) There are so many ways to get the e-mail address wrong that it's hardly worth validating the few things that you can validate.

Now, here's what would be an interesting validation method that doesn't actually require sending an e-mail. It requires an RFC-compliant e-mail parser, not a regexp:

- Perform A/MX lookups on the domain part. The domain part can be an IP address, so those get a free pass.

- Connect to the returned MX, issue a MAIL FROM+RCPT TO:

  c> MAIL FROM: test@example.org
  s> 250 2.1.0 Ok
  c> RCPT TO: is_address_valid@example.com
  s> 554 5.7.1 <is_address_valid@example.com>: Relay access denied
  c> RSET [reset the transaction, no e-mail is sent]
- If you get back a permanent 5xx error, the address is invalid. If you get back a 250 Ok, the address is probably valid (it could still be a relay that allows backscatter, in which case it will allow any address on one of its configured domains). If you receive a 4xx, the address may or may not be valid -- graylisters will send 4xx, as will servers that can't currently accept e-mail, etc.

This gives you definitive failure (5xx) and almost-definitive success (250 Ok). It's a cheap DNS lookup + TCP connection that you can begin performing immediately and asynchronously when a user enters their address in a form.
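A minimal Python sketch of the probe described above, under some assumptions: the MX host has already been resolved (the standard library has no MX lookup, so that step is left out), and the names `Verdict`, `classify_rcpt_reply`, and `probe` are made up for illustration. It uses `smtplib`, which formats the MAIL FROM and RCPT TO commands itself.

```python
import smtplib
from enum import Enum


class Verdict(Enum):
    INVALID = "invalid"                # permanent 5xx on RCPT TO
    PROBABLY_VALID = "probably valid"  # 250 Ok (could still be a catch-all or relay)
    UNKNOWN = "unknown"                # 4xx, connection trouble, anything else


def classify_rcpt_reply(code: int) -> Verdict:
    """Map the reply code of RCPT TO onto the three outcomes from the post."""
    if 500 <= code <= 599:
        return Verdict.INVALID
    if code == 250:
        return Verdict.PROBABLY_VALID
    return Verdict.UNKNOWN


def probe(mx_host: str, address: str,
          sender: str = "test@example.org", timeout: float = 10.0) -> Verdict:
    """Ask mx_host whether it would accept mail for address, without sending any.

    Assumes mx_host came from an MX (or fallback A) lookup done elsewhere.
    """
    try:
        with smtplib.SMTP(mx_host, 25, timeout=timeout) as smtp:
            smtp.ehlo_or_helo_if_needed()
            code, _ = smtp.mail(sender)
            if code != 250:
                # The server objected to the sender, which says nothing
                # about the recipient address.
                return Verdict.UNKNOWN
            code, _ = smtp.rcpt(address)
            smtp.rset()  # reset the transaction, no e-mail is sent
            return classify_rcpt_reply(code)
    except (OSError, smtplib.SMTPException):
        # Unreachable host, timeout, dropped connection, failed HELO, etc.
        return Verdict.UNKNOWN
```

Anything that is not a clear 5xx or 250 is deliberately lumped into `UNKNOWN`, so a graylisting server or a network hiccup never blocks the user from submitting the form.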

... or just send the user an activation e-mail.

Hopefully that's not the SMTP syntax you're actually using.

    * There's no space between FROM: and the address in SMTP
    * Email addresses must come between angle brackets
I'd reject (give you a 5xx) that from my mail server for those reasons alone.
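For reference, a small Python sketch of the envelope commands in the form those two points describe (RFC 5321 style: no space after the colon, address in angle brackets). The helper names `mail_from` and `rcpt_to` are hypothetical, and the sketch does no quoting or validation of the address itself.

```python
def mail_from(address: str) -> str:
    """Build the MAIL command line (without the trailing CRLF)."""
    return f"MAIL FROM:<{address}>"


def rcpt_to(address: str) -> str:
    """Build the RCPT command line (without the trailing CRLF)."""
    return f"RCPT TO:<{address}>"


print(mail_from("test@example.org"))
print(rcpt_to("is_address_valid@example.com"))
```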
> Hopefully that's not the SMTP syntax you're actually using.

I typed it out live. I'm not an SMTP client and I don't have the RFCs memorized.

> I'd reject (give you a 5xx) that from my mail server for those reasons alone.

Postfix accepts it. I haven't checked the RFC to verify your concerns, but assuming they're correct, my expectation is that Postfix is liberal in what it accepts because A) it's a good idea, and B) a real mail transfer agent probably ignored those two minimal rules at some point in the past.

Postfix (and the other big receivers) will ignore it, but will send using the proper RFCs. It's still a good sign of a badly written bulk mail engine, and worth rejecting for.
> It's still a good sign of a badly written bulk mail engine, and worth rejecting for.

No, it might be worth scoring the e-mail with a spam filter, but the MTA shouldn't be overzealously throwing away e-mail.

That won't really work with people who mistype domains, e.g. gmale.com, as that domain may have a catch-all enabled.
So in addition, 'spell check' for likely domains. People probably don't mean to type 'gmale.com' -- but don't prevent them from doing so, if that's what they really meant.
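One way such a 'spell check' could look, as a rough Python sketch: compare the domain part against a short list of popular domains with `difflib` and offer the closest one as a suggestion only. The `COMMON_DOMAINS` list, the `suggest_domain` name, and the 0.8 cutoff are all assumptions chosen for illustration, not tuned values.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical list; a real one would come from your own signup data.
COMMON_DOMAINS = ["gmail.com", "yahoo.com", "hotmail.com", "outlook.com"]


def suggest_domain(address: str, cutoff: float = 0.8) -> Optional[str]:
    """Return a likely intended domain, or None if there is nothing to suggest.

    This never rejects the address; the caller should only ask
    "did you mean ...?" and accept whatever the user confirms.
    """
    _, at, domain = address.rpartition("@")
    domain = domain.lower()
    if not at or not domain or domain in COMMON_DOMAINS:
        return None
    matches = get_close_matches(domain, COMMON_DOMAINS, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest_domain("someone@gmale.com"))
```

The result is meant to drive a hint next to the form field, so someone who really does own an address at gmale.com can ignore it and carry on.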