by johnzabroski 4622 days ago
You are missing the point. Perhaps it will help if I frame the problem for you differently: scrubbing bad data, and developing policies for minimizing bad data in your system.

Not all validation needs to happen at the time of data entry.

Your "hard" regex may reject international email addresses. In fact, if your regex's input isn't converted to Punycode first, you are a fool for even attempting to use regex, because now your regex will likely fail on all IDNA inputs.
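A minimal sketch of the Punycode point, assuming Python and its built-in "idna" codec (which implements the older IDNA 2003 rules). The pattern ASCII_EMAIL and the helpers to_ascii_email and matches are hypothetical names made up for illustration, and the pattern is deliberately loose, not a recommended validator:

```python
import re

# Hypothetical, deliberately loose ASCII-only pattern (illustration only).
ASCII_EMAIL = re.compile(r"^[^@\s]+@[A-Za-z0-9.-]+$")

def to_ascii_email(address: str) -> str:
    """Return the address with its domain converted to Punycode (ASCII) form."""
    local, sep, domain = address.rpartition("@")
    if not sep:
        raise ValueError("address has no @")
    # The stdlib "idna" codec follows IDNA 2003; it raises UnicodeError
    # (a ValueError subclass) for labels it cannot encode.
    ascii_domain = domain.encode("idna").decode("ascii")
    return f"{local}@{ascii_domain}"

def matches(address: str) -> bool:
    """Apply the ASCII pattern only after the domain has been converted."""
    try:
        return ASCII_EMAIL.match(to_ascii_email(address)) is not None
    except ValueError:
        return False

# The raw IDN address fails the ASCII pattern; the converted form passes.
print(ASCII_EMAIL.match("user@bücher.example") is None)
print(matches("user@bücher.example"))
```

Even with the conversion, a match only says the string has a plausible shape. It says nothing about whether the mailbox exists or accepts mail.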

And what is your test suite going to be?

And what did you "validate" exactly? That you matched your regex? What if the e-mail isn't active, or the mailbox is full? Outlook 2013 actually has this really cool feature called MailTips that provides more advanced mailing list and e-mail address validation and warnings: http://blogs.technet.com/b/exchange/archive/2009/04/28/34073...

Suppose when you first signed up the user, they validated their email address, but now the account seems to be inactive. How do you handle that scenario? Continuous validation.

And how generally useful is your regex? What are you going to do if the email came from OCR software output, or screen scraping output? Your ERP may have the original document it was scanned from. Are you going to not store the bad e-mail address simply because you wrote some "hard" regex that rejected it? That is not a straightforward question to answer, because it depends on your data model for storing addresses. You might have an IsConfirmed column.
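One way to picture that data model, as a rough sketch only: store the address exactly as received, record where it came from, and track confirmation separately so it can be rechecked later. The EmailRecord class, its field names, and the 90-day recheck window are all assumptions for illustration, not something from the article:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class EmailRecord:
    raw: str                                  # stored as received, never rejected
    source: str                               # e.g. "signup_form", "ocr", "scrape"
    is_confirmed: bool = False                # the IsConfirmed column
    last_checked: Optional[datetime] = None   # when confirmation was last attempted

def needs_recheck(rec: EmailRecord, now: datetime,
                  max_age: timedelta = timedelta(days=90)) -> bool:
    """Continuous validation: unconfirmed or stale records get checked again."""
    if not rec.is_confirmed or rec.last_checked is None:
        return True
    return now - rec.last_checked > max_age

def record_check(rec: EmailRecord, now: datetime, delivered: bool) -> None:
    """Record the outcome of a confirmation attempt (e.g. a clicked link or a bounce)."""
    rec.is_confirmed = delivered
    rec.last_checked = now

rec = EmailRecord(raw="j0hn@examp1e.com", source="ocr")
print(needs_recheck(rec, datetime(2024, 1, 1)))  # unconfirmed, so it needs a check
```

The bad OCR address is still stored. It is just flagged as unconfirmed, so a later scrubbing pass or a human can compare it against the original document.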

Here is another example of "continuous validation": validating mailing addresses. Most major e-commerce sites allow very liberal input, but scrub the data in real time or near real time, because the postal service gives discounts to companies that print "correct" address labels. "Correct" here could mean "One Post Office Square, Boston, MA, 02109" instead of "1 Post Office Sq, Boston, MA, 02109". This process is called Address Standardization. In areas of the world with rapidly growing economies, Address Standardization vendors are often behind, because some "streets" don't have addresses yet and aren't known to exist in any GPS system. This is common in many parts of China.
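As a toy illustration of "accept liberally, scrub afterwards", the sketch below expands a couple of abbreviations in the street line. The lookup tables and the standardize function are hypothetical stand-ins. A real Address Standardization service works from postal reference data, not a hand-written dictionary:

```python
# Hypothetical lookup tables, far smaller than any real vendor's data.
EXPANSIONS = {"sq": "Square", "st": "Street", "ave": "Avenue"}
NUMBER_WORDS = {"1": "One"}

def standardize(raw: str) -> str:
    """Expand known abbreviations in the street line; leave other parts as entered."""
    parts = [p.strip() for p in raw.split(",")]
    out = []
    for i, word in enumerate(parts[0].split()):
        key = word.rstrip(".").lower()
        if i == 0 and word in NUMBER_WORDS:
            out.append(NUMBER_WORDS[word])   # leading house number spelled out
        elif key in EXPANSIONS:
            out.append(EXPANSIONS[key])      # e.g. "Sq" becomes "Square"
        else:
            out.append(word)
    parts[0] = " ".join(out)
    return ", ".join(parts)

entered = "1 Post Office Sq, Boston, MA, 02109"
# Keep both: the raw input is the record of what the user typed,
# and the scrubbed form is what gets printed on the label.
stored = {"raw": entered, "standardized": standardize(entered)}
print(stored["standardized"])
```

A street that the table does not know about passes through unchanged, which mirrors the vendor-is-behind case: the data is kept and can be scrubbed again once the reference data catches up.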

Here is another example of "continuous validation". How Google does spell checking, as compared to the "fixed validation" in Microsoft Word's spell checker.

1 comment

All these other questions are dependent on context and out of scope of the argument.

The article's argument in a nutshell is that validating email is hard, so don't bother. In fact, let users submit whatever they want, including JavaScript. Then just check for @ and send it off to the next parser in the chain. In fact, get lots of 3rd party parsers for misc features and send the data to them first. Spend effort fixing autocomplete so users can more easily enter data that you will automatically accept. I'm sure this can only improve data quality...

I can imagine that wanting to know all the stupid shit your users submit as an email is the correct solution in certain contexts, but for a majority of cases, this article is wrong in everything that it suggests. Admittedly, there is very little context given.

Perhaps the context is "I don't care about the security of my users or my services, and I will run whatever 3P code on my backend appears to do the job of making a webpage look spiffy and easy to use. Once I have 10 million (unverified) users, I sell my spaghetti factory and it's no longer my problem."

After all that, he recommends not letting people use software without a validated email address. Too bad he never bothers saying how he would get to that point, only how he would avoid doing the work.