by rjbond3rd 5003 days ago
I'm not exactly disagreeing but just curious:

1. How can a non-RFC-compliant email be valid?

2. What about compiled regexes, performance-wise?

3. Sometimes a regex is faster than the overhead of a parser, so wouldn't the choice be dependent on context? In other words, regexes are not always slower, true?

4. Wouldn't some abstraction libraries utilize regexes under the hood? Would that be wrong in your view?

P.S. Some languages allow the option for very readable regexes, e.g. separate each component on its own line, with a comment.
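To illustrate the P.S., here is a minimal sketch using Python's `re.VERBOSE` flag, which lets each component sit on its own line with a comment. The pattern and the `looks_like_email` helper are hypothetical and deliberately loose; they only check the rough shape of an address and make no claim to RFC 5322 completeness.

```python
import re

# Hypothetical, deliberately loose pattern: a rough shape check only,
# not an RFC 5322 validator.
EMAIL_SHAPE = re.compile(
    r"""
    [^@\s]+      # local part: anything except "@" and whitespace
    @            # separator
    [^@\s]+      # domain, up to the last dot
    \.           # at least one dot in the domain
    [^@\s]+      # final label
    """,
    re.VERBOSE,
)

def looks_like_email(text: str) -> bool:
    # fullmatch requires the whole string to fit the pattern
    return EMAIL_SHAPE.fullmatch(text) is not None
```

Because the character classes only exclude "@" and whitespace, non-ASCII addresses pass this check as well.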

2 comments

> How can a non-RFC-compliant email be valid?

甲斐@黒川.日本 is a non-RFC 5322-compliant, but still valid, email address.

Unless you're implying that "valid" === RFC 5322-compliant, in which case the example isn't valid ;)

The best way to validate an email address: send an email to that email address containing a confirmation link. Simple, easy.
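A rough sketch of that confirmation flow, with the actual mail sending left out. The `pending` store, the function names, and the `https://example.com/confirm` URL are all assumptions made for illustration; a real service would persist tokens and expire them.

```python
import secrets

# Hypothetical in-memory store mapping token -> address.
# A real service would persist this and attach an expiry.
pending = {}

def start_confirmation(address, base_url="https://example.com/confirm"):
    # Generate an unguessable token and build the link to email to the user.
    token = secrets.token_urlsafe(32)
    pending[token] = address
    return f"{base_url}?token={token}"

def confirm(token):
    # Return the confirmed address, or None if the token is unknown
    # or has already been used.
    return pending.pop(token, None)
```

The address is never parsed here; it counts as valid only once someone clicks the link that was delivered to it.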

Ah, understood. I was thinking "valid == well-formed" without knowing whether it really works (i.e., could be deleted), whereas I see you rightfully point out that it more reasonably means "it works." Thank you, makes sense.
Put simply, some sites don't enforce the full set of RFC rules, so people actually have non-RFC-compliant email addresses that work.

How can you 'compile' a regular expression?
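For context, in many regex libraries "compiling" just means parsing the pattern once into a reusable matcher object, rather than producing machine code. A minimal sketch with Python's `re.compile`; the `WORD` pattern and `count_words` helper are made up for illustration.

```python
import re

# Parse the pattern once; the resulting Pattern object is reused
# for every line instead of re-parsing the pattern string each time.
WORD = re.compile(r"\b\w+\b")

def count_words(lines):
    # Sum the number of word matches across all lines.
    return sum(len(WORD.findall(line)) for line in lines)
```

Python's `re` module also caches recently used patterns internally, so the saving from compiling explicitly may be small in practice.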

Very simple regular expressions might be decently fast, but as soon as you start pulling out the more complicated ones needed for parsing, they get slower. Even simple repeats can have a lot of overhead if not used correctly; have a look at "Looking Inside The Regex Engine" at http://www.regular-expressions.info/repeat.html. An equivalent parser doesn't need to do any form of backtracking, and doesn't care about incidental formatting such as whitespace. For example, I've seen an application use regular expressions for HTML parsing. After spending a while figuring out what they actually did, I found that the source HTML had changed its whitespace, but not its DOM structure, and that broke the regular expressions.
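The whitespace failure can be sketched as follows. This is a made-up reconstruction, not the application's actual code: `TITLE_RE`, `TitleParser`, and the sample markup are all hypothetical, and the parser side uses Python's standard `html.parser`.

```python
import re
from html.parser import HTMLParser

# Hypothetical brittle pattern: it hard-codes the exact whitespace
# of the markup it was written against.
TITLE_RE = re.compile(r'<h1 class="title">(.*?)</h1>')

class TitleParser(HTMLParser):
    """Collects the text of every <h1 class="title"> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and dict(attrs).get("class") == "title":
            self.in_title = True
            self.titles.append("")

    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.titles[-1] += data

def titles_via_parser(html):
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return [t.strip() for t in parser.titles]

# Same DOM structure, different whitespace.
original = '<h1 class="title">Hello</h1>'
reformatted = '<h1  class="title" >\n    Hello\n</h1>'
```

Here `TITLE_RE.findall(original)` finds the title, but `TITLE_RE.findall(reformatted)` returns an empty list, while `titles_via_parser` gives the same result for both inputs. The regex could be loosened to tolerate this, but each such tolerance has to be written in by hand.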

Following my reasoning above, I think a lot of 'abstraction' libraries would be faster if they operated directly on the data instead of just converting it to regular expressions. The beauty of regular expressions is the speed at which they can be written.