| I disagree with that takeaway. Technically, PCRE regexes are far more powerful than true regular expressions (recursion, backreferences) and can match much more than regular languages. In practice, a complex PCRE regex will almost always be harder to maintain than a parser combinator or a hand-rolled parser in a more conventional language. People who say "Regexes can't match HTML, use an HTML library" are wrong to say regexes are incapable of it, but they're right to say you should use a library meant for the job. Sure, you can use this regex to match emails (http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html), but writing the parser in something more conventional than PCRE will give you more readable code. The same is true for almost any regular expression that leans on PCRE features, especially backreferences. In addition, a regex will only match HTML correctly if you write a very complex one. With a naive regex for an HTML tag's contents, you'll find you can still match text inside a <script> tag even though that text isn't HTML. So now you need to figure out when you're inside a script tag and exclude that, or when you're inside an HTML attribute string, and before you know it you have a 2000-character regex that no one else will be able to read, all because you didn't want to use an HTML parsing library where getting a tag's value correctly would be a single XPath expression or CSS selector away. |
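
To make the <script> problem concrete, here is a minimal sketch using only Python's standard library. The sample page, the `naive_paragraphs` helper, and the `ParagraphText` class are made up for illustration. The naive regex picks up the string literal inside the script, while `html.parser` treats script contents as raw text and skips it.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: one real <p>, plus a <p> that only exists
# as a JavaScript string inside a <script> element.
PAGE = (
    "<body><p>real paragraph</p>"
    '<script>var s = "<p>not markup</p>";</script></body>'
)


def naive_paragraphs(html):
    """Naive approach: grab whatever sits between <p> and </p>."""
    return re.findall(r"<p>(.*?)</p>", html)


class ParagraphText(HTMLParser):
    """Collect the text of each <p> element seen by the parser."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.found.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Script contents also arrive here as plain data, but only
        # while in_p is False, so they are never recorded.
        if self.in_p:
            self.found[-1] += data


naive = naive_paragraphs(PAGE)

parser = ParagraphText()
parser.feed(PAGE)
parser.close()
parsed = parser.found

print(naive)   # ['real paragraph', 'not markup']
print(parsed)  # ['real paragraph']
```

With a third-party library that supports XPath or CSS selectors, the parser class would shrink to a single query; the standard-library version is used here only to keep the sketch dependency-free.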