HN Mirror


by megaduck 5830 days ago

The language of individual HTML tags is certainly regular, and trivially easy. However, the language of "matched HTML tags with junk between them" is NOT regular.

Anything that requires balanced matching is NOT parseable with standard regular expressions, and by "not parseable" I mean that you will literally have an infinite number of bugs: any fixed pattern only handles nesting up to some depth, and there is always a deeper input that breaks it. Shoot me an email and I can show you the math.
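To make the distinction concrete, here is a minimal Python sketch (not from the original comment). The `TAG` pattern and the `is_balanced` helper are hypothetical names, and the tag pattern is deliberately simplified: it assumes no comments, no `>` inside attribute values, and no void elements such as `<br>`. It shows a naive regex stopping at the first closing tag on nested input, while a stack, which is exactly what a finite automaton lacks, handles the nesting.

```python
import re

# Simplified, assumed tag pattern: recognizing a single tag is a regular problem.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")

def is_balanced(html: str) -> bool:
    """Check that open/close tags nest properly, using an explicit stack."""
    stack = []
    for m in TAG.finditer(html):
        closing, name = m.group(1), m.group(2).lower()
        if m.group(0).endswith("/>"):
            continue  # self-closing tag: nothing to match later
        if closing:
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # leftover open tags mean the input is unbalanced

# Naive attempt at "matched tags with junk between them".
naive = re.compile(r"<div>(.*?)</div>", re.S)
nested = "<div>a<div>b</div>c</div>"

# The non-greedy match ends at the FIRST </div>, not the matching one.
inner = naive.search(nested).group(1)
print(repr(inner))
print(is_balanced(nested))
print(is_balanced("<div><p></div></p>"))
```

Making the regex greedy does not help: it then overshoots on inputs with two sibling `<div>` elements. Each patch fixes one depth or shape and leaves another broken.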

Even with Perl's whiz-bang recursive not-really-regexes-regexes, I'd strongly recommend against tackling balanced matching problems like HTML or XML. It might be theoretically possible (I haven't actually checked), but your brain will leak from your ears and you probably won't get it right, no matter how smart you are.

1 comment

albertzeyer 5830 days ago

But the OP asked only about parsing a specific type of tag (which is regular), not the text between open/close tags (which is not regular).
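As a rough illustration of the regular case, here is a small Python sketch that pulls out every opening tag of one type. The choice of `<a>` and the names `A_TAG` and `find_open_a_tags` are assumptions for the example, not what the OP actually asked for, and the pattern still assumes no `>` inside attribute values and no commented-out markup.

```python
import re

# Matches a single opening <a ...> tag. The \b keeps it from matching <abbr> etc.
A_TAG = re.compile(r"<a\b[^>]*>", re.IGNORECASE)

def find_open_a_tags(html: str) -> list[str]:
    """Return every opening <a> tag found in the input, in order."""
    return A_TAG.findall(html)

sample = '<p><a href="x">link</a> <abbr>n/a</abbr> <A class="y">z</A></p>'
print(find_open_a_tags(sample))
```

No counting or nesting is involved here, which is why a plain regex is enough for this narrower task.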
