This is a bit out of my wheelhouse, but this feels wrong, or at least naive. It feels like this sort of reasoning leads to the kind of bugs (depending on what you use the result of the regex for) that allow for code injection, a la the Equifax hack.
Maybe another HN poster can back me up, or explain why in fact Zalgo is mistaken and CargoCode is correct.
Either way, this sort of complexity is one reason I avoid XML like the plague and keep HTML at arm's length.
This is what I really hate about the Zalgo answer. It instills in people some vague sense that regular expressions are somehow bad, wrong and dangerous, but without any real arguments or context that would let you evaluate whether the feeling is justified.
See, this is kinda what I mean. Maybe you can detect tags with regex, but maybe you shouldn't, given the widespread but subtle differences in regex engines.
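To make "maybe you shouldn't" concrete, here is a minimal sketch of one failure mode. It is not about engine differences; it just shows a common naive tag pattern tripping over a `>` inside a quoted attribute, compared with Python's standard `html.parser`. The pattern, the sample markup, and the `TagCollector` helper are all made up for illustration.

```python
import re
from html.parser import HTMLParser

# Naive tag pattern (illustrative): "<", one or more non-">" characters, ">".
NAIVE_TAG = re.compile(r"<[^>]+>")

# Sample markup (made up): the title attribute contains a literal ">".
html = '<a title="x > y" href="/home">link</a>'

# The regex stops at the first ">", which is inside the attribute value.
naive_tags = NAIVE_TAG.findall(html)


class TagCollector(HTMLParser):
    """Hypothetical helper that records start tags and their attributes."""

    def __init__(self):
        super().__init__()
        self.starts = []

    def handle_starttag(self, tag, attrs):
        self.starts.append((tag, dict(attrs)))


collector = TagCollector()
collector.feed(html)
collector.close()

print(naive_tags)        # the opening tag is cut off mid-attribute
print(collector.starts)  # the parser keeps the quoted value intact
```

Whether that matters depends entirely on what the match is used for, which is the context the Zalgo answer never gives.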
Perhaps the entire approach of "why are you trying to parse X?" needs to be traced and re-evaluated.
Without the colon, the parser appears to be interpreting (? as "zero or one instances of (", but ( is not a full expression by itself and therefore cannot be modified with a quantifier.
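A quick way to check this in one particular engine is to compile the pattern with and without the colon. This is a sketch using Python's `re` module. The patterns and the `try_compile` helper are made up for illustration, and Python words its error differently from the engine discussed here (it treats `(?` as the start of an extension rather than a quantified parenthesis), which is itself a small example of engines disagreeing.

```python
import re


def try_compile(pattern):
    """Hypothetical helper: return "ok" or the engine's error message."""
    try:
        re.compile(pattern)
        return "ok"
    except re.error as exc:
        return f"error: {exc}"


# With the colon, (?: ... ) is a non-capturing group and compiles fine.
with_colon = try_compile(r"(?:foo|bar)+")

# Without the colon, Python rejects the "(?" construct outright.
without_colon = try_compile(r"(?foo|bar)+")

# A quantifier with nothing in front of it is also rejected.
bare_quantifier = try_compile(r"?foo")

# The non-capturing group matches but records no capture groups.
groups = re.fullmatch(r"(?:foo|bar)+", "foobar").groups()

print(with_colon)
print(without_colon)
print(bare_quantifier)
print(groups)
```

The exact error text varies by engine and version, so the sketch only distinguishes "compiles" from "does not compile".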