Hacker News
by unlinkr 2477 days ago
Here is the answer to the question: https://www.cargocultcode.com/solving-the-zalgo-regex/

tl;dr: It can indeed be solved relatively easily with a regex.

2 comments

This is a bit out of my wheelhouse, but this feels wrong, or at least naively capable. It feels like this sort of reasoning leads to the kind of bugs (depending on what you use the result of the regex for) that allow for code injection, a la the Equifax hack.

Maybe another HN poster can back me up, or explain why in fact Zalgo is mistaken and CargoCode is correct.

Either way, this sort of complexity is one reason I avoid XML like the plague and keep HTML at arm's length.

CargoCode is correct. Zalgo simply misread the question because he was so sick of similar but subtly different questions.
This is what I really hate about the Zalgo answer. It instills in people a vague sense that regular expressions are somehow bad, wrong, and dangerous, but without any real arguments or context that would allow you to evaluate whether the feeling is justified.
It doesn't work for me with regex101. "The preceding token is not quantifiable" on this part:

  | < (? \w+ )
See, this is kinda what I mean. Maybe you can detect tags with regex, but maybe you shouldn't, given the widespread but subtle differences in regex engines.

Perhaps the entire approach of "why are you trying to parse X?" needs to be traced and re-evaluated.

> Maybe you can detect tags with regex, but maybe you shouldn't...

So what do you think would be a more appropriate choice for writing a tokenizer?

You want (?:, not (?

Without the colon, the parser appears to be interpreting (? as "zero or one instances of (", but ( is not a full expression by itself and therefore cannot be modified with a quantifier.
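
For anyone who wants to check the (?: vs (? difference locally, here is a minimal sketch. It assumes Python's re module rather than regex101, so the error message will differ from the one quoted above, and the helper name compiles is made up for illustration. re.VERBOSE is used because the pattern in the thread contains whitespace that is not meant to match anything.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern (hypothetical helper)."""
    try:
        re.compile(pattern, re.VERBOSE)
        return True
    except re.error:
        return False


# The fragment as posted in the thread: "(?" with nothing valid after it.
broken = r"< (? \w+ )"

# The same fragment with a non-capturing group.
fixed = r"< (?: \w+ )"

print(compiles(broken))  # the bare "(?" is rejected
print(compiles(fixed))   # the non-capturing group is accepted

# With whitespace ignored, the fixed pattern is just "<" followed by word characters.
print(re.search(fixed, "<div>", re.VERBOSE).group(0))
```

Python rejects the broken version for a slightly different stated reason than regex101 does, but both refuse to compile it.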

I actually meant (?<tag> in order to create a named capture.
It was supposed to be (?<tag> \w+ ) in order to create a named capture. The <tag> was apparently lost in editing. Thanks for the heads-up.
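
A rough sketch of the named capture, assuming Python's re module. One caveat that bears on the engine-differences point raised earlier: Python spells the named group (?P<tag> ... ) and rejects the (?<tag> ... ) spelling used in this thread. The names TAG_OPEN, tag_names, and accepts are invented for this example, and this only pulls the names of opening tags; it is not a full tokenizer.

```python
import re

# "<" followed by a tag name, captured under the group name "tag".
# Python's re module requires the (?P<name>...) spelling.
TAG_OPEN = re.compile(r"< (?P<tag> \w+ )", re.VERBOSE)


def tag_names(html: str) -> list:
    """Return the names of opening tags, in order of appearance."""
    return [m.group("tag") for m in TAG_OPEN.finditer(html)]


def accepts(pattern: str) -> bool:
    """Return True if Python's re module compiles the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


print(tag_names("<p>Hello <b>world</b></p>"))

# The spelling from the thread is not accepted by Python's re module.
print(accepts(r"<(?<tag>\w+)"))
```

Closing tags such as </b> are skipped because the character after < is not a word character.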