Hacker News new | ask | show | jobs
by hk__2 2474 days ago
It’s a famous post on StackOverflow, but I don’t find it particularly helpful.
5 comments

I found it extremely helpful. The sheer emphasis of the reply made me very curious why the idea of using regex on xml is so bonkers.

- I will never forget that regex can't parse XHTML

- the reason being, regex is insufficiently powerful

- when I first saw this post, I knew little about regex under the hood, this sent me down a wiki hole of FSMs, pushdown automata and turing machines

- this misconception is apparently common enough to be madness-inducing to those that know better

- use a hecking xml parser instead

It almost reminds me of a Bill Nye sketch. Teaching through a bit of non-sequitur and absurdism.

The problem with the answer is it is wrong. The question is about identifying start-tags in XHTML. This is a question of tokenization and can be solved with a regular expression. Indeed, most parsers use regular expressions for the tokenization stage. It is exactly the right tool for the job!

Furthermore, the asker specifically needs to distinguish between start tags and self-closing start tags. This is a token-level difference which is typically not exposed by XHTML parsers. So saying "use a parser" is less than helpful.

I have elaborated a bit in blog post: https://www.cargocultcode.com/solving-the-zalgo-regex/

If you were looking for the reason why a regex cannot parse HTML, it is because HTML has matching nested tags and regex parsers are finite state machines (FSM).

What this means is that a regex parser is like a goldfish. It only knows about the state it is currently in (what it just read) and which possible states it may transition to (what is legally allowed to come next). The fish never remembers where it was before; there is no option to have the legality of a transition depend on what it read _before_ the current state. But this memory function is a requirement to recursively match opening tags to their closing tags - you need to keep a stack of opening stacks somewhere in order to then cross off their closing tags in reverse order. So regex cannot parse HTML.

>What this means is that a regex parser is like a goldfish.

Like a drunk goldfish, or a sober goldfish?

'A method to study short-term memory (STM) in the goldfish.'

'Twenty-one common goldfish (13-15.5 cm long) were randomly divided into alcohol (A) and nonalcohol (NA) groups and were trained in an alcohol solution of 400 mg/100 ml or in water, respectively. All alcohol fish were placed in an alcohol solution of 400 mg/100 ml for 3 hr before training in the same alcohol concentration. Fish were trained on a position discrimination task for 2 consecutive days. The door used for training was that opposite to each fish's spontaneous preference. Savings in relearning on Day 2 was taken as a measure of long term memory strength. Only fish which reached criterion on both days were immediately given 10 forced reversal trails in the opposite direction (i.e., a fish trained on right door was forced to choose the left door.) A and NA subjects were then tested after a 5 min (STM) delay, respectively, in a free choice situation for 10 trails (i.e., neither door was blocked). The results suggest that alcohol facilitates the STM of the forced reversal information.'

https://www.ncbi.nlm.nih.gov/pubmed/935220

The question is about identifying end-tags in XHTML. This is indeed possible with a regex.
<a href="</a>">try this</a>
That is not XHTML.
What about <a>this <!-- </a> --> </a> <!-- </a> -->?
Yes you can tokenize this with a regular expression and extract the valid start and end tags.

If comments in XHTML could nest you would have a problem. But this is not the case.

<input value="how about this? />"/>
That is a valid XHTML tag (if I remember correctly) and can be matched perfectly fine by a regex.
theoretically I believe an end tag really requires a valid start tag.

anyway you can probably answer any number of simple questions about a bit of HTML using regex but as code wants to grow to handle more use cases there will come a time when the solution will break down and the code that wrote to handle all the previous uses will need to be rewritten using something other than regex.

You have to distinguish between the different levels of parsing.

Regexes are appropriate for tokenization, which is the task of recognizing lexical units like start tags, end tags, comments and so on. The SO question is about selecting such tokens, so this can be solved with a regex.

If you have more complex use cases like matching start tags to end tags, you might need a proper parser on top. But you still need tokenization as a stage in that parser! I don't see what you would gain by using something other than regexes for tokenization? I guess in some extreme cases a hand written lexer could be more performant, but in the typical case a regex engine would probably be a lot faster than the alternatives and certainly more maintainable.

I know it is possible to write a parser without a clear tokenization/parsing separation - but it is not clear to me this would be beneficial in any way.

An element requires a start and end tag, or a self-closing start tag.
Had it not gotten extremely lucky, it would have a very negative score. Anything intended to be funny on StackOverflow doesn’t go over well very often.
The answer was made in '09. StackOverflow was more accepting of some humour at the time. And it is making a valid point: regex isn't the right tool for this job.
The question is about how to identify start and end-tags in XHTML. What would be an appropriate tool for that job?
A parser. Specifically, an XHTML parser.
How do you think an XHTML parser is written? In particular, how does an XHTML parser identify tokens like start and end tags?
With a pushdown automaton [1] or something like a linear bounded automaton [2].

[1] https://en.m.wikipedia.org/wiki/Pushdown_automaton

[2] https://en.m.wikipedia.org/wiki/Linear_bounded_automaton

More specifically, a stack lets you keep track of nesting. See an opening tag, push something onto a stack. See a closing tag, pop the stack. If the stack is empty at the end, the tags match.

Parsing XHTML in real life is of course much more complicated than this, but this is the basic idea.

It keeps track of state that a regular expression cannot?
If I were writing a limited parser, in answer to the narrow question being asked, I wouldn't be using regex at all. It's not suited to this particular problem. (For example it would get caught on things like <a attr="<b>"> which may well be valid input.)
I don't think that answer was written with the intent of being particularly helpful, I think it was written with a different goal in mind.
It is helpful in that almost koan way though.
The author said at one point that it was written in a bit of a huff of frustration, and if I remember correctly, also after a pint or two. It's not the best answer answer by any means, but it is one of those small gems of the web, I think.
It is helpful. It may not provide the answer the OP wanted, but it provides the answer they needed.
it's sad that today the stack overflow majority shares your mindset and an answer like that would drown in downvotes or killed by moderation.

some people like to act super serious all the time like they're playing a sitcom version of what they think adulthood is in a quest to be the most boring person on earth like if that's the goal of human interaction

You can be funny and helpful at the same time.
which is exactly what that so answer was
The problem is it is funny and wrong. Apparently it have given a lot of people really confused ideas about what is possible and what is not possible with regular expressions.

If it had been funny and right I would not have a problem with it.

Stack overflow is not a homework help group, the "Regex are the wrong tool for the job" answer is more correct than the answer with the working Regex.