| HN Mirror

	by snazz 2474 days ago
	Had it not gotten extremely lucky, it would have a very negative score. Anything intended to be funny on StackOverflow rarely goes over well.

ChrisSD 2474 days ago

The answer was made in '09. StackOverflow was more accepting of some humour at the time. And it is making a valid point: regex isn't the right tool for this job.

unlinkr 2474 days ago

The question is about how to identify start and end-tags in XHTML. What would be an appropriate tool for that job?

shakna 2474 days ago

A parser. Specifically, an XHTML parser.

unlinkr 2474 days ago

How do you think an XHTML parser is written? In particular, how does an XHTML parser identify tokens like start and end tags?

journalctl 2474 days ago

With a pushdown automaton [1] or something like a linear bounded automaton [2].

[1] https://en.m.wikipedia.org/wiki/Pushdown_automaton

[2] https://en.m.wikipedia.org/wiki/Linear_bounded_automaton

More specifically, a stack lets you keep track of nesting. See an opening tag, push something onto a stack. See a closing tag, pop the stack. If the stack is empty at the end, the tags match.

Parsing XHTML in real life is of course much more complicated than this, but this is the basic idea.
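
To make the stack idea concrete, here is a minimal Python sketch. The `TAG` pattern and the `tags_balanced` helper are hypothetical names made up for illustration, and the pattern assumes no `>` inside attribute values, comments, or CDATA. It also goes one small step beyond "pop on a closing tag": it checks that the popped name equals the closing tag's name, so `<p><b></p></b>` is rejected.

```python
import re

# Simplified, illustrative tag pattern (an assumption, not a full XHTML lexer):
# group 1 = "/" for an end tag, group 2 = tag name, group 3 = "/" if self-closing.
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^<>]*?(/?)>")


def tags_balanced(markup: str) -> bool:
    """Return True if every start tag is closed by a matching end tag, in order."""
    stack = []
    for closing, name, self_closing in TAG.findall(markup):
        if self_closing:
            continue  # <br /> opens and closes in one token, nothing to track
        if closing:
            # An end tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)  # start tag: remember it until it is closed
    return not stack  # leftover entries are tags that were never closed
```

Under these assumptions, `tags_balanced("<p><b>hi</b></p>")` is true and `tags_balanced("<p><b>hi</p></b>")` is false. Note that the regex only finds the tokens; the nesting check is the part that needs the stack.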

unlinkr 2474 days ago

I think there are a lot of knee-jerk answers because people see "XHTML" and "regex" in the same sentence and immediately think "not possible".

But the actual question is clearly not about matching start tags to end tags or building a DOM or anything like that - which indeed would require a stack. The question is about recognizing start and end tags. You can do that perfectly fine with regular expressions - indeed many parsers use regular expressions to tokenize the input before parsing.

Furthermore, the question specifically needs to recognize the difference between start-tags and self-closing tags. A difference which is not exposed by most XHTML parsers as far as I am aware.
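
A rough Python sketch of that kind of tokenizer, for illustration only. `NAME`, `ATTRS`, `TOKEN`, and `tag_tokens` are made-up names, and the grammar is a simplification: it assumes every attribute has a quoted value and it ignores comments, CDATA sections, and processing instructions. It labels each tag as a start tag, an end tag, or a self-closing ("empty") tag without tracking any nesting.

```python
import re

# Simplified building blocks (assumptions, not the full XML grammar).
NAME = r"[A-Za-z_][\w:.-]*"
ATTRS = rf'''(?:\s+{NAME}\s*=\s*(?:"[^"]*"|'[^']*'))*'''

# One named alternative per token kind; exactly one of them matches per tag.
TOKEN = re.compile(
    rf"(?P<end></{NAME}\s*>)"
    rf"|(?P<empty><{NAME}{ATTRS}\s*/>)"
    rf"|(?P<start><{NAME}{ATTRS}\s*>)"
)


def tag_tokens(markup: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs for every tag found, in document order."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(markup)]
```

For example, `tag_tokens('<p class="x">a<br/>b</p>')` yields a `start` token, an `empty` token, and an `end` token. No stack is involved, because nothing here tries to pair `<p>` with `</p>`.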

saagarjha 2474 days ago

It keeps track of state that a regular expression cannot?

unlinkr 2474 days ago

You don't need to keep track of state to match tokens like XHTML start or end tags.

shakna 2473 days ago

If I were writing a limited parser, in answer to the narrow question being asked, I wouldn't be using regex at all. It's not suited to this particular problem. (For example it would get caught on things like <a attr="<b>"> which may well be valid input.)
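
To illustrate the failure mode being described, here is a small Python sketch. The pattern `<[^>]+>` is just an assumed example of a naive tag regex, not one taken from the original question. It stops at the first `>`, which in this input sits inside the attribute value.

```python
import re

# A naive tag pattern (an assumption for illustration): "<", anything but ">", then ">".
NAIVE_TAG = re.compile(r"<[^>]+>")

sample = '<a attr="<b>">text</a>'
naive_matches = NAIVE_TAG.findall(sample)

# The first match is cut off inside the quoted attribute value.
print(naive_matches)  # ['<a attr="<b>', '</a>']
```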

unlinkr 2473 days ago

So how would you tokenize without the use of regular expressions? What more appropriate technique would you use instead?

The example you provide is not XHTML, so it is not really relevant to the discussion. But in any case, a regular expression has no problem recognizing a quoted string.
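
A quick Python sketch of the quoted-string point, under stated assumptions. `QUOTE_AWARE_TAG` is a made-up name, and the pattern is deliberately loose: it only decides where a tag ends and does not validate names or attributes. It treats a quoted string as a single unit, so a `>` inside quotes does not end the tag. (For context: well-formed XML does not allow a literal `<` in an attribute value, so the input below would have to be written with `&lt;` to be XHTML.)

```python
import re

# Inside a tag, consume either a non-quote, non-">" character or a whole quoted string.
QUOTE_AWARE_TAG = re.compile(r"""<(?:[^>"']|"[^"]*"|'[^']*')+>""")

sample = '<a attr="<b>">text</a>'
quote_aware_matches = QUOTE_AWARE_TAG.findall(sample)

# The whole start tag is matched, including the ">" inside the attribute value.
print(quote_aware_matches)  # ['<a attr="<b>">', '</a>']
```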
