| HN Mirror


	by unlinkr 2480 days ago
	The question is about identifying end-tags in XHTML. This is indeed possible with a regex.

nurettin 2480 days ago

unlinkr 2480 days ago

That is not XHTML.

hk__2 2480 days ago

What about <a>this  </a> ?

unlinkr 2480 days ago

Yes you can tokenize this with a regular expression and extract the valid start and end tags.

If comments in XHTML could nest you would have a problem. But this is not the case.
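A minimal sketch of that point, assuming Python's `re` module. The `TOKEN` pattern, the `tokenize` helper and the sample input are all illustrative and simplified (no CDATA, no quoted `>` inside attribute values). Because a comment ends at the first `-->`, a single alternation can consume it whole before anything inside it is mistaken for a tag:

```python
import re

# Alternatives are tried left to right, so a comment is consumed whole
# before anything inside it can be mistaken for a tag.
TOKEN = re.compile(
    r"""
    (?P<comment><!--.*?-->)                        # comment: ends at the first -->
  | (?P<end></[A-Za-z][\w:.-]*\s*>)                # end tag, e.g. </a>
  | (?P<start><[A-Za-z][\w:.-]*(?:\s[^<>]*)?/?>)   # start tag (simplified attributes)
    """,
    re.VERBOSE | re.DOTALL,
)

def tokenize(text):
    """Return (kind, lexeme) pairs for every comment, start tag and end tag."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

print(tokenize("<a>this <!-- </a> --> </a>"))
# [('start', '<a>'), ('comment', '<!-- </a> -->'), ('end', '</a>')]
```

The `</a>` inside the comment never shows up as an end-tag token, because the comment alternative has already swallowed it.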

hk__2 2480 days ago

> Yes you can tokenize this with a regular expression and extract the valid start and end tags.

So you need more than a regular expression, hence your premise is incorrect.

unlinkr 2480 days ago

No, you don't need more than a regular expression. If you want to extract elements, i.e. match start tags to the corresponding end tags, then you need a stack-based parser. But just to extract the start tags (which is the question) a regular expression is sufficient.

The original question is a question about tokenization, not parsing, which is why a regular expression is sufficient.

nurettin 2480 days ago

unlinkr 2480 days ago

That is a valid XHTML tag (if I remember correctly) and can be matched perfectly fine by a regex.

nurettin 2480 days ago

Perhaps something like "([^"]*)" could skip what is inside the string literal. Unless there is "<input" in the string literal, then where you start parsing becomes very important.

unlinkr 2480 days ago

That pattern would indeed match a quoted string. I don't see how it would matter if the quoted string contains something like "<input". It can contain anything except a quote character.
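To illustrate, here is a rough sketch, assuming Python's `re` module and a hypothetical `START_TAG` pattern that is not from the thread. Each attribute value is matched as a complete quoted string, so when the text is scanned from the beginning, a `<input>` sitting inside a value is consumed as part of the enclosing tag and never becomes a match of its own:

```python
import re

# Hypothetical start-tag pattern: every attribute value is consumed as one
# quoted string, so "<" or ">" inside the quotes cannot end the tag early.
START_TAG = re.compile(r"""
    <([A-Za-z][\w:.-]*)            # tag name
    (?:\s+[\w:.-]+\s*=\s*          # attribute name and the equals sign
       (?:"[^"]*"|'[^']*')         # double- or single-quoted value
    )*
    \s*/?>                         # optional self-closing slash
""", re.VERBOSE)

def start_tags(text):
    """Return every start tag found when scanning text from the beginning."""
    return [m.group(0) for m in START_TAG.finditer(text)]

sample = '<p title="see <input> here">x</p><input type="text" />'
print(start_tags(sample))
# ['<p title="see <input> here">', '<input type="text" />']
```

The starting position does matter in one sense: this only works because the scan begins outside any quoted string. A search started in the middle of the `title` value would report the inner `<input>` as a tag.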

bryanrasmussen 2480 days ago

theoretically I believe an end tag really requires a valid start tag.

anyway, you can probably answer any number of simple questions about a bit of HTML using regex, but as the code grows to handle more use cases there will come a time when the solution breaks down, and the code that was written to handle all the previous uses will need to be rewritten using something other than regex.

unlinkr 2480 days ago

You have to distinguish between the different levels of parsing.

Regexes are appropriate for tokenization, which is the task of recognizing lexical units like start tags, end tags, comments and so on. The SO question is about selecting such tokens, so this can be solved with a regex.

If you have more complex use cases like matching start tags to end tags, you might need a proper parser on top. But you still need tokenization as a stage in that parser! I don't see what you would gain by using something other than regexes for tokenization. I guess in some extreme cases a hand-written lexer could be faster, but in the typical case a regex engine would probably be a lot faster than the alternatives and certainly more maintainable.

I know it is possible to write a parser without a clear tokenization/parsing separation - but it is not clear to me this would be beneficial in any way.
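As a sketch of that separation, assuming Python's `re` module: a regex does the tokenization, and a small stack on top of it does the matching of start tags to end tags. The `TAG` pattern and the `is_balanced` helper are hypothetical names, and comments and quoted `>` characters are ignored for brevity:

```python
import re

# Stage 1 (tokenization): a regex that recognizes one tag at a time.
# Groups: leading "/" for end tags, the tag name, trailing "/" for self-closing.
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^<>]*?(/?)>")

def is_balanced(text):
    """Stage 2 (parsing): use a stack to pair each end tag with its start tag."""
    stack = []
    for m in TAG.finditer(text):
        closing, name, self_closing = m.groups()
        if closing:
            # An end tag must close the most recently opened element.
            if not stack or stack.pop() != name:
                return False
        elif not self_closing:
            stack.append(name)
    # Anything left on the stack was opened but never closed.
    return not stack

print(is_balanced("<p><b>x</b><br/></p>"))  # True
print(is_balanced("<p><b>x</p></b>"))       # False
```

The regex alone cannot decide whether the nesting is correct; the stack alone has nothing to work with until the regex has cut the text into tags.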

unlinkr 2480 days ago

An element requires a start and end tag, or a self-closing start tag.