| HN Mirror



	by hk__2 2475 days ago
	What about <a>this <!-- </a> --> </a> <!-- </a> -->?


unlinkr 2475 days ago

Yes, you can tokenize this with a regular expression and extract the valid start and end tags.

If comments in XHTML could nest, you would have a problem, but they cannot.
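
A minimal sketch of that tokenization in Python, run against the string from the parent comment. The pattern and the names `TOKEN_RE` and `tokenize` are my own illustration, not a complete XHTML lexer: it assumes no `>` inside attribute values and ignores CDATA sections and processing instructions. The comment alternative is tried first, so any tag-like text inside a comment is consumed as part of the comment token.

```python
import re

# Illustrative token pattern (an assumption, not a full XHTML lexer).
# Comments come first in the alternation so that "</a>" inside a
# comment is swallowed by the comment token instead of matching as a tag.
TOKEN_RE = re.compile(
    r"(?P<comment><!--.*?-->)"
    r"|(?P<end></[A-Za-z][^>]*>)"
    r"|(?P<start><[A-Za-z][^>]*>)",
    re.DOTALL,
)


def tokenize(text):
    """Return (kind, source_text) pairs for comments and tags, in order."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]


sample = "What about <a>this <!-- </a> --> </a> <!-- </a> -->?"
tokens = tokenize(sample)

# Keep only the real tags and drop the comment tokens.
tags = [text for kind, text in tokens if kind != "comment"]
print(tags)  # ['<a>', '</a>']
```

Only one start tag and one end tag survive; the two `</a>` occurrences inside comments never become tag tokens.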


hk__2 2474 days ago

> Yes you can tokenize this with a regular expression and extract the valid start and end tags.

So you need more than a regular expression, hence your premise is incorrect.


unlinkr 2474 days ago

No, you don't need more than a regular expression. If you want to extract elements, i.e. match start tags to the corresponding end tags, then you need a stack-based parser. But just to extract the start tags (which is the question) a regular expression is sufficient.

The original question is a question about tokenization, not parsing, which is why a regular expression is sufficient.
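
To make the distinction concrete, here is a rough sketch of the parsing half: a regular expression still produces the tokens, but pairing each start tag with its end tag needs a stack. `TAG_RE` and `match_elements` are hypothetical names of mine, and the pattern makes simplifying assumptions (no `>` inside attribute values, no CDATA handling).

```python
import re

# Illustrative pattern: a comment, or a start/end/self-closing tag.
# For a comment, the "name" group stays None.
TAG_RE = re.compile(
    r"<!--.*?-->"
    r"|<(?P<close>/)?(?P<name>[A-Za-z][\w:.-]*)[^>]*?(?P<selfclose>/)?>",
    re.DOTALL,
)


def match_elements(text):
    """Pair start tags with end tags; return (name, element_source) pairs."""
    stack = []   # open elements as (name, start_offset)
    pairs = []   # completed elements, innermost first
    for m in TAG_RE.finditer(text):
        name = m.group("name")
        if name is None:
            continue  # comment token: skip it entirely
        if m.group("selfclose"):
            pairs.append((name, m.group()))
        elif m.group("close"):
            if not stack or stack[-1][0] != name:
                raise ValueError(f"unexpected end tag </{name}>")
            _, start = stack.pop()
            pairs.append((name, text[start:m.end()]))
        else:
            stack.append((name, m.start()))
    if stack:
        raise ValueError(f"unclosed element <{stack[-1][0]}>")
    return pairs


print(match_elements("<a>this <!-- </a> --> </a> <!-- </a> -->"))
# [('a', '<a>this <!-- </a> --> </a>')]
```

The regular expression alone never needs to know which `</a>` closes which `<a>`; only the stack does. That is the line between tokenizing and parsing.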
