by shakna 2475 days ago
If I were writing a limited parser, in answer to the narrow question being asked, I wouldn't be using regex at all. It's not suited to this particular problem. (For example it would get caught on things like <a attr="<b>"> which may well be valid input.)

So how would you tokenize without the use of regular expressions? What more appropriate technique would you use instead?

The example you provide is not XHTML, so it is not really relevant to the discussion. But in any case, a regular expression has no problem recognizing a quoted string.
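To make the quoted-string point concrete, here is a minimal sketch in Python of a pattern that matches one opening tag whose attribute values are quoted strings. The name `TAG`, the helper `match_tag`, and the simplified character classes for names are assumptions made for illustration; this is not a complete XHTML lexical grammar.

```python
import re

# Hypothetical pattern for a single opening tag:
#   group 1: tag name
#   group 2: zero or more attributes with double- or single-quoted values
#   group 3: optional "/" for a self-closing tag
# A quoted value may contain any character except its own quote mark,
# so "<" and ">" inside the quotes do not end the tag.
TAG = re.compile(
    r'''<([A-Za-z][\w:.-]*)((?:\s+[\w:.-]+\s*=\s*(?:"[^"]*"|'[^']*'))*)\s*(/?)>'''
)

def match_tag(text):
    """Return (tag_name, matched_text) if text starts with a tag, else None."""
    m = TAG.match(text)
    if m is None:
        return None
    return m.group(1), m.group(0)

print(match_tag('<a attr="<b>">rest'))
```

The alternation `"[^"]*"|'[^']*'` is the part that recognizes a quoted string; everything else is scaffolding around it.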

> So how would you tokenize without the use of regular expressions?

Since this need doesn't appear to be an everyday one, with clearly defined targets, a simple hand-written lexer isn't hard to write, and will make fewer mistakes than a regex. Just use a scanning approach. As a bonus, you'll still be able to read it in 12 months' time.
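As a rough illustration of what a scanning approach could look like, here is a minimal sketch in Python. The function name `scan_tags` and the token kinds `'tag'` and `'text'` are hypothetical, and the scanner only splits input into tags and text runs; it does not validate names or attributes.

```python
def scan_tags(text):
    """Split text into ('tag', s) and ('text', s) tokens; returns a list.

    Assumption: a tag runs from '<' to the first '>' that is not inside
    a single- or double-quoted attribute value.
    """
    tokens = []
    i, n = 0, len(text)
    while i < n:
        if text[i] == '<':
            j = i + 1
            quote = None  # the quote character we are currently inside, if any
            while j < n:
                c = text[j]
                if quote:
                    if c == quote:
                        quote = None      # closing quote
                elif c in '"\'':
                    quote = c             # opening quote
                elif c == '>':
                    break                 # unquoted '>' ends the tag
                j += 1
            if j >= n:
                raise ValueError(f"unterminated tag starting at offset {i}")
            tokens.append(('tag', text[i:j + 1]))
            i = j + 1
        else:
            # Everything up to the next '<' (or end of input) is text.
            j = text.find('<', i)
            if j == -1:
                j = n
            tokens.append(('text', text[i:j]))
            i = j
    return tokens

print(scan_tags('<a attr="<b>">hi</a>'))
```

The quote tracking is explicit state in a loop rather than an alternation in a pattern, which is the trade-off being argued about in this thread.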

Why would a hand-written lexer have fewer mistakes than a regular expression using an off-the-shelf regex engine? They would need to encode the same lexical grammar, so at that level there is the same amount of complexity.

Writing a lexer by hand is just trading ten lines of regex (a widely known declarative DSL) for hundreds of lines of custom code. I don't see how that would be more maintainable in the long run.