Hacker News
by wodenokoto 1870 days ago
The article goes as far as to say that a parser is not the right tool.

> Not only can the task be solved with a regular expression - regular expressions are basically the only practical way to solve the problem. Which is why none of the clever answers actually suggest another way to solve the problem.

So no, the author is not missing the point at all.


The point is that a parser could very well use regexes under the hood to perform the tokenization, because a regex is the right tool for that job. A language without regex support might use something like lex to generate a lexer. Of course you can write a character-by-character lexer by hand, but that is just equivalent to what a regex would generate.

So saying "this is not possible, use a parser instead" is completely misunderstanding the relationship between lexing and parsing. I wonder how these people think a parser works?
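
To make the lexer/parser split concrete, here is a minimal sketch in Python, assuming a simplified XHTML subset (no comments, no CDATA, no `>` inside attribute values). The names `TOKEN_SPEC`, `tokenize`, and `open_tags` are made up for illustration. Regexes do the tokenizing; a small stack-based layer on top handles the part a regex alone cannot, which is tracking nesting.

```python
import re

# Hypothetical token grammar for a simplified XHTML subset (an assumption):
# no comments, no CDATA, and no '>' inside attribute values.
TOKEN_SPEC = [
    ("close", r"</([A-Za-z][\w:.-]*)\s*>"),
    ("selfclose", r"<([A-Za-z][\w:.-]*)[^<>]*/>"),
    ("open", r"<([A-Za-z][\w:.-]*)[^<>]*>"),
    ("text", r"[^<]+"),
]
TOKEN_RES = [(kind, re.compile(pattern)) for kind, pattern in TOKEN_SPEC]


def tokenize(source):
    """Lexer layer: regexes split the input into (kind, tag_name) tokens."""
    pos = 0
    while pos < len(source):
        for kind, pattern in TOKEN_RES:
            m = pattern.match(source, pos)
            if m:
                # Text tokens carry no tag name.
                name = m.group(1) if kind != "text" else None
                yield kind, name
                pos = m.end()
                break
        else:
            raise ValueError(f"cannot tokenize at offset {pos}")


def open_tags(source):
    """Parser layer: check nesting with a stack, collect non-self-closing open tags."""
    stack, found = [], []
    for kind, name in tokenize(source):
        if kind == "open":
            stack.append(name)
            found.append(name)
        elif kind == "close":
            if not stack or stack.pop() != name:
                raise ValueError(f"mismatched </{name}>")
    if stack:
        raise ValueError(f"unclosed <{stack[-1]}>")
    return found
```

With this sketch, `open_tags('<p>Hello <a href="x">world</a><br /></p>')` returns `['p', 'a']`, and `open_tags('<p><a></p>')` raises `ValueError`.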

I mean that bit is clearly wrong. An XML/HTML parser is a perfectly practical way to solve the problem.

However I completely agree that they didn't miss the point. A regex to do this might be fine for hacky things that you don't need to be robust (e.g. for searching for stuff, measuring stats, one-off scripts etc.).
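
For comparison, a sketch of the library route, using `html.parser` from the Python standard library. The class name `OpenTagCollector` and the helper `collect_open_tags` are hypothetical; `handle_starttag` and `handle_startendtag` are the library's own hooks. This is one way to do it, assuming the goal is to list open tags while skipping XHTML-style self-closing ones.

```python
from html.parser import HTMLParser


class OpenTagCollector(HTMLParser):
    """Collects the names of open tags, skipping self-closing ones."""

    def __init__(self):
        super().__init__()
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Tags written as <br /> are routed here; override to ignore them.
        # (The default implementation would call handle_starttag.)
        pass


def collect_open_tags(markup):
    collector = OpenTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.open_tags
```

The library already deals with cases a quick regex tends to miss, such as a `>` inside a quoted attribute value: `collect_open_tags('<p><a title="a > b">x</a><br /></p>')` returns `['p', 'a']`.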

Regular expressions can be as robust as you need them to be, just like any other kind of code. They are a DSL to create lexers, and they are exactly as robust (or hacky) as if you wrote the same lexer by hand.

C code can be as robust as you need it to be. So why bother with formal verification, safe C coding standards, Rust, etc.?

The answer is that it can be robust, but the effort required to do that is so large that in practice it usually isn't.

Are you arguing that the effort required to make a regex robust and correct is larger than the effort required to make some hand-rolled character-by-character based lexer robust and correct?

Because that sounds counter-intuitive to me. A regex is a higher level DSL for lexing.
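
A small side-by-side sketch of that claim in Python, assuming a deliberately tiny token: an open tag with an ASCII name and no attributes. Both function names are invented for illustration. The two matchers accept the same strings; the regex states the rule in one line, while the manual version spells out every bounds check.

```python
import re

# Assumed token shape: '<', an ASCII letter, ASCII letters/digits,
# optional spaces/tabs/newlines, then '>'.
OPEN_TAG_RE = re.compile(r"<[A-Za-z][A-Za-z0-9]*[ \t\n]*>")


def match_open_tag_regex(text, pos=0):
    """Return the end offset of an open tag at pos, or None."""
    m = OPEN_TAG_RE.match(text, pos)
    return m.end() if m else None


def match_open_tag_manual(text, pos=0):
    """Same rule, written character by character."""
    i, n = pos, len(text)
    if i >= n or text[i] != "<":
        return None
    i += 1
    # First name character must be an ASCII letter.
    if i >= n or not (text[i].isascii() and text[i].isalpha()):
        return None
    i += 1
    # Remaining name characters: ASCII letters or digits.
    while i < n and text[i].isascii() and text[i].isalnum():
        i += 1
    # Optional whitespace before the closing '>'.
    while i < n and text[i] in " \t\n":
        i += 1
    if i >= n or text[i] != ">":
        return None
    return i + 1
```

Neither version is inherently more correct than the other; the example only shows that they encode the same rule at different levels of abstraction.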

That's exactly what I'm arguing. Especially because it's very unlikely that you'd write an XML/HTML parser yourself instead of using somebody else's well-tested library.

OK, but these are two separate questions.

Of course you should use an existing library if it solves the exact problem you have. Don't waste time re-implementing the wheel unless you are doing it for educational purposes. Whether such a library uses regexes under the hood or not is irrelevant as long as it works and is well tested.

But I would certainly like to hear an argument for why you think a regex is less robust than a similar manual character-by-character matcher.