Hacker News new | ask | show | jobs
by musingsole 1180 days ago
No True Scotsman.

For years I struggled to grok the difference between regular expression and parsing. The conversations I would hear about the topic always left me feeling there must be some interesting meat to the difference...and so I would go read up on it only to find myself with the same problem a few months down the line.

The only difference is layers. Enough regular expression becomes a parser. A slim enough compiler is just a templating engine.

Insisting it's a deep, dark magic and not the advanced application of `str.replace` does the industry a disservice.

3 comments

> Insisting it's a deep, dark magic and not the advanced application of `str.replace` does the industry a disservice.

Respectfully, this perspective contradicts the computer science underlying our work--there are in fact different categories of string "languages" that can be parsed using different technologies, and regular expressions are not enough for most "languages".

See https://en.wikipedia.org/wiki/Chomsky_hierarchy for the different categories of languages and the required tech to parse them.

I consider CS theory to be about the deepest and darkest magic I get to use on a daily basis :P.

> The only difference is layers. Enough regular expression becomes a parser. A slim enough compiler is just a templating engine.

You can't add enough layers of standard regex to parse, say, C++ or even a tree of matching parenthesis. You need a different tech like CFGs.

Hope this helps improve the grokking! The "interesting meat" here actually points at some really deep, fundamental CS theory that's really worth knowing.

"regular expression" != "regular language"

I appreciate the academic frameworks that have led to the development of systems, however as a working engineer, ignoring where the rubber meets the road is my entire complaint.

Regular expressions can only parse regular languages. More powerful parser techniques can parse context-free languages or even recursively enumerable languages. These are fundamentally distinct levels of complexity.

It is fundamentally impossible to correctly parse e.g. HTML with regular expressions. See: https://stackoverflow.com/a/1732454

Ah yes, the classic StackOverflow answer. If that doesn't teach you not to parse HTML with regex, nothing will...
Parsing is way more interesting than regexes. Look up recursive descent parsers to get an idea of where to start. They are conceptually simple and great for getting an overview of how, exactly, you can write a recognizer for different sets of text.