by lifthrasiir 2184 days ago
The common case of only two pairs of quotes is indeed regular, but if you want to support either all Unicode quotes (about 60 pairs of them) or C++11 raw string literals `R"delim(...)delim"` (intrinsically not regular) you are out of luck.

For any finite set of quotation character pairs you can get away with a strategy like `(left_char1)[^right_char1]*(right_char1)|(left_char2)[^right_char2]*(right_char2)|...`. Escape characters aren't much harder to accommodate.
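To make the escape-character point concrete, here is a minimal sketch for a single pair, assuming plain double quotes with backslash as the escape character (the sample text and the name `quoted` are just for illustration):

```python
import re

# Assumption: double-quoted strings, backslash is the escape character.
# Inside the quotes, accept either a character that is neither a quote
# nor a backslash, or a backslash followed by any single character.
quoted = re.compile(r'"(?:[^"\\]|\\.)*"')

text = r'say "hello \"world\"" and "bye"'
matches = quoted.findall(text)
print(matches)
```

Note that `.` does not match a newline by default, so an escaped line break would need `re.DOTALL`.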
Of course, but you are out of luck in terms of complexity. (Colloquial) regular expressions lack any kind of abstraction.
Absolutely. I don't for a moment think they're the right tool for the job.

They are fairly powerful in terms of what they can parse, however (not enough for an arbitrary HTML document, but enough to handle the hairier situations in this thread that people thought they couldn't). That means a regular expression generator can handle all of those situations as well, and can potentially be much more readable.

If I found myself writing code like this, I'd still want to reach for a better parsing technology, but you can use other languages to add abstractions to regex. Here's a Python 3.6+ example, assuming any needed backslash escaping has already been applied to the characters in `pairs`:

  '|'.join(rf'{a}[^{b}]*{b}' for a,b in pairs)
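A slightly fuller sketch of the same idea that does the escaping itself with `re.escape`, so the pairs can be passed in raw. The helper name `quote_pattern` and the sample pairs are made up for illustration, and it assumes every delimiter is a single character (a multi-character closer wouldn't work inside `[^...]`):

```python
import re

def quote_pattern(pairs):
    """Build one alternation that matches a quoted span for each (open, close) pair.

    Assumes each delimiter is a single character, since the closer is
    placed inside a negated character class.
    """
    return '|'.join(
        rf'{re.escape(a)}[^{re.escape(b)}]*{re.escape(b)}'
        for a, b in pairs
    )

# Hypothetical sample pairs; extend the list with whatever quotes you need.
pairs = [('"', '"'), ('“', '”'), ('«', '»'), ('[', ']')]
pattern = re.compile(quote_pattern(pairs))

found = pattern.findall('He said “hi”, then «bye», then "ok" and [note].')
print(found)
```

Adding another quote style is then a one-line change to `pairs` rather than another hand-written branch of the regex, which is the abstraction the raw pattern lacks.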