| A couple of notes: - The Austin Group has recently accepted lazy quantifiers for inclusion into the next release of POSIX[1]. They think they have worked out a reasonable declarative definition for what they should mean. I am less than sure of that, but either way dismissing the whole thing as irredeemably tied to backtracking seems inappropriate. - Once again the generalization in the title is AFAIK largely correct for industrial engines, but incorrect—arguably to the point of being misleading—for academic work. Just looking into the “parsing” subfolder of my papers stash reveals a 1998 paper[2] on linear-time maximal-munch tokenization, so at the very least the problem was recognized, and IIRC there’s a bunch of related work around that paper too. - It is true that you can’t stream the haystack in the general case, but to what precise extent you can is an interesting question with a known algorithmic answer[3]. [1] https://www.austingroupbugs.net/view.php?id=793, https://www.austingroupbugs.net/view.php?id=1329, https://www.austingroupbugs.net/view.php?id=1857, see also the mailing list. [2] Reps (1998), ACM TOPLAS 20(2), 259–273, https://dl.acm.org/doi/10.1145/276393.276394, https://research.cs.wisc.edu/wpis/papers/toplas98b.pdf. [3] Grathwohl, Henglein, Rasmussen (2014), ICTAC ’14, LNCS 8687, 224–240, https://link.springer.com/chapter/10.1007/978-3-319-10882-7_..., https://utr.dk/pubs/files/grathwohl2014-0-paper.pdf. |
That is surprising!
We've found that in certain simpler scenarios it's possible to use complement to express lazy quantifiers, but in the general case it appears very fragile.
a simple example: a.*?b can be rewritten to something like (a.*b&~(a.*b_+)) in RE# syntax, which effectively means "there is a match, but must not contain a shorter match"
> 1998 TOPLAS
I will read through it properly, but i can already see from page 8 that it requires a table of (DFA size x input length) which makes me very suspicious that it is more of a thought exercise than a real world solution.
> It is true that you can’t stream the haystack in the general case, but to what precise extent you can is an interesting question with a known algorithmic answer[3].
Thank you for this, this is an interesting paper