| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dasht 5987 days ago

Hi, rsc. I'm Thomas Lord. Also, I got something wrong in my pedantry there, which is the absolute worst place to get this kind of thing wrong: it's not O(1) space - it's O(length-of-regexp), no?

Going through the points:

1. Ok. I think you can also say, overall, linear in X times Y.

2. The 1985 Dragon edition has, in the second to last paragraph of section 3.7, these remarks: "A third approach is to use a DFA, but avoid constructing all of the transition table by using a technique called 'lazy transition evaluation'. Here, transitions are computed at run time but a transition from a given state on a given character is not determined until it is actually needed. The computed transitions are stored in a cache. Each time a transition is about to be made the cache is consulted. If the transition is not there, it is computed and stored in the cache. If the cache becomes full, we can erase some previously computed transition to make room for the new transition."

There's nothin' new under the sun, kid ;-)

(And that paragraph cost me a few years of part time hacking :-)

Oh, and, I don't think you're "stealing" anything from anyone, let alone thunder. Just that it's well known.

It's not currently separately distributed and, like I said, the code is a bit embarrassingly hairy and messed up -- but if you grab the source of GNU Arch and look deep in the tree under the "hackerlab" library you'll find Rx which has a (fairly old, prbly now slightly bitrotted) implementation of a caching DFA. Atop that -- oh you asked for references -- is Henry Spencer's algorithm for full Posix matching (backreferences, etc.) which I recall that he dubbed "recursive decomposition". The implementation in Rx is haired up because I wanted both tightly controllable DFA caching and the ability to match against strings that weren't contiguous in memory -- and then it's haired up more after than when I bolted on some Unicode support for UTF-8 and UTF-16; I've no cite for you for the Spencer thing -- personal correspondence.

I've also no cite for you for the other interesting one to look at which (last I knew) is the current regexp implementation in GNU libc from damnit-i'm-lame-but-the-guy's-name-escapes-me. He did an interesting twist on how to handle the non-regular Posix regexp features and made the (to my knowledge) first new thing since the naive backtracking versions and since Spencer's recursive decomposition.

With the addition of you, btw, I think there are maybe about 5 or 10 of us in the world who know or care about any of this stuff at the nitty-gritty implementation level, btw.

3. In Rx, looking at the code (haven't touched for several years), it appears that when flushing a DFA state I did it in two stages. First, I just brute-force away the incoming transitions and mark them as "warning, that state is going to go away." If one of those transitions is then taken before the state actually goes away it gets points to keep that state alive in the cache. Otherwise, the state eventually gets flushed. Off the top of my head I'm not able to prove that this wins over the naive full flush you've got. I am sure the naive full flush is simpler to code in a confident way - the Rx code is more hairy than I'd like. I'm "pretty sure" you can get better performance on a wider range of cases with something more like the approach I took in Rx. It's been a long time since I've handled the code and benchmarked stuff, though so, please, accept this fine grain of salt along with those opinions.

4. I'll take your word for that. Back before unicode support when I was most intensely optimizing Rx I spent a decent amount of time looking at the assembly code generated by the compiler and shaving instructions here and there. Incredibly f'ed up hobby, that :-)

-------------

Those points aside, more shop talk:

Back in yonder day when I was active on that project one of my fantasies (hence all the stuff about tunable memory usage, non-contiguous strings, etc.) was to make a dfa engine (not the posix regexp engine, just the dfa bit) that would rock Cisco's world so much that they'd just have to throw money at me. Heh.

The other thing is more interesting and useful, though: More recently I did some crazy-strange work in a backwater of the "bioinformatics" world. The problem took weeks or months to extract from the biologists but boiled down to: "Here are a few million regexps, all of which are of the form [big hairy regular expression]. Here is the reference sequence for the human genome (all 3bln base pairs or their approximations, a bit over 2 bits per base pair). Go find a way to make a list of all the matches, indexed by which regexp matches what part, as quickly and cheaply as possible."

Rx was useless for this, of course.

The stuff I learned building Rx was the opposite of useless. I'll bet (low stakes) that sometime down the road you'll have a similar experience.