| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by noduerme 1013 days ago
	Just wondering, what is it about testing repetition [a-z]{1,256} with an upper bound that's so heavy? Intuitively it feels like greedy testing [a-z]+ should actually be worse since it has to work back from the end of the input.

2 comments

rntz 1012 days ago

This depends heavily on how repetition is implemented.

With a backtracking-search implementation of regexes, bounded iteration is pretty easy.

But the linked webpage appears to compile regexes to finite state machines (it shows you their finite-state-machine, for instance), and eg [a-z]{1,256} will have 256 states: 256 times the 1 state needed for [a-z]. If [a-z] were a complex regex, you could get a combinatorial explosion.

This alone probably isn't the issue? 256 is not a very large number. But I suspect there are follow-on algorithmic issues. This is just speculation, but I wouldn't be surprised if that 256-state machine were computed by applying DFA minimization, an algorithm with worst-case exponential running time, to a more naively generated machine.

link

d66 1012 days ago

you're right. inclusion/intersection/etc. aren't actually computed via DFA but instead are computed directly on the regular expression representation itself. and large disjunctions (with 256 branches) are what is very heavy.

(it's possible to instead do these operations on DFAs but at the time i found it hard to get from an automata back to a reasonable-looking regular expression.)

link

d66 1013 days ago

the library uses a fairly simple data representation where x{m,n} is compiled using conjunction and disjunction. so x{1,4} ends up being represented as x|xx|xxx|xxxx.

this simplifies the code for testing equality and inclusion, since logically x{n} is just xx... (n times) and x{m,n} is just x{m}|x{m+1}|...|x{n}.

but when you have x{m,n} and n-m is large you can imagine what kind of problems that causes.

link

yorwba 1012 days ago

Seems like it should at least be x(|x(|x(|x))) instead of x|xx|xxx|xxxx to avoid quadratic blow-up.

link

d66 1012 days ago

yes, that is the actual construction: the disjunction data type only supports a lhs and rhs, so that is the only possible way to represent it.

i wrote it the way i did for clarity in the comments.

link