| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by contravariant 1016 days ago
	I feel like it shouldn't be too hard to calculate from the finite automaton that encodes the regular expression, but surely in most cases it will simply be infinite?

3 comments

tetha 1016 days ago

This is hitting back a long time. But the algorithm - if I recall right - is a simple DFS on the determinstic automaton for the regular expression and it can output the full set of matching strings if you're allowed to use *s in the output.

Basically, you need an accumulator of "stuff up to here". If you move from a node to a second node, you add the character annotating that edge to the accumulator. And whenever you end up with an edge to a visited node, you add a '*' and output that, and for leaf nodes, you output the accumulator.

And then you add a silly jumble of parenthesis on entry and output to make it right. This was kinda simple to figure out with stuff like (a(ab)*b)* and such.

This is in O(states) for R and O(2^states) for NR if I recall right.

link

kadoban 1016 days ago

Maybe the number of possible matchings for a given length (or range of lengths) might be interesting?

link

microtonal 1016 days ago

Say you want to compute all strings of length 5 that the automaton can generate. Conceptually the nicest way is to create an automaton that matches any five characters and then compute the intersection between that automaton and the regex automaton. Then you can generate all the strings in the intersection automaton. Of course, IRL, you wouldn't actually generate the intersection automaton (you can easily do this on the fly), but you get the idea.

Automata are really a lost art in modern natural language processing. We used to do things like store a large vocabulary in an deterministic acyclic minimized automaton (nice and compact, so-called dictionary automaton). And then to find, say all words within Levenshtein distance 2 of hacker, create a Levenshtein automaton for hacker and then compute (on the fly) the intersection between the Levenshtein automaton and the dictionary automaton. The language of the automaton is then all words within the intersection automaton.

I wrote a Java package a decade ago that implements some of this stuff:

https://github.com/danieldk/dictomaton

link

contravariant 1016 days ago

> deterministic acyclic minimized automaton

That's basically a Trie right? To be fair I have only heard of them and know they can be used to do neat tricks, I've rarely used one myself.

link

Someone 1016 days ago

If you do not plan to update it often and don’t need to store extra data with each word, a dawg (https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...) is more compact. You often can merge leaf nodes.

For example, if you have words

   talk
   talked
   talking
   talks
   walk
   walked
   walking
   walks

there’s no need to repeat the “”, “ed”, “ing”, “s” parts.

link

microtonal 1016 days ago

No. In a minimized automaton shared string suffixes also share states/transitions.

link

abecedarius 1016 days ago

That was one of the short examples in Norvig's Python program-design course for Udacity. https://github.com/darius/cant/blob/master/library/regex-gen... (I don't have the Python handy.)

link

082349872349872 1016 days ago

see https://www.cs.dartmouth.edu/~doug/nfa.pdf

link