| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by simlevesque 1015 days ago
	Kinda related but I'm looking for something that could give me the number of possible matching strings for a simple regex. Does such a tool exist ?

9 comments

contravariant 1015 days ago

I feel like it shouldn't be too hard to calculate from the finite automaton that encodes the regular expression, but surely in most cases it will simply be infinite?

link

tetha 1015 days ago

This is hitting back a long time. But the algorithm - if I recall right - is a simple DFS on the determinstic automaton for the regular expression and it can output the full set of matching strings if you're allowed to use *s in the output.

Basically, you need an accumulator of "stuff up to here". If you move from a node to a second node, you add the character annotating that edge to the accumulator. And whenever you end up with an edge to a visited node, you add a '*' and output that, and for leaf nodes, you output the accumulator.

And then you add a silly jumble of parenthesis on entry and output to make it right. This was kinda simple to figure out with stuff like (a(ab)*b)* and such.

This is in O(states) for R and O(2^states) for NR if I recall right.

link

kadoban 1015 days ago

Maybe the number of possible matchings for a given length (or range of lengths) might be interesting?

link

microtonal 1015 days ago

Say you want to compute all strings of length 5 that the automaton can generate. Conceptually the nicest way is to create an automaton that matches any five characters and then compute the intersection between that automaton and the regex automaton. Then you can generate all the strings in the intersection automaton. Of course, IRL, you wouldn't actually generate the intersection automaton (you can easily do this on the fly), but you get the idea.

Automata are really a lost art in modern natural language processing. We used to do things like store a large vocabulary in an deterministic acyclic minimized automaton (nice and compact, so-called dictionary automaton). And then to find, say all words within Levenshtein distance 2 of hacker, create a Levenshtein automaton for hacker and then compute (on the fly) the intersection between the Levenshtein automaton and the dictionary automaton. The language of the automaton is then all words within the intersection automaton.

I wrote a Java package a decade ago that implements some of this stuff:

https://github.com/danieldk/dictomaton

link

contravariant 1015 days ago

> deterministic acyclic minimized automaton

That's basically a Trie right? To be fair I have only heard of them and know they can be used to do neat tricks, I've rarely used one myself.

link

Someone 1015 days ago

If you do not plan to update it often and don’t need to store extra data with each word, a dawg (https://en.wikipedia.org/wiki/Deterministic_acyclic_finite_s...) is more compact. You often can merge leaf nodes.

For example, if you have words

   talk
   talked
   talking
   talks
   walk
   walked
   walking
   walks

there’s no need to repeat the “”, “ed”, “ing”, “s” parts.

link

microtonal 1015 days ago

No. In a minimized automaton shared string suffixes also share states/transitions.

link

abecedarius 1015 days ago

That was one of the short examples in Norvig's Python program-design course for Udacity. https://github.com/darius/cant/blob/master/library/regex-gen... (I don't have the Python handy.)

link

082349872349872 1015 days ago

see https://www.cs.dartmouth.edu/~doug/nfa.pdf

link

d66 1015 days ago

the page actually does give these. for α := [a-z]{2,4} the page gives |α| = 475228.

however, as others have pointed out any non-trivial use of the kleene star means the result will be ∞. in this case the page will list numbers that roughly correspond to "number of strings with N applications of kleene star" in addition to infinity.

link

rntz 1014 days ago

Here's a simple Haskell program to do it:

(EDIT: this code is completely wrongheaded and does not work; it assumes that when sequencing regexes, you can take the product of their sizes to find the overall size. This is just not true. See reply, below, for an example.)

    -- https://gist.github.com/rntz/03604e36888a8c6f08bb5e8c665ba9d0

    import qualified Data.List as List

    data Regex = Class [Char]   -- character class
               | Seq [Regex]    -- sequence, ABC
               | Choice [Regex] -- choice, A|B|C
               | Star Regex     -- zero or more, A*
                 deriving (Show)

    data Size = Finite Int | Infinite deriving (Show, Eq)

    instance Num Size where
      abs = undefined; signum = undefined; negate = undefined -- unnecessary
      fromInteger = Finite . fromInteger
      Finite x + Finite y = Finite (x + y)
      _ + _ = Infinite
      Finite x * Finite y = Finite (x * y)
      x * y = if x == 0 || y == 0 then 0 else Infinite

    -- computes size & language (list of matching strings, if regex is finite)
    eval :: Regex -> (Size, [String])
    eval (Class chars) = (Finite (length cset), [[c] | c <- cset])
      where cset = List.nub chars
    eval (Seq regexes) = (product sizes, concat <$> sequence langs)
      where (sizes, langs) = unzip $ map eval regexes
    eval (Choice regexes) = (size, lang)
      where (sizes, langs) = unzip $ map eval regexes
            lang = concat langs
            size = if elem Infinite sizes then Infinite
                   -- finite, so just count 'em. inefficient but works.
                   else Finite (length (List.nub lang))
    eval (Star r) = (size, lang)
      where (rsize, rlang) = eval r
            size | rsize == 0 = 1
                 | rsize == 1 && List.nub rlang == [""] = 1
                 | otherwise = Infinite
            lang = [""] ++ ((++) <$> [x | x <- rlang, x /= ""] <*> lang)

    size :: Regex -> Size
    size = fst . eval

NB. Besides the utter wrong-headedness of the `product` call, the generated string-sets may not be exhaustive for infinite languages, and the original version (I have since edited it) was wrong in several cases for Star (if the argument was nullable or empty).

link

sebzim4500 1014 days ago

Surely that fails for e.g. a?a?a?. I'd imagine you could do some sort of simplification first though to avoid this redundancy.

link

rntz 1014 days ago

You're correct, and I don't see any good way to avoid this that doesn't involve enumerating the actual language (at least when the language is finite).

Oof, my hubris.

link

rntz 1014 days ago

It turns out to be not that hard to just compute the language of the regex, if it is finite, and otherwise note that it is infinite:

    import Prelude hiding (null)
    import Data.Set (Set, toList, fromList, empty, singleton, isSubsetOf, unions, null)

    data Regex = Class [Char]   -- character class
               | Seq [Regex]    -- sequence, ABC
               | Choice [Regex] -- choice, A|B|C
               | Star Regex     -- zero or more, A*
                 deriving (Show)

    -- The language of a regex is either finite or infinite.
    -- We only care about the finite case.
    data Lang = Finite (Set String) | Infinite deriving (Show, Eq)

    zero = Finite empty
    one = Finite (singleton "")

    isEmpty (Finite s) = null s
    isEmpty Infinite = False

    cat :: Lang -> Lang -> Lang
    cat x y | isEmpty x || isEmpty y = zero
    cat (Finite s) (Finite t) = Finite $ fromList [x ++ y | x <- toList s, y <- toList t]
    cat _ _ = Infinite

    subsingleton :: Lang -> Bool
    subsingleton Infinite = False
    subsingleton (Finite s) = isSubsetOf s (fromList [""])

    eval :: Regex -> Lang
    eval (Class chars) = Finite $ fromList [[c] | c <- chars]
    eval (Seq rs) = foldr cat one $ map eval rs
    eval (Choice rs) | any (== Infinite) langs = Infinite
                     | otherwise = Finite $ unions [s | Finite s <- langs]
      where langs = map eval rs
    eval (Star r) | subsingleton (eval r) = one
                  | otherwise = Infinite

link

mikhailfranco 1014 days ago

Another interesting question is: how many possible successful matches are there for a given input string. For example:

How many ways can (a?){m}(a*){m} match the string a{m}

i.e. input m repetitions of the letter 'a'.

https://github.com/mike-french/myrex#ambiguous-example

The answer is a dot product of two vectors sliced from Pascal's Triangle.

For m=9, there are 864,146 successful matches.

link

someguy101010 1015 days ago

might be something like this https://jvns.ca/blog/2016/04/24/how-regular-expressions-go-f... which refs https://en.wikipedia.org/wiki/Brzozowski_derivative

link

Drup 1015 days ago

https://regex-generate.github.io/regenerate/ (I'm one of the authors) enumerates all the matching (and non-matching) strings, which incidentally answers the question, but doesn't terminate in the infinite case.

link

clord 1015 days ago

I feel like it might be possible with dataflow analysis. Stepping through the regex maintaining a liveness set or something like that. Sort of like computing exemplar inputs, but with repetition as permitted exemplars. Honestly probably end up re-encoding the regex in some other format, perhaps with 'optimizations applied.'

link

stvltvs 1015 days ago

The answer is usually an infinite number, except for very, very simple cases. Anything involving * for example means infinity is your answer.

link

skulk 1015 days ago

I wonder if it makes sense to compute an "order type" for a regexp. For example, a* is omega, a*b* is 2 omega.

https://en.m.wikipedia.org/wiki/Order_type

https://en.wikipedia.org/wiki/Ordinal_number

link

contravariant 1015 days ago

I think that's just the ordinal corresponding to lexicographic order on the words in the language, so yeah that should work. I wonder how easy it is to calculate...

link

d66 1015 days ago

normally you would use an ordinal number [1] to label individual elements of an infinite set while using a cardinal number [2] to measure the size of the set.

i believe the cardinality of a set of words from a finite alphabet (with more than one member) is equivalent to the cardinality of the real numbers. this means that the cardinality of .* is c.

unfortunately, i don't think that cardinality gets us very far when trying to differentiate the "complexity" of expressions like [ab]* from ([ab]*c[de]*)*[x-z]*. probably some other metric should be used (maybe something like kolmogorov complexity).

[1] https://en.wikipedia.org/wiki/Ordinal_number

[2] https://en.wikipedia.org/wiki/Cardinal_number

link

contravariant 1015 days ago

I wouldn't say that's their 'normal' usage, I mean sure you can use them like that but fundamentally ordinal numbers are equivalence classes of ordered sets in the same way that cardinal numbers are equivalence classes of sets.

As you've rightly noted the latter equivalence class gets us nothing so throwing away the ordering is a bit of a waste. Of all mathematical concepts 'size' is easily the most subjective so picking one that is interesting is better than trying to be 'correct'.

In particular a*b* is exactly equivalent to ω^2, since a^n b^m < a^x b^y iff n < x or n=x and m<y. This gives an order preserving isomorphism between words of the form a^n b^m and tuples (n,m) with lexicographic ordering.

link

d66 1015 days ago

interesting!

what would [ab]* be? for computing an ordinal number the only real difficulty is how to handle kleene star: given ord(X) how do we calculate ord(X*)?

but as you probably noticed i'm a bit out of my depth when dealing with ordinals.

link

pimlottc 1015 days ago

What’s your use case?

link

simlevesque 1013 days ago

Calculate how long it takes to bruteforce something matching a regexp.

link