Hacker News new | ask | show | jobs
by rntz 1016 days ago
Here's a simple Haskell program to do it:

(EDIT: this code is completely wrongheaded and does not work; it assumes that when sequencing regexes, you can take the product of their sizes to find the overall size. This is just not true. See reply, below, for an example.)

    -- https://gist.github.com/rntz/03604e36888a8c6f08bb5e8c665ba9d0

    import qualified Data.List as List

    data Regex = Class [Char]   -- character class
               | Seq [Regex]    -- sequence, ABC
               | Choice [Regex] -- choice, A|B|C
               | Star Regex     -- zero or more, A*
                 deriving (Show)

    data Size = Finite Int | Infinite deriving (Show, Eq)

    instance Num Size where
      abs = undefined; signum = undefined; negate = undefined -- unnecessary
      fromInteger = Finite . fromInteger
      Finite x + Finite y = Finite (x + y)
      _ + _ = Infinite
      Finite x * Finite y = Finite (x * y)
      x * y = if x == 0 || y == 0 then 0 else Infinite

    -- computes size & language (list of matching strings, if regex is finite)
    eval :: Regex -> (Size, [String])
    eval (Class chars) = (Finite (length cset), [[c] | c <- cset])
      where cset = List.nub chars
    eval (Seq regexes) = (product sizes, concat <$> sequence langs)
      where (sizes, langs) = unzip $ map eval regexes
    eval (Choice regexes) = (size, lang)
      where (sizes, langs) = unzip $ map eval regexes
            lang = concat langs
            size = if elem Infinite sizes then Infinite
                   -- finite, so just count 'em. inefficient but works.
                   else Finite (length (List.nub lang))
    eval (Star r) = (size, lang)
      where (rsize, rlang) = eval r
            size | rsize == 0 = 1
                 | rsize == 1 && List.nub rlang == [""] = 1
                 | otherwise = Infinite
            lang = [""] ++ ((++) <$> [x | x <- rlang, x /= ""] <*> lang)

    size :: Regex -> Size
    size = fst . eval
NB. Besides the utter wrong-headedness of the `product` call, the generated string-sets may not be exhaustive for infinite languages, and the original version (I have since edited it) was wrong in several cases for Star (if the argument was nullable or empty).
1 comments

Surely that fails for e.g. a?a?a?. I'd imagine you could do some sort of simplification first though to avoid this redundancy.
You're correct, and I don't see any good way to avoid this that doesn't involve enumerating the actual language (at least when the language is finite).

Oof, my hubris.

It turns out to be not that hard to just compute the language of the regex, if it is finite, and otherwise note that it is infinite:

    import Prelude hiding (null)
    import Data.Set (Set, toList, fromList, empty, singleton, isSubsetOf, unions, null)

    data Regex = Class [Char]   -- character class
               | Seq [Regex]    -- sequence, ABC
               | Choice [Regex] -- choice, A|B|C
               | Star Regex     -- zero or more, A*
                 deriving (Show)

    -- The language of a regex is either finite or infinite.
    -- We only care about the finite case.
    data Lang = Finite (Set String) | Infinite deriving (Show, Eq)

    zero = Finite empty
    one = Finite (singleton "")

    isEmpty (Finite s) = null s
    isEmpty Infinite = False

    cat :: Lang -> Lang -> Lang
    cat x y | isEmpty x || isEmpty y = zero
    cat (Finite s) (Finite t) = Finite $ fromList [x ++ y | x <- toList s, y <- toList t]
    cat _ _ = Infinite

    subsingleton :: Lang -> Bool
    subsingleton Infinite = False
    subsingleton (Finite s) = isSubsetOf s (fromList [""])

    eval :: Regex -> Lang
    eval (Class chars) = Finite $ fromList [[c] | c <- chars]
    eval (Seq rs) = foldr cat one $ map eval rs
    eval (Choice rs) | any (== Infinite) langs = Infinite
                     | otherwise = Finite $ unions [s | Finite s <- langs]
      where langs = map eval rs
    eval (Star r) | subsingleton (eval r) = one
                  | otherwise = Infinite