I feel like it shouldn't be too hard to calculate from the finite automaton that encodes the regular expression, but surely in most cases it will simply be infinite?
This is hitting back a long time. But the algorithm - if I recall right - is a simple DFS on the determinstic automaton for the regular expression and it can output the full set of matching strings if you're allowed to use *s in the output.
Basically, you need an accumulator of "stuff up to here". If you move from a node to a second node, you add the character annotating that edge to the accumulator. And whenever you end up with an edge to a visited node, you add a '*' and output that, and for leaf nodes, you output the accumulator.
And then you add a silly jumble of parenthesis on entry and output to make it right. This was kinda simple to figure out with stuff like (a(ab)*b)* and such.
This is in O(states) for R and O(2^states) for NR if I recall right.
Say you want to compute all strings of length 5 that the automaton can generate. Conceptually the nicest way is to create an automaton that matches any five characters and then compute the intersection between that automaton and the regex automaton. Then you can generate all the strings in the intersection automaton. Of course, IRL, you wouldn't actually generate the intersection automaton (you can easily do this on the fly), but you get the idea.
Automata are really a lost art in modern natural language processing. We used to do things like store a large vocabulary in an deterministic acyclic minimized automaton (nice and compact, so-called dictionary automaton). And then to find, say all words within Levenshtein distance 2 of hacker, create a Levenshtein automaton for hacker and then compute (on the fly) the intersection between the Levenshtein automaton and the dictionary automaton. The language of the automaton is then all words within the intersection automaton.
I wrote a Java package a decade ago that implements some of this stuff:
the page actually does give these. for α := [a-z]{2,4} the page gives |α| = 475228.
however, as others have pointed out any non-trivial use of the kleene star means the result will be ∞. in this case the page will list numbers that roughly correspond to "number of strings with N applications of kleene star" in addition to infinity.
(EDIT: this code is completely wrongheaded and does not work; it assumes that when sequencing regexes, you can take the product of their sizes to find the overall size. This is just not true. See reply, below, for an example.)
-- https://gist.github.com/rntz/03604e36888a8c6f08bb5e8c665ba9d0
import qualified Data.List as List
data Regex = Class [Char] -- character class
| Seq [Regex] -- sequence, ABC
| Choice [Regex] -- choice, A|B|C
| Star Regex -- zero or more, A*
deriving (Show)
data Size = Finite Int | Infinite deriving (Show, Eq)
instance Num Size where
abs = undefined; signum = undefined; negate = undefined -- unnecessary
fromInteger = Finite . fromInteger
Finite x + Finite y = Finite (x + y)
_ + _ = Infinite
Finite x * Finite y = Finite (x * y)
x * y = if x == 0 || y == 0 then 0 else Infinite
-- computes size & language (list of matching strings, if regex is finite)
eval :: Regex -> (Size, [String])
eval (Class chars) = (Finite (length cset), [[c] | c <- cset])
where cset = List.nub chars
eval (Seq regexes) = (product sizes, concat <$> sequence langs)
where (sizes, langs) = unzip $ map eval regexes
eval (Choice regexes) = (size, lang)
where (sizes, langs) = unzip $ map eval regexes
lang = concat langs
size = if elem Infinite sizes then Infinite
-- finite, so just count 'em. inefficient but works.
else Finite (length (List.nub lang))
eval (Star r) = (size, lang)
where (rsize, rlang) = eval r
size | rsize == 0 = 1
| rsize == 1 && List.nub rlang == [""] = 1
| otherwise = Infinite
lang = [""] ++ ((++) <$> [x | x <- rlang, x /= ""] <*> lang)
size :: Regex -> Size
size = fst . eval
NB. Besides the utter wrong-headedness of the `product` call, the generated string-sets may not be exhaustive for infinite languages, and the original version (I have since edited it) was wrong in several cases for Star (if the argument was nullable or empty).
You're correct, and I don't see any good way to avoid this that doesn't involve enumerating the actual language (at least when the language is finite).
It turns out to be not that hard to just compute the language of the regex, if it is finite, and otherwise note that it is infinite:
import Prelude hiding (null)
import Data.Set (Set, toList, fromList, empty, singleton, isSubsetOf, unions, null)
data Regex = Class [Char] -- character class
| Seq [Regex] -- sequence, ABC
| Choice [Regex] -- choice, A|B|C
| Star Regex -- zero or more, A*
deriving (Show)
-- The language of a regex is either finite or infinite.
-- We only care about the finite case.
data Lang = Finite (Set String) | Infinite deriving (Show, Eq)
zero = Finite empty
one = Finite (singleton "")
isEmpty (Finite s) = null s
isEmpty Infinite = False
cat :: Lang -> Lang -> Lang
cat x y | isEmpty x || isEmpty y = zero
cat (Finite s) (Finite t) = Finite $ fromList [x ++ y | x <- toList s, y <- toList t]
cat _ _ = Infinite
subsingleton :: Lang -> Bool
subsingleton Infinite = False
subsingleton (Finite s) = isSubsetOf s (fromList [""])
eval :: Regex -> Lang
eval (Class chars) = Finite $ fromList [[c] | c <- chars]
eval (Seq rs) = foldr cat one $ map eval rs
eval (Choice rs) | any (== Infinite) langs = Infinite
| otherwise = Finite $ unions [s | Finite s <- langs]
where langs = map eval rs
eval (Star r) | subsingleton (eval r) = one
| otherwise = Infinite
https://regex-generate.github.io/regenerate/ (I'm one of the authors) enumerates all the matching (and non-matching) strings, which incidentally answers the question, but doesn't terminate in the infinite case.
I feel like it might be possible with dataflow analysis. Stepping through the regex maintaining a liveness set or something like that. Sort of like computing exemplar inputs, but with repetition as permitted exemplars. Honestly probably end up re-encoding the regex in some other format, perhaps with 'optimizations applied.'
I think that's just the ordinal corresponding to lexicographic order on the words in the language, so yeah that should work. I wonder how easy it is to calculate...
normally you would use an ordinal number [1] to label individual elements of an infinite set while using a cardinal number [2] to measure the size of the set.
i believe the cardinality of a set of words from a finite alphabet (with more than one member) is equivalent to the cardinality of the real numbers. this means that the cardinality of .* is c.
unfortunately, i don't think that cardinality gets us very far when trying to differentiate the "complexity" of expressions like [ab]* from ([ab]*c[de]*)*[x-z]*. probably some other metric should be used (maybe something like kolmogorov complexity).
I wouldn't say that's their 'normal' usage, I mean sure you can use them like that but fundamentally ordinal numbers are equivalence classes of ordered sets in the same way that cardinal numbers are equivalence classes of sets.
As you've rightly noted the latter equivalence class gets us nothing so throwing away the ordering is a bit of a waste. Of all mathematical concepts 'size' is easily the most subjective so picking one that is interesting is better than trying to be 'correct'.
In particular a*b* is exactly equivalent to ω^2, since a^n b^m < a^x b^y iff n < x or n=x and m<y. This gives an order preserving isomorphism between words of the form a^n b^m and tuples (n,m) with lexicographic ordering.