| [skipping wikipedia links] Formally, regular expressions describe finite automata. A finite automata is a state machine that accepts a language. Hence, a regular expression defines a language [and in particular a regular language]. Regular languages have closure properties, which is to say that under certain operations two regular languages produce another regular language, and therefore certain operations on regular expressions can produce another regular expression. The regular expression generator takes some input and produce a regular expression. There are two possible scenarios: 1. Treat the input as one or more regular languages
and perform only closed operations.
2. Treat the input as some other type of language,
e.g. a context free language and parse the input
according to the rules of that language to produce
a regular expression.
The first option means that we can determine the output of the regular expression generator by knowing only one rule: is does it generate regular expressions starting by adding to the empty language or by difference from the kleene star. But there is still ambiguity that requires we have knowledge about the internal workings of the expression generator.The second option means that in order to determine what output a given input will produce we have to know about a potentially unlimited number of rules (and these rules may change dynamically based on the input)[1]. What this means is that no regular expression generator is generally useful as a black box. Ambiguity assures it, and ambiguity is intrinsic because the regular expression generator must treat the input as within a more complex language (e.g. context free) and force it into the simpler class of regular languages. As my analysis shows, the way in which the regular expression generator interprets the input is arbitrary. If you're really interested in the topic, Jeff Ulman (who with Alfred Aho wrote the book on automata) teaches a course on automata via Coursera. Rumor (well actually an email) has it that he is teaching it again this fall. A person could do worse than learning about automata from a person down a Stanford hallway from Donald Knuth's office. Even cooler is that Jeff Ullman shows up in class to discuss questions (which is probably why the course only comes around once a year). https://www.coursera.org/course/automata Just because lunches cannot be free doesn't mean they can't be cheap. [1] Moreover, via the Halting Problem we cannot be certain that the regular expression generator will produce any output at all. |
The provided examples often cover a subset of the target language; there are infinite regular languages that fit the task---described by provided examples---and infinite number of regular expressions generating it. Regex generator searches for a regular expression taking into account other constraints---i.e: it prefers small regular expressions. In this way we use the regular expression length as heuristic that pushes towards more generic regular expressions and more human-readable-understandable solutions.
We know that we cannot infer a regular language from an incomplete subset of it; regex generator is intended as a practical tool that solves a real-world problem.
Anyhow, the literature is full of papers about inferring an automata from examples: http://link.springer.com/chapter/10.1007/BFb0054059 http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1299597 http://www.sciencedirect.com/science/article/pii/S0031320305... http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1432740