by mimmuz
3981 days ago
I think I understand your point, but I suspect you misunderstood our intent.
Everyday extraction tasks can often be solved with a regular expression; this tool is intended for people whose task is one of those.
If an extraction task cannot be solved with a regular expression, the regex generator is not the right tool for the job. The provided examples usually cover only a subset of the target language: infinitely many regular languages are consistent with the examples, and each of those languages is generated by infinitely many regular expressions.
The regex generator searches for a regular expression while taking other constraints into account; in particular, it prefers small regular expressions.
In this way we use regular expression length as a heuristic that pushes towards more general and more human-readable solutions. We know that a regular language cannot be inferred from an incomplete subset of it; the regex generator is intended as a practical tool that solves a real-world problem. In any case, the literature is full of papers on inferring automata from examples:
http://link.springer.com/chapter/10.1007/BFb0054059
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1299597
http://www.sciencedirect.com/science/article/pii/S0031320305...
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1432740
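
To make the length heuristic concrete, here is a minimal Python sketch. It is not the regex generator's actual search: the candidate list is written by hand, and the names `consistent` and `pick_shortest` are made up for illustration. It only shows the idea that many expressions fit the same examples and that the shortest fitting one is preferred.

```python
import re

def consistent(pattern, positives, negatives):
    """True if pattern fully matches every positive example and no negative one."""
    rx = re.compile(pattern)
    return (all(rx.fullmatch(s) for s in positives)
            and not any(rx.fullmatch(s) for s in negatives))

def pick_shortest(candidates, positives, negatives):
    """Among the candidates that fit the examples, return the shortest (or None)."""
    fitting = [p for p in candidates if consistent(p, positives, negatives)]
    return min(fitting, key=len) if fitting else None

# Hypothetical examples: strings to extract and strings to reject.
positives = ["2014-01-31", "1999-12-01"]
negatives = ["2014/01/31", "31-01-2014x"]

# Hand-written candidates; several of them fit the examples equally well.
candidates = [
    r"2014-01-31|1999-12-01",       # memorizes the examples
    r"[0-9]{4}-[0-9]{2}-[0-9]{2}",  # fits
    r"\d{4}-\d{2}-\d{2}",           # fits
    r"\d+-\d+-\d+",                 # fits, and is the shortest
    r".*",                          # rejected: also matches the negatives
]

best = pick_shortest(candidates, positives, negatives)
print(best)
```

Note that the shortest fitting expression here also accepts strings such as "1-2-3", which none of the examples rule out. Shorter tends to mean more general, so the result depends on how well the examples cover the cases you care about.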
|
2. The only way for a user to know if the regular expression provided by the output is meaningful is to already know the regular expression that describes the desired state machine...or to knowingly accept that the output regular expression is based on rules that are not derived directly from the input and ideally to understand what those rules are. That's fine for sophisticated users who understand the context, but not for people who don't understand automata.
3. The end user gains nothing by withholding training data because the regular expression is deterministic. Withholding training data is only useful for understanding the heuristics of the generator and tuning it. The regular expression itself is simply right or wrong.
4. While automatic generation of automata sounds like the sort of thing that solves real-world problems, automatic generation of the kind of regular expressions programmers rely on seems more likely to produce the bugs that arise when a programmer writes code they don't understand.