by brudgers
3980 days ago
1. The smallest regular expression that contains the input is either the empty language plus the positive examples, or Kleene star minus the negative examples, chosen arbitrarily based on the number of examples in one set versus the other.
2. The only way for a user to know whether the output regular expression is meaningful is to already know the regular expression that describes the desired state machine, or to knowingly accept that the output is based on rules that are not derived directly from the input and, ideally, to understand what those rules are. That's fine for sophisticated users who understand the context, but not for people who don't understand automata.
3. The end user gains nothing by withholding training data because the regular expression is deterministic. Withholding training data is only useful for understanding the heuristics of the generator and tuning it. The regular expression itself is simply right or wrong.
4. While automatic generation of automata sounds like the sort of thing that solves real-world problems, automatic generation of the kind of regular expressions programmers rely on seems more likely to produce the bugs that arise when a programmer writes code they don't understand.
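To make point 1 concrete, here is a minimal Python sketch of the two degenerate answers. The function names are hypothetical, and this is not what the Regex Generator does; it only shows that both expressions are perfectly consistent with the examples while saying nothing about unseen input.

```python
import re

NEVER = r"(?!)"  # empty negative lookahead: matches nothing (the empty language)

def union_of_positives(positives):
    """Empty language plus the positive examples: a plain alternation."""
    alternatives = [re.escape(p) for p in sorted(set(positives))]
    return re.compile("|".join(alternatives) if alternatives else NEVER)

def everything_but_negatives(negatives):
    """Kleene star minus the negative examples: any string except those listed."""
    if not negatives:
        return re.compile(r".*", re.DOTALL)
    alternatives = "|".join(re.escape(n) for n in sorted(set(negatives)))
    # The lookahead rejects a string only if it is exactly one of the negatives.
    return re.compile(r"(?!(?:" + alternatives + r")\Z).*", re.DOTALL)

positives = ["2015-03-01", "2014-12-25"]  # made-up examples
negatives = ["hello", "spam"]

narrow = union_of_positives(positives)
broad = everything_but_negatives(negatives)

for text in ["2015-03-01", "2016-01-01", "spam", "eggs"]:
    print(text, bool(narrow.fullmatch(text)), bool(broad.fullmatch(text)))
```

Both patterns classify every given example correctly, yet they disagree on almost everything else, which is why the examples alone cannot tell the user whether the output is the expression they wanted.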
2. Evaluating solution quality on the training data is wrong. To mitigate the risk of overfitting, the Regex Generator learns regular expressions from half of the training examples and validates them on the other half. We also assessed our algorithm on 20 extraction tasks and evaluated the final solutions on unseen corpora (testing): their quality is comparable to that of expert human solutions.
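The half-and-half idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names, the example data, and the two hand-written candidate patterns are hypothetical, and whole-string matching stands in for the extraction setting the tool actually targets.

```python
import random
import re

def split_examples(examples, seed=0):
    """Shuffle labelled examples and cut them into a learning and a validation half."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def score(pattern, examples):
    """Return (precision, recall) of `pattern` on (text, should_match) pairs."""
    tp = fp = fn = 0
    for text, should_match in examples:
        matched = re.fullmatch(pattern, text) is not None
        if matched and should_match:
            tp += 1
        elif matched:
            fp += 1
        elif should_match:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labelled examples: (text, should it match?)
examples = [
    ("2015-03-01", True), ("2014-12-25", True), ("1999-01-31", True),
    ("2001-07-04", True), ("hello", False), ("12/25/2014", False),
    ("2015-3-1", False), ("spam", False),
]
learn, validate = split_examples(examples)

# Candidate 1 memorises the positives it saw; candidate 2 generalises.
memorised = "|".join(re.escape(t) for t, ok in learn if ok) or r"(?!)"
general = r"\d{4}-\d{2}-\d{2}"

for name, pattern in [("memorised", memorised), ("general", general)]:
    print(name, "learn:", score(pattern, learn), "validate:", score(pattern, validate))
```

A pattern that merely memorises the learning half scores perfectly there and poorly on the held-out half, which is the signal the validation step is meant to provide.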
"Sophisticated" regex users don't need our tool: the Regex Generator is intended for novice users, or to demonstrate that we can automatically find solutions comparable to human ones.
Please note that defining an extraction task is always error-prone: there may be errors in the task definition (understanding) or in the regex coding. Smart, expert programmers make errors too; there may be corner cases they have not thought about. Sometimes you need to get the job done, with fair confidence, and improve it later. This is our view of the real world.
Most important: your criticisms apply to every supervised machine learning problem. Do you really think that driverless cars, antispam filters, automatic translators and so on are useless only because they try to infer a model from partial data? I do not think so.