|
Here is the problem with that: consider the string
abcdefgh
Guess what!? I have the perfect regex to match your string: "abcdefgh"
So given a string literal, there is always a regex to match that literal. Namely, the literal itself.

Really, what you want is a tool that, given several examples, will generate a regex that matches all of them. So you'd give it:
aaaaabaa
aabaaa
aba
abaaaaa
And it'd generate "a+ba+"

The problem with that is, given a corpus with a set of tokens { T0, T1, T2 ... }, I can give you a regex that will match the corpus! "[T0 T1 T2 ... ]*"
or even ".*"
So it will match everything in your corpus! But unfortunately, it will match a whole lot you don't want, too.

So ideally you want a regex that matches everything in your corpus, but nothing outside the language you are trying to describe. This requires both positive and negative learning examples. The problem is that for most applications, you'd need a lot of negative examples.

Source: Working on this exact problem for graduate research |
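To make the point about negative examples concrete, here is a rough Python sketch. It scores three candidate regexes against the four positive examples from the comment plus a handful of negative examples. The negative strings, the `candidates` names, and the `evaluate` helper are all made up for illustration; none of this is the tool being described, just a way to see why positives alone can't separate "a+ba+" from the catch-all patterns.

```python
import re

# Positive examples taken from the comment above.
positives = ["aaaaabaa", "aabaaa", "aba", "abaaaaa"]

# Hypothetical negative examples (an assumption, not from the original post).
negatives = ["", "b", "ab", "ba", "aabbaa", "abcdefgh"]

# Candidate regexes: the intended one, a star over the corpus alphabet,
# and the match-anything pattern.
candidates = {
    "intended": r"a+ba+",
    "alphabet star": r"[ab]*",
    "dot star": r".*",
}

def evaluate(pattern, positives, negatives):
    """Return (positives the pattern misses, negatives it wrongly accepts)."""
    rx = re.compile(pattern)
    missed = [s for s in positives if not rx.fullmatch(s)]
    false_accepts = [s for s in negatives if rx.fullmatch(s)]
    return missed, false_accepts

for name, pattern in candidates.items():
    missed, false_accepts = evaluate(pattern, positives, negatives)
    print(f"{name:14} {pattern:8} missed={missed} false_accepts={false_accepts}")
```

All three candidates miss none of the positives, so the positive set by itself can't tell them apart. Only the false-accept list does, and how much it tells you depends entirely on which negatives you happened to supply.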
But that's pretty stupid, because you don't generalize beyond your examples.
What's your approach?
<em>edit: removed random conjecture</em>