| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by gamegoblin 4499 days ago

Here is the problem with that:

Consider the string abcdefgh

Guess what!? I have the perfect regex to match your string.

  "abcdefgh"

So given a string literal, there is always a regex to match that literal. Namely, the literal itself.

Really, what you want is a tool that, given several examples, will generate a regex that matches all of them.

So you'd give it:

  aaaaabaa
  aabaaa
  aba
  abaaaaa

And it'd generate "a+ba+"

The problem with that is, given a corpus with a set of tokens { T0, T1, T2 ... }, I can give you a regex that will match the corpus!

  "[T0 T1 T2 ... ]*"

or even

  ".*"

So it will match everything in your corpus! But unfortunately, it will match a whole lot you don't want, too.

So ideally you want a regex that matches everything in your corpus, but nothing outside the language you are trying to describe. This requires both positive and negative learning examples. The problem is that for most applications, you'd need a lot of negative examples.

Source: Working on this exact problem for graduate research

3 comments

nmrm 4499 days ago

T0 | T1 | T2 | ... would match exactly the correct thing with all positive examples, and (T0 | T1 | T2) & !(CE1 | CE2 | CE3) would match exactly the correct thing with positive and negative examples.

But that's pretty stupid, because you don't generalize beyond your examples.

What's your approach?

<em>edit: removed random conjecture</em>

link

gamegoblin 4499 days ago

You have to have some sort of heuristic that determines what a "good" regex is, since there are undoubtedly multiple regexes that describe a corpus.

A simple heuristic is the smallest regex.

So in your example, given the training examples:

  aba
  abaa
  aaaaba

and the counter examples:

  abba
  ba
  ab

It's clear to a human I probably want to match "a+ba+". That's clearly much smaller than ("aba" | "abaa" | "aaaaba") & !("abba" | "ba" | "ab"), so it would be a "better" regex.

link

weavie 4499 days ago

Sounds like you want to be able to specify some kind of pattern to define accepted and rejected matches. A regex would be ideal for this. oh wait....

link

Dewie 4499 days ago

Since you're a researcher I must be missing something. But since regexps are closed under union, what is the problem with taking the union of all of them? I'm imagining that it would be conceptually simple to hook up all of the non deterministic state machines such that you get a non deterministic state machine which is the union of all of them. Then convert it to a deterministic state machine. You might get state explosion, but at least you would have found some machine to recognize the language. Is state minimization simple (complexity wise)? Is it even possible to find a decently small DSM in the general case (not necessarily the most minimal machine)?

link

gamegoblin 4499 days ago

My reply to nmrm might answer your question.

Finding some regular expression that matches all of the positive examples and does not match all of the negative examples is trivial. Finding a good regular expression that does that is not.

State minimization does not mitigate this problem. As an aside, state minimization is a polynomial algorithm.

Given the positive examples:

  aba
  abaa
  aaaaba

and the negative examples:

  abba
  ba
  ab

we could make a regex that does something like ("aba" | "abaa" | "aaaaba") & !("abba" | "ba" | "ab"), but unfortunately, running a state minimization algorithm on this regex does not give you "a+ba+" because the two regex are not equivalent (they do not accept the same language).

So you can find plenty of regex that will match your examples and not match your counterexamples, but you cannot easily minimize them to what you do want.

link