You have to have some sort of heuristic that determines what a "good" regex is, since there are undoubtedly multiple regexes that describe a corpus.

A simple heuristic is to prefer the smallest regex that matches every training example and rejects every counterexample.

So in your example, given the training examples:

  aba
  abaa
  aaaaba

and the counterexamples:

  abba
  ba
  ab

It's clear to a human that I probably want to match "a+ba+". That's much smaller than the regex that just enumerates the examples, ("aba" | "abaa" | "aaaaba") & !("abba" | "ba" | "ab"), so it would be a "better" regex under this heuristic.
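
To make the "smallest regex" heuristic concrete, here is a rough brute-force sketch in Python. It is not how a real regex-inference tool would work, just the heuristic taken literally. The assumptions are mine: "size" means pattern length in characters, candidates are built only from the symbols a, b, + and *, and the function name smallest_regex and its parameters are made up for illustration.

```python
import itertools
import re


def smallest_regex(positives, negatives, alphabet="ab+*", max_len=6):
    """Return the shortest pattern over `alphabet` that fully matches every
    string in `positives` and none in `negatives`, or None if no pattern of
    length <= max_len works. Size is measured in characters (an assumption)."""
    for length in range(1, max_len + 1):
        # Try every string of this length as a candidate pattern.
        for chars in itertools.product(alphabet, repeat=length):
            pattern = "".join(chars)
            try:
                rx = re.compile(pattern)
            except re.error:
                continue  # not a valid regex, e.g. "+a"
            matches_all = all(rx.fullmatch(s) for s in positives)
            matches_none = not any(rx.fullmatch(s) for s in negatives)
            if matches_all and matches_none:
                return pattern
    return None


print(smallest_regex(["aba", "abaa", "aaaaba"], ["abba", "ba", "ab"]))
```

With the examples above, this search settles on a+ba+. The search space grows exponentially with pattern length, so this only illustrates the scoring idea; it says nothing about how to search efficiently.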