Hacker News new | ask | show | jobs
by wetherbeei 4687 days ago
The author found the optimal solution when trying to construct these expressions: store them in a trie. Because the generated regular expressions match only the inputs, this solution may find a more compact way to test if the inputs have lots of overlap.

It would be cool if the inputs weren't matched exactly, and frak could figure out a general pattern for your inputs (decimals, capitalized words, etc). That could help newcomers with a starting expression that matches their inputs.

3 comments

This was the subject of my honours thesis[1] back in 2000. I was actually inferring DTDs from sample XML documents, but it's the exact problem as DTDs use regular expressions and I only had positive examples to go on.

Based on existing methods my solution started with the same trie and then generalised to a more flexible DFA by merging states. I used information theory (specifically Minimum Message Length) to turn it into an optimisation problem and tried a few different algorithms, in the addition of Ant Colony Optimisation to an existing algorithm produced the best results for my tests. (They were pretty limited, though.)

[1] http://www.cse.unsw.edu.au/~wong/Papers/jason.pdf

> It would be cool if the inputs weren't matched exactly, and frak could figure out a general pattern for your inputs (decimals, capitalized words, etc).

The problem is that there are an huge number of possible solutions for any given input. For example, you could always give a trivial solution: ".*"

For something like that to work, you'd need a large dictionary of common patterns, and then you'd want to compare against the dictionary to see if it matches a sequence of common patterns.

I can't imagine that sort of thing being too useful.

An alternative solution is to provide a list of matches and non-matches, and look for the shortest or simplest regex that separates the two.

(Finding the actual minimal regex, instead of just a reasonable guess, might be a computationally tough problem. I guess it's in coNP and might be in NP, too. An algorithm in P would be nice to find.)

Update: Finding any separating regex would be in P. A separating finite automaton is easy to find, and then you just convert that into a regex. Now, how do you find the minimal regex?

I really want one of these please message me!!! (tom dot larkworthy at gmail)

preferably python but a link to theory is useful too.

I wrote a brute-force minimal-regex finder once; wish I could remember where I put it.
For your entertainment, https://github.com/matthiasgoergens/Div7 has a regex finder for divisibility. E.g. https://raw.github.com/matthiasgoergens/Div7/master/regex7 checks decimal numbers for divisibility by 7.
Very cute. :-)

I found my code and it wasn't exactly it (it just enumerated all regexes in order of size), but here's a crude solution now: https://github.com/darius/sketchbook/blob/master/regex/find_...

(In defense of my memory, I had written superoptimizers for other things.)

It depends on what the grandparent means with 'not exactly'. If it means 'within a certain edit distance', it is very doable. You can store your dictionary in an automaton and construct a Levenshtein[1] automaton in linear time. The intersection of the dictionary and Levenshtein automaton gives all words within the given edit distance. My library (Dictomaton) that I mention in another comment implements this. There is a different implementation in recent Lucene versions, that is used for finding documents with keywords that approximately match the query.

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.6...

The problem is that this is unlikely to give what you want, since, for example, if my inputs are "123", "2315", "12451", ..., then what I'm looking for is probably going to be something along the lines of "\d+", and not the set of words that are within a small edit distance of a subset of integers.
It depends on the application. For instance, in search engines approximate matching is very useful. E.g. if people want to search for Rotterdam, they make typos like Roterdam or are foreigner and don't know that Rotterdam has a double t.

You can rank candidates afterwards. However, it's a good solution for finding words that are closed to a mistyped word.

I have successfully used such techniques in a spellchecker and various search engines.

With a set of true and false inputs you could evolve a regex:

https://www.lri.fr/~hansen/proceedings/2012/GECCO/companion/...

Why evolve, when you can solve exactly?
Actually, frak doesn't generate patterns that require an exact match. I use Clojure's `re-matches` function (for exact matches) in the example and the tests to show and ensure the generated patterns work as expected. Clojure has another function called `re-find` which is a bit more relaxed. An expression like #"bar" will fail with the string "barn" using `re-matches` but succeed with re-find. I plan to include an option for generating regexes that require exact matches.