| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by wetherbeei 4734 days ago
	The author found the optimal solution when trying to construct these expressions: store them in a trie. Because the generated regular expressions match only the inputs, this solution may find a more compact way to test if the inputs have lots of overlap. It would be cool if the inputs weren't matched exactly, and frak could figure out a general pattern for your inputs (decimals, capitalized words, etc). That could help newcomers with a starting expression that matches their inputs.

3 comments

jsankey 4734 days ago

This was the subject of my honours thesis[1] back in 2000. I was actually inferring DTDs from sample XML documents, but it's the exact problem as DTDs use regular expressions and I only had positive examples to go on.

Based on existing methods my solution started with the same trie and then generalised to a more flexible DFA by merging states. I used information theory (specifically Minimum Message Length) to turn it into an optimisation problem and tried a few different algorithms, in the addition of Ant Colony Optimisation to an existing algorithm produced the best results for my tests. (They were pretty limited, though.)

[1] http://www.cse.unsw.edu.au/~wong/Papers/jason.pdf

link

ori_b 4734 days ago

> It would be cool if the inputs weren't matched exactly, and frak could figure out a general pattern for your inputs (decimals, capitalized words, etc).

The problem is that there are an huge number of possible solutions for any given input. For example, you could always give a trivial solution: ".*"

For something like that to work, you'd need a large dictionary of common patterns, and then you'd want to compare against the dictionary to see if it matches a sequence of common patterns.

I can't imagine that sort of thing being too useful.

link

eru 4734 days ago

An alternative solution is to provide a list of matches and non-matches, and look for the shortest or simplest regex that separates the two.

(Finding the actual minimal regex, instead of just a reasonable guess, might be a computationally tough problem. I guess it's in coNP and might be in NP, too. An algorithm in P would be nice to find.)

Update: Finding any separating regex would be in P. A separating finite automaton is easy to find, and then you just convert that into a regex. Now, how do you find the minimal regex?

link

tlarkworthy 4734 days ago

I really want one of these please message me!!! (tom dot larkworthy at gmail)

preferably python but a link to theory is useful too.

link

abecedarius 4734 days ago

I wrote a brute-force minimal-regex finder once; wish I could remember where I put it.

link

eru 4734 days ago

For your entertainment, https://github.com/matthiasgoergens/Div7 has a regex finder for divisibility. E.g. https://raw.github.com/matthiasgoergens/Div7/master/regex7 checks decimal numbers for divisibility by 7.

link

abecedarius 4734 days ago

Very cute. :-)

I found my code and it wasn't exactly it (it just enumerated all regexes in order of size), but here's a crude solution now: https://github.com/darius/sketchbook/blob/master/regex/find_...

(In defense of my memory, I had written superoptimizers for other things.)

link

microtonal 4734 days ago

It depends on what the grandparent means with 'not exactly'. If it means 'within a certain edit distance', it is very doable. You can store your dictionary in an automaton and construct a Levenshtein[1] automaton in linear time. The intersection of the dictionary and Levenshtein automaton gives all words within the given edit distance. My library (Dictomaton) that I mention in another comment implements this. There is a different implementation in recent Lucene versions, that is used for finding documents with keywords that approximately match the query.

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.6...

link

ori_b 4734 days ago

The problem is that this is unlikely to give what you want, since, for example, if my inputs are "123", "2315", "12451", ..., then what I'm looking for is probably going to be something along the lines of "\d+", and not the set of words that are within a small edit distance of a subset of integers.

link

microtonal 4734 days ago

It depends on the application. For instance, in search engines approximate matching is very useful. E.g. if people want to search for Rotterdam, they make typos like Roterdam or are foreigner and don't know that Rotterdam has a double t.

You can rank candidates afterwards. However, it's a good solution for finding words that are closed to a mistyped word.

I have successfully used such techniques in a spellchecker and various search engines.

link

wetherbeei 4734 days ago

With a set of true and false inputs you could evolve a regex:

https://www.lri.fr/~hansen/proceedings/2012/GECCO/companion/...

link

eru 4734 days ago

Why evolve, when you can solve exactly?

link

noprompt 4733 days ago

Actually, frak doesn't generate patterns that require an exact match. I use Clojure's `re-matches` function (for exact matches) in the example and the tests to show and ensure the generated patterns work as expected. Clojure has another function called `re-find` which is a bit more relaxed. An expression like #"bar" will fail with the string "barn" using `re-matches` but succeed with re-find. I plan to include an option for generating regexes that require exact matches.

link