Hacker News new | ask | show | jobs
by ori_b 4687 days ago
> It would be cool if the inputs weren't matched exactly, and frak could figure out a general pattern for your inputs (decimals, capitalized words, etc).

The problem is that there are an huge number of possible solutions for any given input. For example, you could always give a trivial solution: ".*"

For something like that to work, you'd need a large dictionary of common patterns, and then you'd want to compare against the dictionary to see if it matches a sequence of common patterns.

I can't imagine that sort of thing being too useful.

3 comments

An alternative solution is to provide a list of matches and non-matches, and look for the shortest or simplest regex that separates the two.

(Finding the actual minimal regex, instead of just a reasonable guess, might be a computationally tough problem. I guess it's in coNP and might be in NP, too. An algorithm in P would be nice to find.)

Update: Finding any separating regex would be in P. A separating finite automaton is easy to find, and then you just convert that into a regex. Now, how do you find the minimal regex?

I really want one of these please message me!!! (tom dot larkworthy at gmail)

preferably python but a link to theory is useful too.

I wrote a brute-force minimal-regex finder once; wish I could remember where I put it.
For your entertainment, https://github.com/matthiasgoergens/Div7 has a regex finder for divisibility. E.g. https://raw.github.com/matthiasgoergens/Div7/master/regex7 checks decimal numbers for divisibility by 7.
Very cute. :-)

I found my code and it wasn't exactly it (it just enumerated all regexes in order of size), but here's a crude solution now: https://github.com/darius/sketchbook/blob/master/regex/find_...

(In defense of my memory, I had written superoptimizers for other things.)

It depends on what the grandparent means with 'not exactly'. If it means 'within a certain edit distance', it is very doable. You can store your dictionary in an automaton and construct a Levenshtein[1] automaton in linear time. The intersection of the dictionary and Levenshtein automaton gives all words within the given edit distance. My library (Dictomaton) that I mention in another comment implements this. There is a different implementation in recent Lucene versions, that is used for finding documents with keywords that approximately match the query.

[1] http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.6...

The problem is that this is unlikely to give what you want, since, for example, if my inputs are "123", "2315", "12451", ..., then what I'm looking for is probably going to be something along the lines of "\d+", and not the set of words that are within a small edit distance of a subset of integers.
It depends on the application. For instance, in search engines approximate matching is very useful. E.g. if people want to search for Rotterdam, they make typos like Roterdam or are foreigner and don't know that Rotterdam has a double t.

You can rank candidates afterwards. However, it's a good solution for finding words that are closed to a mistyped word.

I have successfully used such techniques in a spellchecker and various search engines.

With a set of true and false inputs you could evolve a regex:

https://www.lri.fr/~hansen/proceedings/2012/GECCO/companion/...

Why evolve, when you can solve exactly?