by bane 4452 days ago
Regex testing is cool, but there are dozens of these kinds of tools and I'd really love to see some other kinds of regex tools:

- A list generator. Enter a regex, set repetition operator constraints (e.g. *->{0,3}, +->{1,3}, .->[A-Z0-9 ], etc.) and have it exhaustively generate a list of matching strings. This is helpful when you have a regex that matches your test strings but you also want to know what else it'll match. The constraints are to keep it from generating infinite lists. Even if it jams out tens or hundreds of thousands of produced strings, it's still useful. I've found that most people just build up the first regex that will "match" their input text, and move on without thinking about all the edge cases they've just introduced.
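
A minimal sketch of how such a list generator could work, assuming a small regex subset (literals, ., [a-z] classes, groups, |, and the quantifiers * + ? {m} {m,n}). The function name expand and the star, plus and dot parameters are made up for illustration; they stand in for the constraints described above.

```python
import itertools
import re


def expand(pattern, star=(0, 3), plus=(1, 3), dot="ab"):
    """Return a sorted list of every string matched by a small regex subset.

    Unbounded repetition is capped by the hypothetical `star` and `plus`
    constraints, and '.' is limited to the characters in `dot`.
    Not supported: escapes, anchors, negated classes, open-ended {m,}.
    """
    pos = 0

    def alternation():
        nonlocal pos
        results = concatenation()
        while pos < len(pattern) and pattern[pos] == "|":
            pos += 1
            results |= concatenation()
        return results

    def concatenation():
        results = {""}
        while pos < len(pattern) and pattern[pos] not in "|)":
            atoms = atom()
            lo, hi = quantifier()
            repeated = set()
            for n in range(lo, hi + 1):
                repeated |= {"".join(p) for p in itertools.product(atoms, repeat=n)}
            results = {a + b for a in results for b in repeated}
        return results

    def atom():
        nonlocal pos
        ch = pattern[pos]
        pos += 1
        if ch == "(":
            inner = alternation()
            pos += 1  # skip the closing ')'
            return inner
        if ch == "[":
            end = pattern.index("]", pos)
            body, pos = pattern[pos:end], end + 1
            chars, i = set(), 0
            while i < len(body):
                if i + 2 < len(body) and body[i + 1] == "-":
                    chars |= {chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1)}
                    i += 3
                else:
                    chars.add(body[i])
                    i += 1
            return chars
        if ch == ".":
            return set(dot)
        return {ch}

    def quantifier():
        nonlocal pos
        rest = pattern[pos:]
        braces = re.match(r"\{(\d+)(?:,(\d+))?\}", rest)
        if braces:
            pos += braces.end()
            lo = int(braces.group(1))
            return lo, int(braces.group(2) or lo)
        simple = {"*": star, "+": plus, "?": (0, 1)}
        if rest[:1] in simple:
            pos += 1
            return simple[rest[:1]]
        return 1, 1

    return sorted(alternation())


if __name__ == "__main__":
    for s in expand("colou?r(s|ed)?"):
        print(repr(s))
```

Everything is held in memory as sets, so this only suits small patterns; a real tool would probably stream results instead.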

- A regex assembler optimizer. Give it a few regexes, have it assemble them into one large regex and optimize it. It's got to do better than just | or'ing all the regexes together. I've seen some work done on using trie variants to do this, but have no idea how far along the work is on this.
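
A rough sketch of the trie approach, assuming the inputs are plain literal strings rather than full regexes (merging arbitrary regexes is a harder problem). The name assemble is made up here and is not the API of any existing module.

```python
import re


def assemble(words):
    """Merge literal strings into one regex by sharing common prefixes."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # empty key marks "a word may end here"

    def emit(node):
        optional = "" in node
        branches = [re.escape(ch) + emit(child)
                    for ch, child in sorted(node.items()) if ch != ""]
        if not branches:
            return ""
        if len(branches) == 1 and not optional:
            return branches[0]
        if all(len(b) == 1 for b in branches):
            # Only unescaped single characters reach this point,
            # so they are safe to place inside a character class.
            body = "[" + "".join(branches) + "]" if len(branches) > 1 else branches[0]
        else:
            body = "(?:" + "|".join(branches) + ")"
        return body + ("?" if optional else "")

    return emit(trie)


if __name__ == "__main__":
    print(assemble(["foo", "foobar", "fob", "bar", "baz"]))
```

Compared with foo|foobar|fob|bar|baz, the shared prefixes are factored out, so a backtracking engine has fewer alternatives to retry at each position.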

- A regex list generator. Give it a list of strings you want to match and have it generate a regex. A sliding "fuzziness" control could tell it to take alternates in the same character position and substitute one of the following:

1. Just the characters in the given list - a, t and q in the same position generates a|t|q

2. A representative narrow character range - if I give it a|t|q it knows to use [A-Z] while a|t|q|4 might generate [A-Z0-9]

3. A larger character range, a|t|q might just go ahead and produce [A-Z0-9]

4. An even larger character range, whatever it is, just use .

And maybe another slider for repetitions, so if I end up with [A-Z][A-Z][A-Z], should it just produce [A-Z]{3}, or can I go ahead and have it produce [A-Z]+?

Jam the result through an optimizer (see previous idea above) to clean up the regex and maybe even run it through the list generator to check if it produces only what you want.
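
A small sketch of the two sliders, under some loud assumptions: all input strings have the same length, only ASCII letters and digits get widened, and positions with a single character stay literal. The names infer, generalize, fuzz and loose_repeats are invented for illustration. Level 1 emits a character class like [aqt], which is equivalent to a|t|q, and lowercase input widens to [a-z] rather than [A-Z].

```python
import itertools
import re
import string

ALNUM = string.ascii_letters + string.digits
RANGES = (("a-z", string.ascii_lowercase),
          ("A-Z", string.ascii_uppercase),
          ("0-9", string.digits))


def generalize(chars, fuzz):
    """Turn the set of characters seen at one position into a regex fragment."""
    if len(chars) == 1:
        return re.escape(min(chars))       # no alternates: keep the literal
    if fuzz >= 4:
        return "."                         # level 4: anything goes
    if fuzz >= 2 and all(c in ALNUM for c in chars):
        if fuzz == 3:
            return "[A-Za-z0-9]"           # level 3: one wide class
        # level 2: only the ranges that were actually observed
        return "[" + "".join(name for name, alphabet in RANGES
                             if chars & set(alphabet)) + "]"
    # level 1, or non-alphanumeric alternates: just the characters seen
    return "[" + "".join(re.escape(c) for c in sorted(chars)) + "]"


def infer(words, fuzz=1, loose_repeats=False):
    """Build a regex from equal-length strings, position by position."""
    if len({len(w) for w in words}) != 1:
        raise ValueError("this sketch needs non-empty input of equal length")
    parts = [generalize(set(column), fuzz) for column in zip(*words)]
    out = []
    for part, run in itertools.groupby(parts):
        n = len(list(run))
        if n == 1:
            out.append(part)
        elif loose_repeats:
            out.append(part + "+")         # repetition slider: loose
        else:
            out.append(f"{part}{{{n}}}")   # repetition slider: exact count
    return "".join(out)


if __name__ == "__main__":
    sample = ["ab1", "tq4", "qz9"]
    for level in (1, 2, 3, 4):
        print(level, infer(sample, fuzz=level))
    print(infer(sample, fuzz=2, loose_repeats=True))
```

Handling strings of different lengths would need some kind of alignment step first, which is left out here.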

4 comments

>- A regex assembler optimizer. Give it a few regexes, have it assemble them into one large regex and optimize it. It's got to do better than just | or'ing all the regexes together. I've seen some work done on using trie variants to do this, but have no idea how far along the work is on this.

That should be unnecessary if your regex engine does the DFA transformation. Basically, it converts the regexp into a state machine and then combines all of the branches in the state machine to generate synthetic states that can represent the "superposition" of matching multiple branches. This means your regex (once compiled) will run in bounded memory and in time at most proportional to the length of the input (iirc).
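
To illustrate, here is a toy version of that transformation (subset construction), assuming a hand-written NFA for ab|ac with no epsilon moves. The function names and the NFA encoding are made up for this example and are not how any particular engine does it.

```python
def nfa_to_dfa(start, accepting, transitions):
    """Subset construction for an NFA without epsilon moves.

    `transitions` maps (state, symbol) to a set of next states.
    Each DFA state is a frozenset of NFA states (the "superposition").
    """
    start_set = frozenset([start])
    table = {}
    todo = [start_set]
    while todo:
        current = todo.pop()
        if current in table:
            continue
        table[current] = {}
        symbols = {sym for (state, sym) in transitions if state in current}
        for sym in symbols:
            target = frozenset(t for s in current
                               for t in transitions.get((s, sym), ()))
            table[current][sym] = target
            todo.append(target)
    accept = {s for s in table if s & accepting}
    return start_set, accept, table


def matches(dfa, text):
    """One table lookup per input character, no backtracking."""
    state, accept, table = dfa
    for ch in text:
        state = table[state].get(ch)
        if state is None:
            return False
    return state in accept


# Hypothetical NFA for ab|ac: reading 'a' leads to both branches at once.
NFA = {(0, "a"): {1, 3}, (1, "b"): {2}, (3, "c"): {4}}
DFA = nfa_to_dfa(0, {2, 4}, NFA)

if __name__ == "__main__":
    for text in ("ab", "ac", "ad", "a"):
        print(text, matches(DFA, text))
```

After reading a, the DFA sits in the single merged state {1, 3}, so both alternatives are tracked without trying them one at a time.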

I actually do the combining idea all the time. As long as your target language is roughly PCRE compatible, you can use this to spit out your regex and, if necessary, tweak it a bit so it fits that language.

I've generated some very massive regexes that are quite speedy.

Merger

  https://metacpan.org/pod/Regexp::Assemble
These are also super handy

  https://metacpan.org/pod/Number::Range::Regex
  https://metacpan.org/pod/Regexp::Common
Yeah, Regexp::Assemble was what I had in mind. There are a few tools that try to generate a list of matching strings from the expression, but I've never been satisfied with their output. Either they're slow or they don't let you constrain the regex, and none of them generate comprehensive lists for some reason.
> I'd really love to see some other kinds of regex tools

I'd really love to see a better regex syntax. The current one is obviously deficient beyond repair. The tools cannot address the root of the problem.

Why don't you take a crack at it?