Hacker News new | ask | show | jobs
by Klathmon 3675 days ago
Somewhat off topic, but is there a "regex alternation optimizer" out there? And would something like that be worth it?

I've looked through some textmate-style syntax highlighting packages (used in sublime and in github's atom and probably others), and most of them need big (or somewhat big) sets of alternations for a bunch of keywords, and more often than not they are just set up as a list of full keywords with no thought to order or size.

Combining them into something like the below should theoretically be faster while also taking up less space (which is important in web libraries), and I feel like it wouldn't even be all that difficult.

    de(bugger|cimal|clare|f(ault|er)?|init|l(egate|ete)?)
Is there something out there which can do this, and would it even be worth it or is this something best left to the JIT/optimizer of the regex engine?
8 comments

I'm not aware of any tools out of the box, but you could trivially build your example regex from a Trie which is also easy to construct.

https://en.wikipedia.org/wiki/Trie#Algorithms

I'm not so sure it would take up much less space though, if you take gzip compression into account. See for example here:

https://github.com/google/closure-compiler/wiki/FAQ#closure-...

Something like that would probably be better off with Aho-Corasick, with is similar to a trie but ends up with more compact FSMs
Hmm I've never heard of tries before but that looks pretty close to what I'm looking for.

As for the size aspect, it might not make much of a difference in the gzipped filesize but there are other benefits to smaller raw source as well (but honestly I don't have any idea if it would be at all worth anything, I'm assuming no).

Frak [1] does something like that and was written with syntax higlighters in mind.

[1]: https://github.com/noprompt/frak

edit: typo

Wow, this is an interesting idea for a library, thanks @hk__2 for sharing!

TLDR; frak transforms collections of strings into regular expressions for matching those strings. It is available as a command line utility and for the browser as a JavaScript library.

I think emacs has the equivalent built in: `(rx (or "foo" "bar" "baz" "quux"))` gives you: `;; => "\\(?:ba[rz]\\|foo\\|quux\\)"`
Wow that's almost exactly what I had in mind.

I might take a look at this later and see if incorporating this or something like it has any kind of meaningful impact.

Emacs users have used regexp-opt in their own syntax highlighting (aka font-lock) for a while now, at least. If you wanted to port it, it'd depend on whether you're okay with the license, presumably, though I don't think writing a simple version from scratch would be that hard. (Do not let your eyes jump to the source block in the manual and then assume that it's not what you asked: that example is specifically described as the inefficient equivalent.) http://www.gnu.org/software/emacs/manual/html_node/elisp/Reg...
While I don't know the internals of different regex engines, anything that sounds so simple will probably be done by a good regex compiler/engine anyway. The simple | separated list of keywords seems a lot less error prone.

An 2009 article on the V8 blog about a regex optimization they did: http://blog.chromium.org/2009/02/irregexp-google-chromes-new...

That's where a tool like this could come in handy.

A general purpose regex optimizer can't make assumptions like you don't care about the ordering of sub-groups, but a tool like this can.

If you are just looking for the fastest way to to match any 1 of these 40 keywords, this could make a fast regex that can minimize backtracking.

Like I said, I have no idea if it would make any real difference, but if it did incorporating something like that into a build process for syntax highlighting packages could improve performance for a bunch of people.

You are looking for Spintax ! That's how bots do comment-spam.

Example : "{|||Wow! |Hey! |Looks {nice|cool}! }{{I like|I love|Like|Love} {this|your}|{Very nice|Nice|Cool}} {photo|foto|picture|pic|image|img}{.|!|!!| :)| ;)| :D| <3}"

That’s not what OP asked for because it’s not optimized. You can e.g. optimize "I (love|like)" as "I l(ov|ik)e".
Allan Odgaard, creator of TextMate, made one in 2005: https://github.com/textmate/bundle-development.tmbundle/blob...

It’s a command in the “bundle development” bundle.

A friend of mine just wrote an article on this:

http://engineering.khanacademy.org/posts/shortest-regex.htm

Yup! @satyr, best? known for their Coco language, has a tool that does just that: http://satyr.github.io/retrie/