| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Klathmon 3675 days ago

Somewhat off topic, but is there a "regex alternation optimizer" out there? And would something like that be worth it?

I've looked through some textmate-style syntax highlighting packages (used in sublime and in github's atom and probably others), and most of them need big (or somewhat big) sets of alternations for a bunch of keywords, and more often than not they are just set up as a list of full keywords with no thought to order or size.

Combining them into something like the below should theoretically be faster while also taking up less space (which is important in web libraries), and I feel like it wouldn't even be all that difficult.

    de(bugger|cimal|clare|f(ault|er)?|init|l(egate|ete)?)

Is there something out there which can do this, and would it even be worth it or is this something best left to the JIT/optimizer of the regex engine?

8 comments

julian37 3675 days ago

I'm not aware of any tools out of the box, but you could trivially build your example regex from a Trie which is also easy to construct.

https://en.wikipedia.org/wiki/Trie#Algorithms

I'm not so sure it would take up much less space though, if you take gzip compression into account. See for example here:

https://github.com/google/closure-compiler/wiki/FAQ#closure-...

ybx 3675 days ago

Something like that would probably be better off with Aho-Corasick, with is similar to a trie but ends up with more compact FSMs

Klathmon 3675 days ago

Hmm I've never heard of tries before but that looks pretty close to what I'm looking for.

As for the size aspect, it might not make much of a difference in the gzipped filesize but there are other benefits to smaller raw source as well (but honestly I don't have any idea if it would be at all worth anything, I'm assuming no).

hk__2 3675 days ago

Frak [1] does something like that and was written with syntax higlighters in mind.

[1]: https://github.com/noprompt/frak

edit: typo

jaytaylor 3675 days ago

Wow, this is an interesting idea for a library, thanks @hk__2 for sharing!

TLDR; frak transforms collections of strings into regular expressions for matching those strings. It is available as a command line utility and for the browser as a JavaScript library.

sooheon 3675 days ago

I think emacs has the equivalent built in: `(rx (or "foo" "bar" "baz" "quux"))` gives you: `;; => "\\(?:ba[rz]\\|foo\\|quux\\)"`

Klathmon 3675 days ago

Wow that's almost exactly what I had in mind.

I might take a look at this later and see if incorporating this or something like it has any kind of meaningful impact.

dasyatidprime 3675 days ago

Emacs users have used regexp-opt in their own syntax highlighting (aka font-lock) for a while now, at least. If you wanted to port it, it'd depend on whether you're okay with the license, presumably, though I don't think writing a simple version from scratch would be that hard. (Do not let your eyes jump to the source block in the manual and then assume that it's not what you asked: that example is specifically described as the inefficient equivalent.) http://www.gnu.org/software/emacs/manual/html_node/elisp/Reg...

jankassens 3675 days ago

While I don't know the internals of different regex engines, anything that sounds so simple will probably be done by a good regex compiler/engine anyway. The simple | separated list of keywords seems a lot less error prone.

An 2009 article on the V8 blog about a regex optimization they did: http://blog.chromium.org/2009/02/irregexp-google-chromes-new...

Klathmon 3675 days ago

That's where a tool like this could come in handy.

A general purpose regex optimizer can't make assumptions like you don't care about the ordering of sub-groups, but a tool like this can.

If you are just looking for the fastest way to to match any 1 of these 40 keywords, this could make a fast regex that can minimize backtracking.

Like I said, I have no idea if it would make any real difference, but if it did incorporating something like that into a build process for syntax highlighting packages could improve performance for a bunch of people.

simlevesque 3675 days ago

You are looking for Spintax ! That's how bots do comment-spam.

Example : "{|||Wow! |Hey! |Looks {nice|cool}! }{{I like|I love|Like|Love} {this|your}|{Very nice|Nice|Cool}} {photo|foto|picture|pic|image|img}{.|!|!!| :)| ;)| :D| <3}"

hk__2 3675 days ago

That’s not what OP asked for because it’s not optimized. You can e.g. optimize "I (love|like)" as "I l(ov|ik)e".

jacobolus 3675 days ago

Allan Odgaard, creator of TextMate, made one in 2005: https://github.com/textmate/bundle-development.tmbundle/blob...

It’s a command in the “bundle development” bundle.

spicyj 3675 days ago

A friend of mine just wrote an article on this:

http://engineering.khanacademy.org/posts/shortest-regex.htm

akx 3675 days ago

Yup! @satyr, best? known for their Coco language, has a tool that does just that: http://satyr.github.io/retrie/