Hacker News new | ask | show | jobs
by UnoriginalGuy 4139 days ago
Looks like Linq (from .Net/C#). Pretty sexy way to write Regular Expressions if you ask me.

I've "learned" regular expressions multiple times but it just never sticks, I have no idea why. It certainly doesn't help that there are several different incompatible syntaxes (so what I remember and think "should" work doesn't).

I'd prefer to write RegX's in this style, however I would pay attention to performance (not that Regular Expressions are high performance, however I wouldn't want to see a large performance loss either).

4 comments

Regular expressions are high performance if you use automata style(Regular Language) regular expressions, which limits the use of some of the features you can use.

Modern regular expression engines in a lot of languages, actually go beyond the expressiveness of a regular language. This is what damages performance.

There is no reason why this would reduce performance... if its not doing anything crazy.

If anything your taking work away from it. Your building the tree directly here, where as parser would normally build a tree from the string. But since this is integrating into the languages RE library i'm guessing its writing that tree as a string, which is then passed into the regular expression engine, to be turned into a tree again :)

I guess it depends on your definition of "high performance."

If a regular expression runs too often, even pre-compiled (as they should be), you'll want to replace them with code written in the native language. I've gone in and replaced a one line search/replace written in RegX (compiled), with just a C-style for() loop over the wchar array, and had the memory usage drop by near 80% and performance increase by over 60%.

So high performance is all relative. However RegX isn't something I'd describe that way, even compiled. It is a nice way to write complex string parsing code quickly however.

If your regex is replaceable by a simple "find_substring" or equivalent, it's slow.

If your regex is complicated, it will probably beat any naive attempt to write it into conventional string processing, short of reimplementing regexs in the first place. Especially since in many languages, "conventional string processing" may involve the creation of lots of copies and sub-copies.

> you'll want to replace them with code written in the native language

Probably not true for Javascript (and other scripted languages) - matching regex uses native and highly optimized regex lib, which will usually be orders of magnitude faster than implementing this in the language.

That isn't relevant in this context as the library linked couldn't be integrated into JavaScript.
Sorry, which library do you mean? The OP is a javascript library..

I just wanted to point out that regex is much faster in javascript than doing things 'by hand'.

A regular expression implemented as a DFA would literally be looping over the string, and a state transition table. I don't see how performance could be bad.

It is highly dependent on the regular expression engine you use, most don't use automata because of extra features.

Perl 6 unifies "regexes" and recursive descent grammars at the syntax level and then compiles them to a hybrid of DFAs, NFAs, and regular code as necessary. The idea is to maintain the simplicity of simple regexes and of parser combinators for the user but retain as much of the performance benefits of true DFA-able regular expressions as is possible.

At least, that's the theory. In practice, while the benefit of syntactic usability is available today, the Perl 6 rules engine is still very slow and it'll likely take years to optimize the heck out of this approach and really harvest the performance benefit.

I wasn't sure to be impressed or horrified when I learned that Perl supported recursive regular expressions.
If you're interested in something similar for .NET / C#, check out my Regextra library, specifically the Passphrase Regex Builder: https://github.com/amageed/Regextra

As the name suggests though, the focus was on passphrase criteria and it wasn't to produce a DSL for general regex building. The library also supports named templates and a few utility methods.

This is why I dislike the design of Linq. The pattern of chaining function calls to implement a DSL is common enough that they should have employed a general solution, not just a wonky SQL-specific version.
LINQ isn't SQL-specific and does apply generally. It can be used against the standard .NET framework objects and collections. There are different LINQ focuses or flavors, and there are 2 ways to write queries. There's LINQ to Objects, LINQ to XML, LINQ to SQL (no longer actively maintained; nowadays Entity Framework is the Microsoft alternative), and you can write your own LINQ providers to target other purposes.

As for syntax, there's the fluent syntax (chained methods), and there's the query syntax which is syntactic sugar that gets compiled to the methods. The query syntax is probably the biggest reason people mistake LINQ for being SQL specific since it resembles SQL.

E.g.,

  var results = SomeCollection.Where(c => c.SomeProperty < 10)
                              .Select(c => new { c.SomeProperty, c.OtherProperty });
The same thing in query syntax:

  var results = from c in SomeCollection
                where c.SomeProperty < 10
                select new { c.SomeProperty, c.OtherProperty };
Then you can iterate over both the same way:

  foreach (var result in results)
  {
      Console.WriteLine(result);
  }
Performance is unaffected. This provides a fluent and verbose way of building a regular expression. Users of the library then feed the built regular expression into their standard regular expression engine.