Hacker News new | ask | show | jobs
by pionar 1084 days ago
Are there any surprises here? For me, it's how well the .NET regex lib does in compiled mode.
2 comments

I agree. I did not expect .NET to do so well.

Another surprise is perhaps the quadratic benchmark[1]. Even for automata oriented engines, it provokes a case that is unavoidably O(m * n^2). Basically, the automata oriented engines (rust/regex, go/regexp and re2) all have worst case O(m * n) time complexity for an individual search. But when you move to "iterate over all matches in a haystack," that worst case gets bumped up to O(m * n^2) for certain classes of regexes.

All regex engines in the barometer, except for Hyperscan, suffer from this problem. That's not because of any special secret in Hyperscan, but rather, that it implements different match semantics than every other engine.

I'm otherwise finding it hard to say what is "surprising" haha. Perhaps others from the outside can comment.

[1]: https://github.com/BurntSushi/rebar#quadratic

There has been quite a bit of effort in improving regex performance in .NET in the last few years. Stephen Toub has a few interesting (and long) articles:

https://devblogs.microsoft.com/dotnet/regex-performance-impr...

https://devblogs.microsoft.com/dotnet/regular-expression-imp...

From what I remember, there are a bunch of distinct optimizations:

• transforming the regex into a functionally equivalent one with atomic groups to prevent backtracking when it's statically known that backtracking cannot change the result (but the runtime)

• diverting most parts of the actual search to string.IndexOf, IndexOfAny, etc. methods that have been heavily vectorized

• trying to find clever ways in using said methods in finding good starting our continuation points for the match and matching either backwards or forwards as needed to retain most of the vectorized searching benefits with as little unnecessary backtracking as possible (e.g. /a+b/ will likely generate code to search for b and then match backwards instead of trying to find every a and continuing forwards from there)

I guess most of those ideas are neither new nor uncommon (especially in faster engines where I think they take inspiration from), but compared to a rather similar language .NET has quite an edge by now.