Hacker News new | ask | show | jobs
by conartist6 95 days ago
@ievev Have you ever seen an implementation like @bablr/regex? https://github.com/bablr-lang/regex-vm It's an NFA system so it isn't going to be winning any awards for throughput, but in this particular case it does seem to completely avoid the complexity blowup. It will run your heap out of memory though on really big inputs.

The strategy this engine uses is just to evolve the state as a function of time. A match can be successfully completed, yet not be emitted because some other longer match could still supercede it by being longer or more leftmost.

I tried the pattern /d+s+/g on 10,000,000 digits followed by no space. It took 4 seconds to return no results. I tried it on 20,000,000 digits followed by no space. It took 8 seconds to return no results. I tried on 100,000,000 and I ran out of heap space.

Test setup: https://gist.github.com/conartist6/051838025af1e04d966e03aa9...

3 comments

Returning no results is going to be linear in any DFA or NFA based implementation, though. You go character by character, and confirm that there are no matches.

It's only when you return multiple matches that the engines have a problem and become superlinear.

It might be possible to use a randomised algorithm to estimate the number of matches in only linear time.
Do any RE implementations do anything like the query planning that databases do or the rewrites that compilers do that can replace the RE with a different sequence of REs or string searches that might execute faster?

For example in the expression in your example (I'm assuming based on your description of the test data that /d+s+/ means the same as /\d+\s+/ in the RE engines I've used) any match must contain a digit followed by a space.

A scan for all places where a digit is followed by whitespace with each such place then being checked to find the length of the string of whitespace starting there and the length of the string of digits ending there should be linear time and constant heap space.

Not an answer, but related, I was looking at Memgraph (graph db) query planning for Cypher, which is similar (ish) to what you are asking about:

https://memgraph.com/docs/querying/query-plan

The reason why I was looking was to do query planning for a declarative pattern-finding in atomic structures (hierarchical labelled graphs, I suppose) although I'm slowly realising just how insanely hard it might be to get it to work efficiently!

You're correct, I accidentally omitted backslashes on \d and \s. I checked and the pattern was correct during the test.
I have not heard of this before, i will have a look!