| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jasode 2692 days ago

Like the other sibling comments mentioned, I too was confused about "100x faster than regex" and what the actual product was about.

After digging around their website, I found this blog post which explains it better:

https://blog.nezaboodka.com/post/2019/594-using-nevod-for-te...

So my summary would be:

1) it works "faster" than regex in a specific scenario of treating text as entities in natural language. (E.g. higher conceptual abstractions such as qualifiers, variations, etc). If one were to reconstruct Nevod's rules using pure traditional regex (a very complicated regex), executing that regex would be slower because it's more of a "dumb" character sequence matching engine instead of a higher level natural language parser.

2) it's currently an unreleased "text search engine" that presumably will be licensed for you to integrate into your own software. The text matching engine is currently only used in their proprietary database engine. Whether the engine is a library one statically links in like Sqlite -- or -- it's a separate runtime like ElasticSearch that you make API calls to, I don't know.

I notice the CEO is Yury Chetyrko and the submitter is ychetyrko, so maybe he can explain in more detail what exactly Nevod is.

2 comments

bazzargh 2692 days ago

My reading of that is that they compared apples and oranges. They didn't use NLP there at all (stemming, parts of speech tagging, etc); they just relaxed handling of whitespace. The Nevod 'equivalent' was a less general expression to make it seem more maintainable.

The Nevod example translates to (ruby):

    pattern = Regexp.new( "(?<Name>ejection fraction|LVEF)( by visual inspection)?
                           (?<Qualifier>(is|of)( (at least|about|greater than|less than|equal to))?)
                           (?<Value>[0-9]+(-[0-9]+)?|normal|moderate|severe)".gsub(/\s+/, "\\s+"), 
                         Regexp::IGNORECASE)

    ["ejection fraction is at least 70-75",
    "ejection  fraction of about 20",
    "ejection fraction  of 60",
    "ejection  fraction of greater than 65",
    "ejection fraction of 55",
    "ejection fraction by visual  inspection is 65",
    "LVEF is normal"].each do |line|
      puts line
      puts pattern.match(line).inspect
    end

The only trick I used was to substitute literal whitespace in the regex with a whitespace pattern, so that the typed regex was more readable.

link

mikecapizzi 2692 days ago

If you're looking for a tool that allows you to incorporate legitimate NLP approaches, you should have a look at `odin`. Here's a paper https://doi.org/10.1093/database/bay098 showing its usage in the medical domain.

And the code is open-sourced as part of the `processors` library out of the CLULab at the University of Arizona: https://github.com/clulab/processors

The most detailed (though not completely up-to-date) documentation is probably in the manual here: https://arxiv.org/abs/1509.07513

I'm using it at my current job to build an analysis tool for customer-agent phone calls.

It allows you to build rules that match on different levels of abstraction: tokens, pos-tags, dependency paths. You can even match tokens based on word similarity (as measured by cosine similarity of word vectors).

And these rules can "cascade" (i.e. build off of each other). So you can find an entity or event in rule 1 and then look for how that interacts with another matched entity or event in a later rule.

link

Cu3PO42 2692 days ago

That sounds very much like what Rust's RegEx engine does. I understand it extracts longer literals (when available) to find a starting point for where a match might be according to [1].

[1] https://blog.burntsushi.net/ripgrep/#literal-optimizations

link

glangdale 2692 days ago

This is not a new technique with the Rust regex engine. We had a considerably more comprehensive literal 'factoring' approach in Hyperscan about a decade earlier (which also satisfied a lifelong ambition of mine; specifically misusing the netflow algorithm in a graph for something).

The multiple literal implementation in that matcher is also a partial lift from Hyperscan's "Teddy" small group literal matcher (not salty about that as "burntsushi" has been very clear about inspiration). I do wish they'd pull in the rest of the algorithm at some point - the bits that allow merging of Teddy literals into buckets to reduce false positive rates...

link

burntsushi 2692 days ago

While we're wishing for things, I'd love to see you write up how some of the algorithms (like Teddy) work. Hyperscan is clearly a treasure trove of them, but the effort required to extract its secrets is quite large! For example, I've spent quite a bit of time perusing its code (and your excellent mailing list posts), and I still don't think I could describe its execution flow at a high level.

As far as literals go, I was doing that well before I heard about Hyperscan. I got the idea from hints in the literature. (On mobile or else I'd provide a link.)

link

glangdale 2691 days ago

We have a paper accepted to NSDI, so this will hopefully shed some light on things. A conference paper is, sadly, too small to fit more than a few subsystems so a writeup of Teddy isn't included.

As I don't work for Intel any more, it's unlikely that I will put in the effort for anything significantly more comprehensive. I am considering another building another regex matcher but it wouldn't be the encyclopedic list of optimizations approach. Paul Terra refers to Hyperscan as the "Dwarf Fortress of regular expression matchers". I have some new ideas I'd like to try out, in any case.

I wasn't suggesting that your use of literal factoring comes from Hyperscan, only the Teddy matcher. Literal factoring seems an old technique and I don't know the genuinely first cite of that.

link

burntsushi 2691 days ago

Gotya. Thanks for the clarification! I look forward to reading your paper! I'll hold out hope for the folks working on Hyperscan at Intel to do some more in depth write ups.

link

Cu3PO42 2691 days ago

I didn't mean to insinuate that it was. The Rust RegEx engine was simply the first that came to mind for me. What I did intend to say that if this is indeed the optimisation that Nevod claims it has, it is not the first to do it.

However I did take the time to read up on Hyperscan and would like to thank you for contributing to what looks like an excellent tool.

link