FYI: You should reply using the reply links. It's easier to follow threads that way.
Also, does the matching handle blanks or minor variations? Typical clause libraries often take this form: "The Company hereby agrees to sell you [_________________] shares of stock." You probably wouldn't need NLP to match that.
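To illustrate the point about not needing NLP for blanks, here is a minimal sketch that collapses fill-in blanks and whitespace before comparing clauses. The `BLANK` token and the `normalize_clause` helper are hypothetical names made up for this example, not anything from the actual project.

```python
import re

# Hypothetical placeholder token (an assumption for this sketch).
BLANK = "<BLANK>"

def normalize_clause(text: str) -> str:
    """Collapse fill-in blanks, whitespace, and case so template variants compare equal."""
    # Treat a bracketed or bare run of 3+ underscores as one blank.
    text = re.sub(r"\[\s*_{3,}\s*\]|_{3,}", BLANK, text)
    # Collapse runs of whitespace and ignore case.
    return re.sub(r"\s+", " ", text).strip().lower()

a = "The Company hereby agrees to sell you [_________________] shares of stock."
b = "The company hereby agrees to sell you   ______ shares of stock."
same = normalize_clause(a) == normalize_clause(b)
```

This only covers blanks that are still empty. A clause where someone has already typed in a number would need a different rule.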
Sorry about not using the reply. I was wondering why my comment was on top.
Right now I've got some off-the-shelf NLP stuff that does Org and Name recognition to remove those things (I've been working on a similar project for a while). Handling the blank lines should be trivial, but it isn't implemented yet. Most "get screwed" clauses don't have underlines, as they are boilerplate.
Question for you. How do you handle legal copy that isn't in anyone else's document? Or does this not happen often enough?
Even if there's some unique copy, are there plans to let you know which lines of your document it's on? It would be useful to give just those lines to a lawyer for review instead of the entire document.
In the future I hope to tie in the NLP stuff we've been working on to do fuzzy matching.
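For a rough idea of what fuzzy matching could look like, here is a sketch that swaps in plain character-level similarity from Python's standard `difflib` rather than any NLP. The `best_match` helper, the sample clauses, and the 0.9 threshold are all assumptions for illustration.

```python
from difflib import SequenceMatcher

def best_match(line, known_clauses, threshold=0.9):
    """Return (clause, score) for the closest known clause, or None if below threshold."""
    best = None
    for clause in known_clauses:
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge.
        score = SequenceMatcher(None, line.lower(), clause.lower()).ratio()
        if best is None or score > best[1]:
            best = (clause, score)
    return best if best is not None and best[1] >= threshold else None

# Made-up clause library for the example.
known = [
    "The Company hereby agrees to sell you shares of stock.",
    "This Agreement is governed by the laws of Delaware.",
]
hit = best_match("The Company hereby agrees to sell to you shares of stock.", known)
miss = best_match("Payment is due within thirty days.", known)
```

Comparing every line against every known clause gets slow on a large library, so a real version would probably need some kind of index first.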
We're highlighting text we've seen before, so it should be easy to see which lines are actually unique text.
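As a sketch of how the "seen before" highlighting could double as a unique-line report, the snippet below returns the lines of a document that are not in a set of previously seen text. The `unique_lines` helper, its normalization, and the sample text are assumptions, not the site's actual code.

```python
import re

def _key(line: str) -> str:
    # Compare on lowercased text with collapsed whitespace (an assumed normalization).
    return re.sub(r"\s+", " ", line).strip().lower()

def unique_lines(document, seen_lines):
    """Return (1-based line number, text) for non-empty lines not seen before."""
    seen = {_key(s) for s in seen_lines}
    return [
        (n, line)
        for n, line in enumerate(document.splitlines(), start=1)
        if line.strip() and _key(line) not in seen
    ]

# Made-up document and corpus for the example.
doc = (
    "This Agreement is governed by Delaware law.\n"
    "\n"
    "The Founder waives all rights to future equity.\n"
)
seen = {"this agreement is governed by  Delaware law."}
flagged = unique_lines(doc, seen)
```

The flagged lines are the ones you would hand to a lawyer.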