by gopher_protocol 3456 days ago
One of my partner's tasks as a legal assistant is to go through mountains of OCRed PDFs and classify them and extract pieces of data so that lawyers and paralegals can go through them more easily. Do you imagine deep learning would be an appropriate means of automating that, or is it overkill?
3 comments

No, not overkill at all. I think the first step is to augment the human so the machine helps them; then, as time passes, it can take over all of the laborious stuff entirely.

Imagine something like this: a deep network reduces a 20-page document to a summary of 4 or 5 sentences. You can click on these sentences to "expand" them, eventually getting to the original text, which saves the reader from going through the whole document.
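
To make the "click to expand" idea concrete, here is a minimal sketch of the data structure only. It swaps the deep network for a plain word-frequency extractive summary and supports a single level of expansion (summary sentence to source paragraph). The names `SummaryNode`, `split_sentences` and `summarize` are hypothetical, invented for this illustration.

```python
import re
from collections import Counter
from dataclasses import dataclass


@dataclass
class SummaryNode:
    sentence: str  # the one-line summary shown to the reader
    source: str    # the original text revealed when the sentence is expanded


def split_sentences(text):
    # Naive splitter: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def summarize(paragraphs):
    # Word frequencies over the whole document act as a crude importance score.
    freq = Counter(re.findall(r"[a-z']+", " ".join(paragraphs).lower()))

    def score(sentence):
        toks = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)

    nodes = []
    for para in paragraphs:
        sentences = split_sentences(para)
        if not sentences:
            continue
        # Keep the highest-scoring sentence and remember where it came from.
        nodes.append(SummaryNode(sentence=max(sentences, key=score), source=para))
    return nodes
```

Each node keeps a pointer back to its source text, so a UI could show `node.sentence` and reveal `node.source` on click. A real system would presumably nest this over several levels and use a learned summarizer.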

A separate classifier automatically assigns the document to one of, say, 20 categories (or whatever number is appropriate).
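
As a rough baseline for the classification step, here is a sketch of a bag-of-words naive Bayes classifier using only the standard library. It is not a deep network; it is the kind of simple model you might try first to see how far the OCRed text gets you. The class name and the example labels in the usage note are assumptions made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def train(self, examples):
        # examples is an iterable of (text, label) pairs.
        for text, label in examples:
            self.doc_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            # Log prior plus log likelihood with add-one smoothing.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokenize(text):
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Usage would look like `clf = NaiveBayesClassifier(); clf.train([(text, "lease"), (text2, "invoice")]); clf.predict(new_text)`, with one label per document category.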

A deep learning named entity recogniser extracts the human names, dates and times, email addresses, company names, money amounts, and numbers from each document, then sends them off to Elasticsearch for indexing and easy searching.
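
To show the shape of the extract-then-index step, here is a sketch that uses regular expressions as a stand-in for the entity recogniser. Regexes can only handle the rigidly formatted fields (email addresses, dollar amounts, ISO dates); names and companies are where a learned model would be needed. The function name and the document fields are assumptions, and the sketch stops at building the JSON-ready dict rather than calling Elasticsearch.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
MONEY_RE = re.compile(r"\$\d[\d,]*(?:\.\d{2})?")  # dollar amounts only
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")    # ISO dates only


def extract_entities(doc_id, text):
    # Returns a plain dict that could be serialized to JSON and indexed.
    return {
        "doc_id": doc_id,
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "money": MONEY_RE.findall(text),
        "dates": DATE_RE.findall(text),
        "text": text,
    }
```

Calling `extract_entities("doc-1", text)` gives one record per document; keeping the entities in separate fields is what makes queries like "all documents mentioning this address" cheap once they are indexed.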

Then we can start to play with higher-level legal concepts that (for example) set precedent, or search for certain logical fallacies. (The next step past machine learning is machine reasoning, and it's starting to be possible now.)

Not overkill at all, but you should know that there are a lot of competitors in the e-discovery software category, and they do a lot besides classification. That starts with the ingestion of email archives and continues through extraction of attachments (recursively, because Outlook PSTs get mailed as attachments too), identifying and attributing quoted text, deduplication of messages and documents, SNA, clustering, normalization, canonicalization, NER, and synonym identification (aside from correctly identifying people referred to by nicknames, advanced implementations can also figure out when people are using a code word to evade filters).
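
To give a feel for just one item on that list, here is a minimal sketch of message deduplication: strip quoted reply lines, normalize whitespace and case, then hash. It only catches copies that are identical after normalization; near-duplicate detection is a harder problem and is not attempted here. The function names, and the assumption that quoted lines start with ">", are mine.

```python
import hashlib
import re


def normalize(body):
    # Drop quoted reply lines (those starting with ">"), then collapse whitespace.
    lines = [ln for ln in body.splitlines() if not ln.lstrip().startswith(">")]
    return re.sub(r"\s+", " ", " ".join(lines)).strip().lower()


def fingerprint(body):
    return hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()


def deduplicate(messages):
    # Keep the first message seen for each fingerprint, preserving order.
    seen = set()
    unique = []
    for msg in messages:
        key = fingerprint(msg)
        if key not in seen:
            seen.add(key)
            unique.append(msg)
    return unique
```
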
I have built deep learning systems for similar problem domains. I am happy to chat to you about your example. Email me if you get a chance.