by abc03 3639 days ago
Does anyone know which machine learning methods are state of the art for data extraction in general (invoices, images, documents, etc.), or where I could get an overview?
4 comments

Research papers often use the phrase "wrapper induction" when discussing automatic data extraction. Once I discovered that phrase I found several papers. I'm on mobile so can't link any.
Here is a '12 survey: http://arxiv.org/abs/1207.0246
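To make the term concrete, here is a minimal sketch of the simplest kind of wrapper induction (an "LR" wrapper for a single field): from a few labeled pages it learns the text that always precedes and always follows the target value, then uses those two delimiters to extract from unseen pages. The function names and the sample pages are made up for illustration; real systems described in the literature handle multiple fields, noisy markup, and repeated records.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_lr_wrapper(examples):
    """Learn (left, right) delimiters from (page, target) pairs.

    Assumes each target occurs in its page; the first occurrence is used.
    """
    lefts, rights = [], []
    for page, target in examples:
        start = page.index(target)
        lefts.append(page[:start])                 # text before the target
        rights.append(page[start + len(target):])  # text after the target
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("no common delimiters found")
    return left, right


def apply_wrapper(wrapper, page):
    """Extract the text between the learned delimiters, or None."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    if end == -1:
        return None
    return page[start:end]


# Hypothetical training pages with the price labeled by hand.
training = [
    ('<html><h1>Shop</h1><span class="price">19.99</span></html>', "19.99"),
    ('<html><h1>Store</h1><span class="price">5.00</span></html>', "5.00"),
]
wrapper = induce_lr_wrapper(training)
new_page = '<html><h1>Mart</h1><span class="price">42.50</span></html>'
extracted = apply_wrapper(wrapper, new_page)
```

The learned delimiters are only as general as the training pages: with two near-identical pages the left delimiter still includes the closing `</h1>`, so more varied examples would be needed to generalize.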
We would need more context/information about your specific objectives.

- document conversion (pdftotext, Apache PDFBox, Tabula, etc.)

- OCR (tesseract, pypdfocr, etc.)

- named-entity recognition (NER), i.e. finding and recognizing entities in text (DBpedia Spotlight, Stanford NER via NLTK, spaCy)

- coreference resolution, dependency parsing (spaCy, SyntaxNet)

Thanks. Some great keywords to investigate. I'm mainly interested in two areas at the moment:

- invoices (I guess NER would be partially an option)

- web scraping (wrapper induction)
There are many methods for different tasks. What do you mean by 'data extraction'? Do you have some specific examples?
After the OCR of documents, I have used mostly regex to extract information from semi-structured documents. One example would be invoices (invoice number, total amount etc.), another would be to extract product names, SKU numbers etc. from various documents.
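For reference, the regex approach described above might look roughly like this in Python. It is a sketch under assumptions: the patterns, field labels, and sample text are invented for illustration, amounts are assumed to use "," as a thousands separator and "." as the decimal point, and real OCR output usually needs more tolerant patterns.

```python
import re
from decimal import Decimal

# Hypothetical patterns; adjust the labels to match your own documents.
INVOICE_NO = re.compile(
    r"\binvoice\s*(?:no\.?|number|#)\s*[:#]?\s*([A-Z0-9][A-Z0-9-]*)",
    re.IGNORECASE,
)
# \b keeps "Subtotal" from matching as "total".
TOTAL = re.compile(
    r"\btotal(?:\s+amount)?(?:\s+due)?\s*[:=]?\s*[$€£]?\s*"
    r"([0-9]+(?:[.,][0-9]+)*)",
    re.IGNORECASE,
)


def extract_invoice_fields(text):
    """Return the first invoice number and total found, or None for each."""
    number = INVOICE_NO.search(text)
    total = TOTAL.search(text)
    return {
        "invoice_number": number.group(1) if number else None,
        # Assumes US-style formatting: strip thousands separators.
        "total": Decimal(total.group(1).replace(",", "")) if total else None,
    }


# Made-up text standing in for OCR output.
sample = (
    "ACME Corp\n"
    "Invoice No: INV-2016-0042\n"
    "Date: 2016-05-01\n"
    "Subtotal: 100.00\n"
    "Total amount: $1,234.50\n"
)
fields = extract_invoice_fields(sample)
```

This works for semi-structured documents with stable labels; once layouts vary a lot between vendors, the number of patterns grows quickly, which is where the NER-style methods mentioned earlier in the thread come in.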