| HN Mirror



	by abc03 3639 days ago
	Does anyone know which machine learning methods are state of the art for data extraction in general (invoices, images, documents, etc.), or where I could find an overview?

4 comments

Buttons840 3639 days ago

Research papers often use the phrase "wrapper induction" when discussing automatic data extraction. Once I discovered that phrase I found several papers. I'm on mobile so can't link any.
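
To make the term concrete, here is a minimal sketch of the idea behind wrapper induction, assuming the simplest setting: a few example pages with the target value labeled by hand. It learns the text that always precedes and always follows the target, in the spirit of left/right-delimiter wrappers, and reuses those delimiters on unseen pages. The function names, the sample HTML, and the choice to use only the first occurrence of each target are all illustrative assumptions, not taken from any particular paper or library.

```python
import os
import re

def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]

def induce_wrapper(examples):
    """Learn (left, right) delimiters from (page, target) pairs.

    Assumption: each target occurs in its page, and the first
    occurrence is the labeled one.
    """
    lefts, rights = [], []
    for page, target in examples:
        start = page.index(target)
        lefts.append(page[:start])
        rights.append(page[start + len(target):])
    left = common_suffix(lefts)            # text shared right before every target
    right = os.path.commonprefix(rights)   # text shared right after every target
    if not left or not right:
        raise ValueError("no shared delimiters found")
    return left, right

def apply_wrapper(wrapper, page):
    """Extract every value enclosed by the learned delimiters."""
    left, right = wrapper
    pattern = re.escape(left) + r"(.*?)" + re.escape(right)
    return re.findall(pattern, page, flags=re.S)

# Hypothetical labeled examples: (page, value to extract).
EXAMPLES = [
    ('<li><b>Widget</b> <span class="price">19.90</span> in stock</li>', "19.90"),
    ('<li><b>Valve</b> <span class="price">5.25</span> sold out</li>', "5.25"),
]

# Hypothetical unseen page using the same template.
NEW_PAGE = (
    '<li><b>Hose</b> <span class="price">7.00</span> in stock</li>'
    '<li><b>Clamp</b> <span class="price">1.15</span> sold out</li>'
)

wrapper = induce_wrapper(EXAMPLES)
prices = apply_wrapper(wrapper, NEW_PAGE)
```

Exact string delimiters like these break as soon as the template changes slightly, which is the kind of brittleness the published wrapper-induction methods try to address.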


rlndmx 3639 days ago

Here is a 2012 survey: http://arxiv.org/abs/1207.0246


ahljoh 3638 days ago

We would need more context about your specific objectives. Depending on the task, relevant areas and tools include:

- document conversion (pdftotext, Apache PDFBox, Tabula, etc.)

- OCR (Tesseract, pypdfocr, etc.)

- named-entity recognition (NER), i.e. finding and classifying entities in text (DBpedia Spotlight, Stanford NER via NLTK, spaCy)

- coreference resolution, dependency parsing (spaCy, SyntaxNet)


abc03 3638 days ago

Thanks, those are some great keywords to investigate. I'm mainly interested in two areas at the moment:

- invoices (I guess NER would be part of the answer)

- web scraping (wrapper induction)


kmike84 3639 days ago

There are many methods for different tasks. What do you mean by 'data extraction', do you have some specific examples?


abc03 3639 days ago

After running OCR on the documents, I have mostly used regex to extract information from semi-structured documents. One example would be invoices (invoice number, total amount, etc.); another would be extracting product names, SKU numbers, etc. from various documents.
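
For illustration, the regex approach described above might look roughly like the sketch below. The OCR text, the field labels ("Invoice No", "Total Amount", "SKU"), and the assumption that commas are thousands separators are all made up for the example; real invoices vary far more than this.

```python
import re
from decimal import Decimal

# Hypothetical OCR output; labels and layout are assumptions.
OCR_TEXT = """\
ACME Supplies Ltd.
Invoice No: INV-2016-00423
Date: 2016-05-12
SKU: AB-1029   Widget, large     2 x 500.00
SKU: ZX-77     Gasket set        1 x 45.05
Total Amount: EUR 1,045.05
"""

# One pattern per single-valued field. "Number" is listed before "No"
# so the longer label wins when both could match.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*[:.]?\s*([A-Z0-9-]+)", re.I
    ),
    "total": re.compile(
        r"Total(?:\s+Amount)?\s*[:.]?\s*(?:[A-Z]{3}\s*)?([\d.,]+)", re.I
    ),
}

# SKUs can appear many times, so they get findall instead of search.
SKU_PATTERN = re.compile(r"SKU\s*[:.]?\s*([A-Z0-9-]+)", re.I)

def extract_fields(text):
    """Return a dict of extracted fields; missing fields are None."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    if fields["total"] is not None:
        # Assumption: "," is a thousands separator and "." the decimal point.
        fields["total"] = Decimal(fields["total"].replace(",", ""))
    fields["skus"] = SKU_PATTERN.findall(text)
    return fields

result = extract_fields(OCR_TEXT)
```

Each new invoice layout tends to need its own patterns, which is why this approach is hard to maintain across many document sources.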
