by ahljoh 3637 days ago
We would need more context about your specific objectives to give a precise answer. Some general areas and tools to look into:

- document conversion (pdftotext, Apache PDFBox, Tabula, etc.)

- OCR (tesseract, pypdfocr, etc.)

- named-entity recognition (NER), i.e. finding and classifying entities in text (DBpedia Spotlight, Stanford NER via NLTK, spaCy)

- coreference resolution, dependency parsing (spaCy, SyntaxNet)
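
To make the first two items concrete, here is a rough sketch of how the conversion step might feed the OCR step: try pdftotext first and only fall back to OCR when almost no text comes out. The helper names `pdf_to_text` and `needs_ocr` and the 20-character threshold are my own assumptions for illustration, not part of any of the tools listed above.

```python
import shutil
import subprocess
from pathlib import Path


def pdf_to_text(pdf_path):
    """Extract text from a PDF by calling the pdftotext command-line tool."""
    path = Path(pdf_path)
    if not path.is_file():
        raise FileNotFoundError(f"no such PDF: {path}")
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    # "-layout" keeps the physical layout (useful for tables and invoices),
    # "-enc UTF-8" fixes the output encoding, "-" writes to stdout.
    result = subprocess.run(
        ["pdftotext", "-layout", "-enc", "UTF-8", str(path), "-"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout


def needs_ocr(text, min_chars=20):
    """Heuristic (assumption): almost no extractable text suggests a scan."""
    return len(text.strip()) < min_chars
```

If `needs_ocr(pdf_to_text(path))` is true, the file is probably a scanned image and would go through something like Tesseract instead; the extracted text would then be handed to the NER step.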

1 comment

Thanks. Some great keywords to investigate. I'm mainly interested in two areas at the moment:

- invoices (I guess NER would partially be an option)

- web scraping (wrapper induction)
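
For anyone unfamiliar with wrapper induction, a minimal sketch of the simplest variant (a left/right delimiter wrapper for a single field) might look like the following. It assumes you have a few pages where the target value is already labeled. The function names and the example pages are made up for illustration, and real systems handle multiple fields and much messier markup.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_lr_wrapper(examples):
    """Learn (left, right) delimiters from (page, labeled_value) pairs.

    Only the first occurrence of each labeled value in its page is used.
    """
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)  # raises ValueError if the label is absent
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    left = common_suffix(lefts)            # text shared right before the value
    right = os.path.commonprefix(rights)   # text shared right after the value
    if not left or not right:
        raise ValueError("no shared delimiters around the labeled values")
    return left, right


def apply_wrapper(wrapper, page):
    """Return the text between the learned delimiters, or None if not found."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    if end == -1:
        return None
    return page[start:end]


if __name__ == "__main__":
    # Hypothetical product pages with the price labeled by hand.
    page_a = '<h1>Lamp</h1><span class="price">$10.00</span></body>'
    page_b = '<h1>Desk</h1><span class="price">$24.50</span></body>'
    wrapper = induce_lr_wrapper([(page_a, "$10.00"), (page_b, "$24.50")])
    new_page = '<h1>Chair</h1><span class="price">$99.99</span></body>'
    print(apply_wrapper(wrapper, new_page))
```

With only a couple of examples the learned delimiters tend to overfit (they pick up any text the examples happen to share), so more labeled pages generally give shorter, more robust delimiters.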