Hacker News new | ask | show | jobs
by tastyminerals 2292 days ago
Yes, most invoices are in PDF but only about 40% of them are native PDF meaning they are actual documents not scanned images converted to PDFs. There are are also compound PDF invoices which contain images. So, in order to extract data from them, one needs not only good PDF parser but an OCR engine too.
3 comments

This is a huge pet peeve of mine. Most invoices are generated on a computer (often in Word) but a huge fraction of the people who generate them don't know how to export to a PDF. So they print the invoice on paper, scan it back in to a PDF, and email that to you. Thus the proliferation of bitmap PDFs.
> So, in order to extract data from them, one needs not only good PDF parser but an OCR engine too.

You can go further. Invoices often contain block sections of text with important terms of the invoice, such as shipping time information, insurance, warranties, etc. To build something that works universally, you also need very good natural language processing.

If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for? You can always just render an image of a document and then use that.
> If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for?

This should be obvious, but the answer is because OCR engines are not terribly accurate. If you have a native PDF, you're far better off parsing the PDF then converting to an image and OCRing. But if OCR ever becomes perfect, then sure.

For accuracy and speed. The market SOTA Abbyy is far from being accurate.
> The market SOTA Abbyy is far from being accurate.

While Abbyy is likely the best, it's also incredibly expensive. Roughly on the order of $0.01/page or maybe at best a tenth of that in high volume.

For comparison, I run a bunch of OCR servers using the open source tesseract library. The machine-time on one of the major cloud providers works out to roughly $0.01 for 100-1000 pages.

OCR.space charges only $10 for 100,000 conversions. The quality is good, but not as good as Abbyy.
It is the best and this is one of the reasons why PDF extraction is hard :)