	by tastyminerals 2339 days ago
	Yes, most invoices are in PDF, but only about 40% of them are native PDFs, meaning they are actual documents rather than scanned images converted to PDF. There are also compound PDF invoices which contain images. So, in order to extract data from them, one needs not only a good PDF parser but an OCR engine too.
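
The three kinds of invoice PDF described above can be told apart with a rough heuristic. The sketch below is only an illustration: `PageInfo`, `PdfKind`, and `classify` are hypothetical names, and the per-page counts are assumed to come from whatever PDF parser is in use. Real files are messier; for example, a scanned PDF may already carry an OCR text layer.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class PdfKind(Enum):
    NATIVE = "native"      # text layer only
    SCANNED = "scanned"    # page images only, no text layer
    COMPOUND = "compound"  # text layer plus embedded images


@dataclass
class PageInfo:
    # Hypothetical per-page summary, filled in by a PDF parser of your choice.
    text_chars: int   # characters found in the page's text layer
    image_count: int  # raster images embedded in the page


def classify(pages: List[PageInfo]) -> PdfKind:
    has_text = any(p.text_chars > 0 for p in pages)
    has_images = any(p.image_count > 0 for p in pages)
    if has_text and has_images:
        return PdfKind.COMPOUND
    if has_text:
        return PdfKind.NATIVE
    # No text layer at all: assume the content has to come from OCR.
    return PdfKind.SCANNED
```

Only the native case can skip OCR entirely; the other two need both tools.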

3 comments

dreamcompiler 2338 days ago

This is a huge pet peeve of mine. Most invoices are generated on a computer (often in Word) but a huge fraction of the people who generate them don't know how to export to a PDF. So they print the invoice on paper, scan it back in to a PDF, and email that to you. Thus the proliferation of bitmap PDFs.

speedplane 2339 days ago

> So, in order to extract data from them, one needs not only a good PDF parser but an OCR engine too.

You can go further. Invoices often contain block sections of text with important terms of the invoice, such as shipping time information, insurance, warranties, etc. To build something that works universally, you also need very good natural language processing.

thaumasiotes 2339 days ago

If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for? You can always just render an image of a document and then use that.

speedplane 2339 days ago

> If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for?

This should be obvious, but the answer is that OCR engines are not terribly accurate. If you have a native PDF, you're far better off parsing the PDF than converting it to an image and OCRing that. But if OCR ever becomes perfect, then sure.
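
A minimal sketch of that "parse first, OCR only as a fallback" routing, per page. Everything here is an assumption for illustration: `extract_page_text` is a hypothetical helper, the two callables stand in for a real PDF parser and a real OCR engine, and the `min_chars` threshold is an arbitrary guess at what counts as a usable text layer.

```python
from typing import Callable, Tuple


def extract_page_text(
    parse_text_layer: Callable[[], str],
    run_ocr: Callable[[], str],
    min_chars: int = 20,
) -> Tuple[str, str]:
    """Return (text, source), where source is "parser" or "ocr"."""
    # Try the text layer first: it is exact when present and cheap to read.
    native = parse_text_layer().strip()
    if len(native) >= min_chars:
        return native, "parser"
    # Little or no embedded text: the page is probably a scan, so pay for OCR.
    return run_ocr().strip(), "ocr"
```

Passing the extractors in as callables means OCR is never run on pages that already have a text layer, which is where the accuracy and speed gains come from.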

tastyminerals 2339 days ago

For accuracy and speed. The market SOTA Abbyy is far from being accurate.

speedplane 2339 days ago

> The market SOTA Abbyy is far from being accurate.

While Abbyy is likely the best, it's also incredibly expensive. Roughly on the order of $0.01/page or maybe at best a tenth of that in high volume.

For comparison, I run a bunch of OCR servers using the open source tesseract library. The machine-time on one of the major cloud providers works out to roughly $0.01 for 100-1000 pages.

bhanhfo 2338 days ago

OCR.space charges only $10 for 100,000 conversions. The quality is good, but not as good as Abbyy.
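
Putting the prices quoted in this subthread side by side, as a rough sketch. The figures are only the approximate ones given by the commenters above, not current list prices, and the labels and `cost_for` helper are made up for the example.

```python
# Approximate USD per page, taken from the comments above.
PRICE_PER_PAGE_USD = {
    "Abbyy": 0.01,
    "Abbyy (high volume, best case)": 0.01 / 10,
    "tesseract (100 pages per $0.01)": 0.01 / 100,
    "tesseract (1000 pages per $0.01)": 0.01 / 1000,
    "OCR.space ($10 per 100,000)": 10 / 100_000,
}


def cost_for(pages):
    """Estimated USD cost of OCRing `pages` pages with each option."""
    return {name: price * pages for name, price in PRICE_PER_PAGE_USD.items()}


for name, usd in cost_for(1_000_000).items():
    print(f"{name:34s} ${usd:,.0f}")
```

For a million pages that works out to roughly $10,000 or $1,000 for Abbyy, $100 down to $10 of machine time for tesseract, and $100 for OCR.space, so the spread is two to three orders of magnitude.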

minerals29 2339 days ago

It is the best and this is one of the reasons why PDF extraction is hard :)