
	by thaumasiotes 2339 days ago
	If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for? You can always just render an image of a document and then use that.

2 comments

speedplane 2339 days ago

> If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for?

This should be obvious, but the answer is that OCR engines are not terribly accurate. If you have a native PDF, you're far better off parsing the PDF than converting it to an image and OCRing it. But if OCR ever becomes perfect, then sure.
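
To make the "parse first, OCR only when you have to" idea concrete, here is a rough sketch. The function name `page_text`, the `min_chars` threshold, and the idea of passing the OCR step in as a callable are all assumptions for illustration; no particular PDF or OCR library is implied.

```python
from typing import Callable, Optional, Tuple


def page_text(
    native_text: Optional[str],
    run_ocr: Callable[[], str],
    min_chars: int = 20,  # hypothetical threshold for "this page has a real text layer"
) -> Tuple[str, str]:
    """Return (text, source) for one page.

    native_text is whatever a PDF parser extracted from the page's text
    layer (None or empty for a scanned image). run_ocr is only called
    when the text layer looks missing, since OCR is slower and lossier.
    """
    if native_text is not None and len(native_text.strip()) >= min_chars:
        return native_text, "parser"
    return run_ocr(), "ocr"
```

For a native PDF page this never touches the OCR engine; for a scanned page with an empty text layer it falls through to `run_ocr()`.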

tastyminerals 2339 days ago

For accuracy and speed. The market SOTA Abbyy is far from being accurate.

speedplane 2339 days ago

> The market SOTA Abbyy is far from being accurate.

While Abbyy is likely the best, it's also incredibly expensive: roughly on the order of $0.01/page, or at best maybe a tenth of that at high volume.

For comparison, I run a bunch of OCR servers using the open-source Tesseract library. The machine time on one of the major cloud providers works out to roughly $0.01 per 100-1000 pages.

bhanhfo 2338 days ago

OCR.space charges only $10 for 100,000 conversions. The quality is good, but not as good as Abbyy.
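
For a rough side-by-side, the per-page figures quoted in this thread can be plugged into a few lines of arithmetic. The numbers below are only the ones mentioned in the comments (not verified current pricing), and treating one OCR.space "conversion" as one page is an assumption.

```python
PAGES = 100_000  # example job size, chosen only for illustration

# USD per page, taken from the figures quoted in this thread.
PRICE_PER_PAGE = {
    "Abbyy": 0.01,
    "Abbyy, high volume (best case)": 0.01 / 10,
    "Tesseract on cloud VMs (1000 pages per $0.01)": 0.01 / 1000,
    "Tesseract on cloud VMs (100 pages per $0.01)": 0.01 / 100,
    "OCR.space (assuming 1 conversion = 1 page)": 10 / 100_000,
}


def job_cost(price_per_page: float, pages: int = PAGES) -> float:
    """Total cost in USD for a job of `pages` pages."""
    return price_per_page * pages


if __name__ == "__main__":
    for name, price in PRICE_PER_PAGE.items():
        print(f"{name}: ${job_cost(price):,.2f} for {PAGES:,} pages")
```

On these numbers a 100,000-page job comes to about $1,000 on Abbyy ($100 at the best-case volume rate), $1 to $10 of machine time for Tesseract, and $10 on OCR.space.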

minerals29 2339 days ago

It is the best and this is one of the reasons why PDF extraction is hard :)
