If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for? You can always just render an image of a document and then use that.
> If you're using an OCR engine to understand PDFs that are nothing but a scanned image embedded in a PDF... what do you need a PDF parser for?
This should be obvious, but the answer is that OCR engines are not terribly accurate. If you have a native PDF, you're far better off parsing the PDF than converting it to an image and OCRing that. But if OCR ever becomes perfect, then sure.
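To make the "parse first, OCR only when you have to" idea concrete, here's a minimal sketch. The function name `page_text`, the two callables, and the `min_chars` threshold are all hypothetical, made up for illustration; in practice `extract_native` might wrap something like pypdf's `page.extract_text()` and `run_ocr` might wrap `pytesseract.image_to_string()` on a rendered page image, but that wiring is an assumption and not shown here.

```python
from typing import Callable, Tuple


def page_text(
    extract_native: Callable[[], str],
    run_ocr: Callable[[], str],
    min_chars: int = 20,  # hypothetical threshold for "this page has a real text layer"
) -> Tuple[str, str]:
    """Return (text, source) for one page, preferring the embedded text layer.

    OCR is only invoked when the native text layer is missing or nearly empty,
    which is what you'd expect for a scanned image wrapped in a PDF.
    """
    native = extract_native().strip()
    if len(native) >= min_chars:
        return native, "native"
    # Fall back to the slower, less accurate path.
    return run_ocr(), "ocr"


if __name__ == "__main__":
    # Stand-ins for a real parser and a real OCR engine.
    text, source = page_text(
        extract_native=lambda: "Invoice 1234, total due on receipt of goods.",
        run_ocr=lambda: "lnvoice l234, tota1 due on receipt of goods.",
    )
    print(source, "->", text)
```

Passing the OCR step as a callable means it never runs for pages that already have usable text, so you only pay the accuracy and compute cost on pages that are actually scans.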
> The market SOTA Abbyy is far from being accurate.
While Abbyy is likely the best, it's also incredibly expensive: roughly on the order of $0.01/page, or maybe at best a tenth of that in high volume.
For comparison, I run a bunch of OCR servers using the open-source Tesseract library. The machine time on one of the major cloud providers works out to roughly $0.01 for 100-1000 pages.
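Putting those rough figures on a per-page basis, as a back-of-the-envelope sketch (the numbers are just the estimates from this thread, not vendor pricing, and the variable names are made up for illustration):

```python
def cost_per_page(total_cost: float, pages: float) -> float:
    """Cost of a single page given a total cost for a batch of pages."""
    return total_cost / pages


# Rough estimates quoted above, in USD per page.
abbyy_list = 0.01                 # ~$0.01/page
abbyy_volume = abbyy_list / 10    # "at best a tenth of that" in high volume

# Tesseract on cloud machines: ~$0.01 buys somewhere between 100 and 1000 pages.
tesseract_high = cost_per_page(0.01, 100)
tesseract_low = cost_per_page(0.01, 1000)

# Best case for Abbyy vs. worst case for Tesseract, and the reverse.
smallest_gap = abbyy_volume / tesseract_high
largest_gap = abbyy_list / tesseract_low

if __name__ == "__main__":
    print(f"Tesseract: ${tesseract_low:.5f} to ${tesseract_high:.5f} per page")
    print(f"Abbyy:     ${abbyy_volume:.5f} to ${abbyy_list:.5f} per page")
    print(f"Abbyy costs roughly {smallest_gap:,.0f}x to {largest_gap:,.0f}x more per page")
```

So even with the volume discount, the gap is about an order of magnitude, and at list price it could be closer to three.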