|
Disclaimer - Founder of Tensorlake, we built a Document Parsing API for developers. This is exactly the reason why Computer Vision approaches for parsing PDFs works so well in the real world. Relying on metadata in files just doesn't scale across different source of PDFs. We convert PDFs to images, run a layout understanding model on them first, and then apply specialized models like text recognition and table recognition models on them, stitch them back together to get acceptable results for domains where accuracy is table stakes. |
My understanding is that PDFs are intended to produce an output that is consumed by humans and not by computers, the format seems to be focused on how to display some data so that a human can (hopefully) easily read them. Here it seems that we are using a technique that mimics the human approach, which would seem to make sense.
It is sad though that in 30+ years we didn't manage to add a consistent way to include a way to make a PDF readable by a machine. I wonder what incentives were missing that didn't make this possible. Does anyone maybe have some insight here?