| I sort of agree... I do the same. We also parse millions of PDFs per month in all kinds of languages (both Western and Asian scripts). Getting the basics of PDF parsing to work is really not that complicated: a few months' work. And it is an order of magnitude more efficient than rendering an image at 300-600 DPI and running OCR or a visual LLM on it. But some of the challenges (which we have solved) are:
• Glyph-to-Unicode tables are often limited or incorrect
• "Boxing" blocks of text into "paragraphs" can be tricky
• Handling extra spaces and missing spaces between letters and words. Often PDFs do not include the spaces or they are incorrect so you need to identify gaps yourself.
• Often graphic designers of magazines/newspapers will hide text behind e.g. a simple white rectangle, and place a new version of the text above it. So you need to keep track of z-order and ignore hidden text.
• Ordinary text can be embedded as vector paths. We see this not just with logos but with regular text too, so you need a way to handle that.
• Drop caps and similar "artistic" choices can be a bit painful
There are a lot of other smaller issues, but they are generally edge cases. OCR handles some of these issues for you. But we found that OCR often misidentifies letters (this is true of all the major OCR engines), and it is certainly not perfect with spaces either. So if you are going for quality, you can get better results by parsing the PDFs directly.
Visual Transformers are not good at accurate coordinates/boxing yet. At least we haven't seen a good enough implementation, even though it is getting better. |
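To make the point about missing and extra spaces a bit more concrete, here is a minimal sketch of gap-based space detection on a single line of text. It is only an illustration under stated assumptions, not the approach described above: the `Glyph` class, the `join_line` function, and the `gap_ratio` threshold of 0.3 are all hypothetical, and it assumes the glyphs have already been grouped onto one horizontal baseline with known left and right edges.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Glyph:
    """Hypothetical glyph record: a character plus its horizontal extent."""
    char: str
    x0: float  # left edge, in page units
    x1: float  # right edge, in page units


def join_line(glyphs, gap_ratio=0.3):
    """Rebuild one line of text from positioned glyphs.

    Space glyphs present in the PDF are discarded, because they are
    often missing or wrong. A space is emitted wherever the horizontal
    gap between neighbours exceeds gap_ratio * median glyph width.
    The 0.3 default is an illustrative guess, not a tuned value.
    """
    # Ignore explicit whitespace and order the rest left to right.
    ordered = sorted(
        (g for g in glyphs if not g.char.isspace()),
        key=lambda g: g.x0,
    )
    if not ordered:
        return ""

    typical_width = median(g.x1 - g.x0 for g in ordered)
    threshold = gap_ratio * typical_width

    out = [ordered[0].char]
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.x0 - prev.x1
        if gap > threshold:
            out.append(" ")  # gap is wide enough to be a word break
        out.append(cur.char)
    return "".join(out)


sample = [
    Glyph("H", 0.0, 6.0),
    Glyph(" ", 6.0, 6.5),   # spurious space embedded in the PDF
    Glyph("i", 6.5, 8.5),
    Glyph("y", 12.0, 18.0),  # wide gap, but no space glyph in the PDF
    Glyph("o", 18.5, 24.5),
]
print(join_line(sample))  # prints "Hi yo"
```

A real parser would also have to deal with kerning, justified text, rotated text, and right-to-left scripts, where a single fixed ratio like this breaks down.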