| I am tired of postprocessing OCR. I have used many OCR solutions: Tesseract (4 and 5), EasyOCR, TrOCR (not document-level), DocTR and PaddleOCR (self-hostable on GPUs), and lastly Textract (the best). Some are just about fast enough to be useful in production for long documents, but all have one thing in common:
- You need to postprocess so much! Why, in this day and age, do they all output bare lines or words of text and leave it entirely to you to work out which text goes in which column, or which bullet point starts a new sentence? I know solutions like GROBID handle this for papers by processing columns etc. correctly, but for general documents it seems so unsolved. Are there good, maintained solutions to this? On a team I am on, we spent a long time on an internal solution, which works well, and the performance difference between raw processing and proper processing (formatting text and other improvements) has been night and day. So why don't providers or producers add steps to tidy up generic formats? PS: I haven't found GPT APIs to be great for this, because the location and size of text is often crucial for columns and subheaders. |
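To make the column problem concrete, here is a minimal sketch of the kind of postprocessing I mean. It is not our internal solution, just an illustration under some assumptions: each OCR word comes with a bounding box (top-left origin, y growing downward), columns are separated by an empty vertical strip wider than `min_gap`, and words on the same line have vertical centres within `y_tol`. The `Word` class, the function names and both thresholds are made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR word with its bounding box (hypothetical structure)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def split_columns(words, min_gap=20.0):
    """Group words into columns separated by empty vertical strips.

    Words are swept left to right; a new column starts whenever a word
    begins more than min_gap to the right of everything seen so far.
    """
    if not words:
        return []
    ordered = sorted(words, key=lambda w: w.x0)
    columns = [[ordered[0]]]
    right_edge = ordered[0].x1
    for w in ordered[1:]:
        if w.x0 - right_edge > min_gap:
            columns.append([w])
        else:
            columns[-1].append(w)
        right_edge = max(right_edge, w.x1)
    return columns


def group_lines(words, y_tol=5.0):
    """Cluster words into lines by vertical centre, top to bottom."""
    lines = []
    for w in sorted(words, key=lambda w: (w.y0 + w.y1) / 2):
        cy = (w.y0 + w.y1) / 2
        if lines and abs(cy - lines[-1]["cy"]) <= y_tol:
            lines[-1]["words"].append(w)
        else:
            lines.append({"cy": cy, "words": [w]})
    # Within a line, read left to right.
    return [sorted(line["words"], key=lambda w: w.x0) for line in lines]


def reading_order(words, min_gap=20.0, y_tol=5.0):
    """Return text lines column by column instead of straight across the page."""
    out = []
    for column in split_columns(words, min_gap):
        for line in group_lines(column, y_tol):
            out.append(" ".join(w.text for w in line))
    return out
```

Even this toy version shows why the geometry matters: without the boxes there is no way to tell that "Left one" and "Right one" sit on the same row of two different columns. It also breaks quickly on real pages (a full-width heading bridges the gap and merges the columns), which is exactly why I wish the tools shipped something robust.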
Some papers of relevance:
The first one, PubLayNet, is for publications. From the abstract: "...the PubLayNet dataset for document layout analysis by automatically matching the XML representations and the content of over 1 million PDF articles that are publicly available on PubMed Central. The size of the dataset is comparable to established computer vision datasets, containing over 360 thousand document images, where typical document layout elements are annotated".

The second, DocLayNet, is for documents. It contains 80K manually annotated pages from diverse data sources to represent a wide variability in layouts. For each PDF page, the layout annotations provide labelled bounding boxes with a choice of 11 distinct classes. DocLayNet also provides a subset of double- and triple-annotated pages to determine the inter-annotator agreement.