Hacker News
by Oras 565 days ago
One of the challenges I have with RAG is excluding tables of contents, headers/footers, and appendices from PDFs.

Is there a tool/technique to achieve this? I’m aware that I can use LLMs to do so, or read all pages and find identical text (header/footer), but I want to keep the page number as part of the metadata to ensure better citation on retrieval.

3 comments

Thank you. This is a mix of OCR and LLM; I was wondering if there might be a library that avoids using that.

A better approach would be to use Textract, as it maintains the flow, for example when a table spans multiple pages.

Btw, Tesseract is not that good at getting accurate data from tables. Use it with caution, especially in a financial context.

I have made an open-source tool to show data missed by Tesseract and EasyOCR: https://github.com/orasik/parsevision/

Nice, I really liked it!
I would check out vision models as a technique to get around OCR errors.

ColPali is the standard implementation & SOTA, and much better than OCR. We maintain a ready-to-go retrieval API that implements this: https://github.com/tjmlabs/ColiVara

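You’ll need other heuristics for ToC and indices, but headers/footers are easy to detect via n-gram deduplication. You’ll want to figure out some rolling logic to handle chapter changes, though, since the running header usually changes with the chapter.
Headers/footers are also positional: they sit at the top and bottom of each page, so position helps confirm a match.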
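
For what it's worth, here is a rough sketch of how that could look in plain Python. It is only an illustration under some assumptions: text is already extracted as one list of lines per page, in reading order; it compares whole normalized lines rather than true n-grams; and the function name and the `edge_lines`, `window`, and `min_count` parameters are made up for this example, not taken from any library. Only the first and last lines of each page are checked (the positional part), repeats are counted within a rolling window of neighbouring pages (so a chapter change does not hide the new header), and the page number stays attached to each page for citation metadata.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Collapse whitespace and digits so 'Page 3' and 'Page 4' compare equal."""
    return re.sub(r"\d+", "#", " ".join(line.split())).lower()


def edge_indices(n_lines: int, edge_lines: int) -> set:
    """Indices of the first and last `edge_lines` lines on a page."""
    top = range(min(edge_lines, n_lines))
    bottom = range(max(n_lines - edge_lines, 0), n_lines)
    return set(top) | set(bottom)


def strip_headers_footers(pages, edge_lines=2, window=5, min_count=3):
    """Drop edge lines that repeat across nearby pages.

    pages: list of pages, each a list of text lines in reading order.
    Returns a list of {"page": n, "lines": [...]} with 1-based page numbers.
    """
    cleaned = []
    for i, lines in enumerate(pages):
        # Rolling window of up to `window` pages around page i.
        lo = max(0, i - window // 2)
        hi = min(len(pages), lo + window)
        lo = max(0, hi - window)
        neighbours = pages[lo:hi]

        # Count on how many neighbouring pages each edge line appears.
        counts = Counter()
        for other in neighbours:
            idx = edge_indices(len(other), edge_lines)
            counts.update({normalize(other[j]) for j in idx})

        # Require at least two pages, fewer than min_count only for short docs.
        threshold = max(2, min(min_count, len(neighbours)))
        edges = edge_indices(len(lines), edge_lines)
        kept = []
        for j, line in enumerate(lines):
            key = normalize(line)
            is_repeat = j in edges and key and counts[key] >= threshold
            if not is_repeat:
                kept.append(line)
        cleaned.append({"page": i + 1, "lines": kept})
    return cleaned
```

Each returned entry keeps its `page` value, so chunks built from `lines` can carry the page number into the retrieval metadata. Tables of contents and appendices would still need separate heuristics, as noted above.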