| HN Mirror


	by Oras 565 days ago
	One of the challenges I have with RAG is excluding tables of contents, headers/footers, and appendices from PDFs. Is there a tool or technique to achieve this? I’m aware that I can use LLMs to do so, or read all pages and find identical text (headers/footers), but I want to keep the page number as part of the metadata to ensure better citations on retrieval.

3 comments

prsdm 565 days ago

This might help you: https://github.com/langchain-ai/langchain/blob/master/cookbo...

Oras 565 days ago

Thank you. This is a mix of OCR and an LLM; I was wondering whether there is a library that avoids both.

A better approach would be Textract, since it maintains the flow of the document, for example when a table spans multiple pages.

Btw, Tesseract is not that good at extracting accurate data from tables. Use it with caution, especially in a financial context.

I made an open-source tool that shows the data Tesseract and EasyOCR miss: https://github.com/orasik/parsevision/

prsdm 565 days ago

Nice, I really liked it!

jonathan-adly 565 days ago

I would check out vision models as a technique to get around OCR errors.

ColPali is the standard implementation and SOTA, and much better than OCR. We maintain a ready-to-go retrieval API that implements this: https://github.com/tjmlabs/ColiVara

throwup238 565 days ago

You’ll need other heuristics for ToCs and indices, but headers/footers are easy to detect via n-gram deduplication. You’ll want to figure out some rolling logic to handle chapter changes, though.
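A rough sketch of that idea, not anyone's actual implementation: it assumes each page has already been extracted as a list of text lines, and it simplifies n-gram deduplication to matching whole normalized lines near the page edges. The function name `strip_repeated_edges` and the parameters `k`, `window`, and `min_hits` are made up for illustration. The rolling window means a running header only has to repeat on nearby pages, so it can change between chapters. Page numbers are kept as metadata.

```python
import re
from typing import Dict, List


def normalize(line: str) -> str:
    """Lowercase, collapse whitespace, and mask digits so 'Page 3' matches 'Page 4'."""
    collapsed = re.sub(r"\s+", " ", line.strip().lower())
    return re.sub(r"\d+", "#", collapsed)


def edge_lines(page: List[str], k: int) -> set:
    """Normalized forms of the non-empty lines among the first and last k lines."""
    return {normalize(l) for l in page[:k] + page[-k:] if l.strip()}


def strip_repeated_edges(
    pages: List[List[str]], k: int = 2, window: int = 2, min_hits: int = 2
) -> List[Dict]:
    """Drop edge lines that repeat on at least min_hits pages within the window.

    Hypothetical helper: the thresholds are guesses and need tuning per corpus.
    """
    edges = [edge_lines(p, k) for p in pages]
    cleaned = []
    for i, page in enumerate(pages):
        lo, hi = max(0, i - window), min(len(pages), i + window + 1)
        neighbours = [edges[j] for j in range(lo, hi) if j != i]
        kept = []
        for pos, line in enumerate(page):
            at_edge = pos < k or pos >= len(page) - k
            hits = sum(normalize(line) in e for e in neighbours)
            if at_edge and line.strip() and hits >= min_hits:
                continue  # looks like a running header/footer
            kept.append(line)
        # 1-based page number travels with the text for citations
        cleaned.append({"page": i + 1, "text": "\n".join(kept)})
    return cleaned
```

Masking digits is what lets "Page 1", "Page 2", ... count as the same footer, but it will also remove body lines near the edge that differ only by a number, so treat it as a starting point.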

ellisv 565 days ago

Headers/footers are also positional.
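To illustrate the positional angle, here is a minimal sketch that assumes your PDF extractor gives you text blocks with vertical coordinates measured from the top of the page. The `Block` type, `drop_margin_blocks`, and the 8% margin are all hypothetical and not taken from any particular library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    page: int      # 1-based page number, kept for citation metadata
    top: float     # distance of the block's top edge from the top of the page
    bottom: float  # distance of the block's bottom edge from the top of the page
    text: str


def drop_margin_blocks(
    blocks: List[Block], page_height: float, margin: float = 0.08
) -> List[Block]:
    """Keep only blocks that lie fully inside the body area of the page.

    Assumes a top-left origin; the margin fraction is a guess to tune per layout.
    """
    upper = page_height * margin
    lower = page_height * (1 - margin)
    return [b for b in blocks if b.top >= upper and b.bottom <= lower]
```

Combining this with repeated-text detection is probably safer than either alone, since a fixed margin can clip body text on pages with unusual layouts.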
