by nl 744 days ago
Does the LlamaIndex PDF indexer deal correctly with multi-column PDFs? Most I've seen don't, and you get very odd results because of this.
2 comments

I've made quite good conversions from PDF to Markdown with marker (https://github.com/VikParuchuri/marker). It's slow but worth a shot. Markdown should be easy for a RAG pipeline to parse.
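
To illustrate why Markdown is convenient for RAG, here is a minimal sketch that splits a converted document into one chunk per heading, using only the standard library. The function name `split_markdown_by_heading` and the chunk shape are made up for this example and are not part of marker or LlamaIndex. It also assumes no `#` lines appear inside fenced code blocks.

```python
import re

def split_markdown_by_heading(markdown: str) -> list[dict]:
    """Split Markdown into chunks, one per heading section (hypothetical helper)."""
    chunks = []
    title, lines = None, []

    def flush():
        # Keep a section if it has a heading or any non-blank text.
        if title is not None or any(l.strip() for l in lines):
            chunks.append({"title": title, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            flush()  # a new heading closes the previous section
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()  # close the final section
    return chunks
```

Text before the first heading ends up in a chunk whose title is `None`. Each chunk could then be embedded separately, with the title kept as metadata.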

I'm trying to get a similar system set up on my computer.

This looks worth exploring, so thanks. The author has done a bunch of work beyond what PyMuPDF does on multicolumn layouts.
Locally you can choose pypdf or mupdf, which are good but not perfect. If you can send your data online, LlamaParse is quite good.
Pulling the text out of the PDFs correctly, as its own independent step, is the right approach.
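
For anyone wondering why multi-column PDFs produce odd results: a naive extractor reads blocks top to bottom across the whole page, so lines from the left and right columns get interleaved. Below is a rough sketch of a column-aware ordering, assuming your extractor gives you text blocks with coordinates. The `Block` type, `reading_order` function, and `column_gap` threshold are all hypothetical; this is not what marker or PyMuPDF actually do internally.

```python
from typing import NamedTuple

class Block(NamedTuple):
    x0: float   # left edge of the text block
    y0: float   # top edge (smaller means higher on the page)
    text: str

def reading_order(blocks, column_gap=50.0):
    """Group blocks into columns by left edge, then read each column top to bottom."""
    columns = []  # each column is a list of blocks with nearby x0 values
    for block in sorted(blocks, key=lambda b: b.x0):
        # Join the current column if this block starts close to its last member.
        if columns and block.x0 - columns[-1][-1].x0 <= column_gap:
            columns[-1].append(block)
        else:
            columns.append([block])
    ordered = []
    for column in columns:  # columns are already left to right
        ordered.extend(sorted(column, key=lambda b: b.y0))
    return [b.text for b in ordered]
```

This breaks on full-width headings, figures that span columns, and indented blocks wider than the gap, which is presumably part of why the dedicated tools do more work than this.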