Hacker News
by
nl
744 days ago
Does the llamaindex PDF indexer correctly deal with multi-column PDFs? Most I've seen don't, and you get very odd results because of this.
2 comments
rspoerri
744 days ago
I've made quite good conversions from PDF to Markdown with https://github.com/VikParuchuri/marker. It's slow but worth a shot. Markdown should be easy for a RAG pipeline to parse.
I'm trying to get a similar system set up on my computer.
nl
744 days ago
This looks worth exploring, so thanks. The author has done a bunch of work beyond what PyMuPDF does on multicolumn layouts.
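To illustrate the multi-column problem discussed above, here is a minimal sketch in plain Python. It is not how marker, PyMuPDF, or llamaindex work internally. The `Block` tuples, the `naive_order` and `column_order` helpers, and the "exactly two columns split at the page midline" rule are all assumptions made for illustration; real layouts need proper column detection.

```python
from typing import List, Tuple

# Hypothetical text block: (x0, y0, x1, y1, text), loosely modeled on the
# bounding boxes that PDF extraction libraries report. The origin is the
# top-left corner of the page and y grows downward.
Block = Tuple[float, float, float, float, str]


def naive_order(blocks: List[Block]) -> List[str]:
    """Sort top-to-bottom, then left-to-right. This interleaves columns."""
    return [b[4] for b in sorted(blocks, key=lambda b: (b[1], b[0]))]


def column_order(blocks: List[Block], page_width: float) -> List[str]:
    """Read the left column fully, then the right column.

    Assumes exactly two columns separated at the page midline, which is a
    simplification chosen for this sketch.
    """
    mid = page_width / 2

    def key(b: Block):
        center_x = (b[0] + b[2]) / 2
        column = 0 if center_x < mid else 1
        return (column, b[1], b[0])

    return [b[4] for b in sorted(blocks, key=key)]


# Two columns with two blocks each on a 600-unit-wide page.
blocks = [
    (50, 100, 280, 120, "L1"),
    (320, 100, 550, 120, "R1"),
    (50, 130, 280, 150, "L2"),
    (320, 130, 550, 150, "R2"),
]

print(naive_order(blocks))                   # ['L1', 'R1', 'L2', 'R2']
print(column_order(blocks, page_width=600))  # ['L1', 'L2', 'R1', 'R2']
```

The naive ordering stitches lines from both columns together, which is the kind of odd result the original question describes. Grouping by column first restores the reading order.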
pierre
744 days ago
Locally you can choose pypdf or MuPDF, which are good but not perfect. If you can send your data online, LlamaParse is quite good.
j45
744 days ago
Pulling the text out of the PDFs correctly, as an independent step, is the right approach.