|
|
|
|
|
by varunneal
406 days ago
|
|
I've been hacking away at trying to process PDFs into Markdown, having encountered similar obstacles to OP regarding header detection (and many other issues). OCR is fantastic these days but maintaining a global structure to the document is much trickier. Consistent HTML seems still out of reach for large documents. I'm having half-decent results with Markdown using multiple passes of an LLM to extract document structure and feeding it in contextually for page-by-pass extraction. |
|
https://github.com/matthsena/AlcheMark