


	by varunneal 406 days ago
	I've been hacking away at converting PDFs into Markdown and have run into the same obstacles as OP with header detection (and many other issues). OCR is fantastic these days, but maintaining a global structure across the document is much trickier. Consistent HTML still seems out of reach for large documents. I'm getting half-decent results with Markdown by running multiple LLM passes to extract the document structure, then feeding that structure back in as context for page-by-page extraction.
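A rough sketch of what a multi-pass approach like this could look like, not the poster's actual code. It assumes the PDF has already been OCR'd into one text string per page, and that `llm` is a hypothetical callable (prompt in, completion out) wrapping whatever model you use. The prompts and function names are illustrative only.

```python
from typing import Callable, List

# Hypothetical interface: takes a prompt, returns the model's text reply.
LLM = Callable[[str], str]


def extract_outline(pages: List[str], llm: LLM) -> str:
    """Pass 1: walk the pages in order, keeping a running outline of headings."""
    outline = ""
    for number, text in enumerate(pages, start=1):
        prompt = (
            f"Outline so far:\n{outline}\n\n"
            f"Page {number} text:\n{text}\n\n"
            "Return the updated outline, one heading per line, using # levels."
        )
        outline = llm(prompt).strip()
    return outline


def pages_to_markdown(pages: List[str], llm: LLM) -> str:
    """Pass 2: convert each page, giving the model the global outline as context."""
    outline = extract_outline(pages, llm)
    chunks = []
    for number, text in enumerate(pages, start=1):
        prompt = (
            f"Document outline:\n{outline}\n\n"
            f"Convert page {number} to Markdown. Use heading levels that "
            f"match the outline.\n\nPage text:\n{text}"
        )
        chunks.append(llm(prompt).strip())
    # Join the per-page Markdown with blank lines between pages.
    return "\n\n".join(chunks)
```

The point of the first pass is that heading levels get decided once, globally, so a page in the middle of the document does not have to guess whether a bold line is an H2 or an H3.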


dstryr 406 days ago

Give this project a try. I've been using it with promising results.

https://github.com/matthsena/AlcheMark


aorth 406 days ago

I tried with one PDF and was surprised to see it connect to some cloud service:

  2025-05-14 07:58:49,373 - urllib3.connectionpool - DEBUG - Starting new HTTPS connection (1): openaipublic.blob.core.windows.net:443
  2025-05-14 07:58:50,446 - urllib3.connectionpool - DEBUG - https://openaipublic.blob.core.windows.net:443 "GET /encodings/o200k_base.tiktoken HTTP/1.1" 200 361 3922

The project's README doesn't mention that anywhere...


degamad 405 days ago

The project's README mentions that it uses tiktoken[0], which is a separate project created by OpenAI.

tiktoken downloads token models the first time you use them, but it does not mention that. It does cache the models, so you shouldn't see more of those connections, if I'm understanding the code correctly.

[0] <https://github.com/openai/tiktoken>
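
If you want to check whether the file is already cached, something like the following might help. It is a sketch based on my reading of tiktoken's loader, so treat the details as assumptions that may change between versions: the cache directory comes from `TIKTOKEN_CACHE_DIR`, then `DATA_GYM_CACHE_DIR`, then a `data-gym-cache` folder under the system temp directory, and the file name is the SHA-1 of the download URL. The URL is the one from the log above, and `tiktoken_cache_path` is a made-up helper name, not part of tiktoken.

```python
import hashlib
import os
import tempfile

# URL taken from the debug log earlier in the thread.
O200K_URL = "https://openaipublic.blob.core.windows.net/encodings/o200k_base.tiktoken"


def tiktoken_cache_path(blob_url: str = O200K_URL) -> str:
    """Guess where tiktoken would cache blob_url (assumed layout, see above)."""
    if "TIKTOKEN_CACHE_DIR" in os.environ:
        cache_dir = os.environ["TIKTOKEN_CACHE_DIR"]
    elif "DATA_GYM_CACHE_DIR" in os.environ:
        cache_dir = os.environ["DATA_GYM_CACHE_DIR"]
    else:
        cache_dir = os.path.join(tempfile.gettempdir(), "data-gym-cache")
    # Assumption: the cache file is named after the SHA-1 hex digest of the URL.
    cache_key = hashlib.sha1(blob_url.encode()).hexdigest()
    return os.path.join(cache_dir, cache_key)


def is_cached(blob_url: str = O200K_URL) -> bool:
    """True if a file exists at the guessed cache location."""
    return os.path.exists(tiktoken_cache_path(blob_url))
```

If `is_cached()` returns `True` and you still see the HTTPS connection in the logs, then my reading of the caching code is off. Pointing `TIKTOKEN_CACHE_DIR` at a persistent directory should also keep the file around across reboots, since the default lives under the temp directory.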


varunneal 405 days ago

I'll check it out!
