by dimitri-vs 332 days ago
Have you tried datalab-to/marker with the "Use LLM" option? They have a playground where you can test it out at https://www.datalab.to/playground, but I use their local CLI: https://github.com/datalab-to/marker

I just tried it on a fairly ugly TD Bank statement PDF I have and the markdown of the whole PDF (tables and all) is very accurate. Here is the config I use:

marker_single --format_lines --use_llm --llm_service marker.services.gemini.GoogleGeminiService --gemini_model_name gemini-2.5-flash --disable_image_extraction --output_format markdown --output_dir "$OutDir" "$In"

You might be able to tell the LLM to output the data directly in CSV format (granted, it will still be in a .md file) using the `--block_correction_prompt` option, which apparently is "useful for custom formatting or logic that you want to apply to the output".
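If you'd rather script this than type it out each time, here is a rough Python sketch that assembles the same command. The flags are the ones from my config above plus `--block_correction_prompt`. The helper names (`build_marker_command`, `run_marker`), the file paths, and the prompt wording are made up for illustration, and I haven't verified that this prompt actually produces clean CSV.

```python
import shlex
import subprocess
from pathlib import Path


def build_marker_command(pdf_path, out_dir, correction_prompt=None):
    """Assemble the marker_single argument list (hypothetical helper).

    The flags mirror the config in the comment above; correction_prompt is
    passed via --block_correction_prompt only when provided.
    """
    cmd = [
        "marker_single",
        "--format_lines",
        "--use_llm",
        "--llm_service", "marker.services.gemini.GoogleGeminiService",
        "--gemini_model_name", "gemini-2.5-flash",
        "--disable_image_extraction",
        "--output_format", "markdown",
        "--output_dir", str(out_dir),
    ]
    if correction_prompt:
        cmd += ["--block_correction_prompt", correction_prompt]
    # The input PDF goes last, as in the original command.
    cmd.append(str(pdf_path))
    return cmd


def run_marker(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Assumed prompt text, not taken from the marker docs.
    prompt = "Render every table as CSV rows instead of a markdown table."
    cmd = build_marker_command(Path("statement.pdf"), Path("out"), prompt)
    # Print a shell-quoted version so it can be inspected before running.
    print(shlex.join(cmd))
    # run_marker(cmd)  # uncomment once marker is installed and configured
```

Passing the arguments as a list avoids shell-quoting problems when the prompt contains spaces or quotes.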

1 comment

> Nowadays there's probably a solution based on LLMs, but I don't trust them with this kind of data

If it works with a small model I can run locally, I might consider this approach; otherwise I'll skip it.