

by denysvitali 328 days ago

I would love to have something more generic (and tried to build it already), but parsing tables and bank statements even from digital PDFs (as in, those that really have tables and not a picture) is still very difficult. Especially when the bank changes layouts from one month to another.

I would love to be proven wrong, but everything I have tried so far is... subpar.

Nowadays there's probably a solution based on LLMs, but I don't trust them with this kind of data

3 comments

vdm 328 days ago

https://github.com/qyhou/curated-table-structure-recognition


dimitri-vs 328 days ago

Have you tried datalab-to/marker with the "Use LLM" option? They have a playground where you can test it out (https://www.datalab.to/playground), but I use their local CLI option: https://github.com/datalab-to/marker

I just tried it on a fairly ugly TD Bank statement PDF I have and the markdown of the whole PDF (tables and all) is very accurate. Here is the config I use:

marker_single --format_lines --use_llm --llm_service marker.services.gemini.GoogleGeminiService --gemini_model_name gemini-2.5-flash --disable_image_extraction --output_format markdown --output_dir "$OutDir" "$In"

You might be able to tell the LLM to output the data directly in CSV format (granted, it will still be in a .md file) using `--block_correction_prompt`, which apparently is "useful for custom formatting or logic that you want to apply to the output".
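
If the output stays as Markdown, the tables can also be converted to CSV after the fact. This is a minimal sketch, assuming the tables come out as ordinary pipe-delimited Markdown rows; the function names and the sample statement are made up for illustration and are not part of marker.

```python
import csv
import io


def markdown_tables_to_rows(markdown_text):
    """Collect cell values from pipe-delimited Markdown table lines.

    Assumption: every table line starts and ends with '|', and cells
    contain no escaped pipes. Real output may need more care.
    """
    rows = []
    for line in markdown_text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table line
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        # Skip the header separator row, e.g. |---|:---:|
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, quoting cells that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample, standing in for a converted statement.
sample = """\
# Statement

| Date | Description | Amount |
|------|-------------|--------|
| 01/02 | COFFEE SHOP | -4.50 |
| 01/03 | PAYROLL, ACME | 1,200.00 |
"""
csv_text = rows_to_csv(markdown_tables_to_rows(sample))
print(csv_text)
```

Doing the conversion in plain code keeps the LLM out of the formatting step, so a layout quirk shows up as a parse problem rather than a silently altered value.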


denysvitali 328 days ago

> Nowadays there's probably a solution based on LLMs, but I don't trust them with this kind of data

If it works with a small model I can run locally, I might consider this approach; otherwise I'll skip it.


jgalt212 328 days ago

> Nowadays there's probably a solution based on LLMs, but I don't trust them with this kind of data

In practice, the flow from my perspective looks like LLM parser -> normalizer -> validator. So you only save one step (parser), and given the unique stochastic nature of the LLM output, the normalizer and validator can be trickier to write than one used for an old-fashioned rules-based parser. But each situation is different, or YMMV.
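
To make the normalizer and validator steps concrete, here is a rough sketch of what they might look like for rows that a parser (LLM or otherwise) has already extracted. Everything here is an assumption for illustration: the three-column row layout, the MM/DD date format, the amount formats handled, and the availability of opening and closing balances to check against.

```python
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal, InvalidOperation


@dataclass
class Transaction:
    posted: date
    description: str
    amount: Decimal


def normalize_amount(text):
    """Turn '1,200.00', '$4.50', '(4.50)' or '4.50-' into a Decimal."""
    cleaned = text.strip().replace("$", "").replace(",", "")
    negative = False
    if cleaned.startswith("(") and cleaned.endswith(")"):
        negative, cleaned = True, cleaned[1:-1]
    elif cleaned.endswith("-"):
        negative, cleaned = True, cleaned[:-1]
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        raise ValueError(f"unparseable amount: {text!r}")
    if not value.is_finite():
        raise ValueError(f"unparseable amount: {text!r}")
    return -value if negative else value


def normalize_row(row, year):
    """Map one raw [date, description, amount] row to a Transaction.

    Assumption: dates are MM/DD and the year comes from elsewhere.
    """
    raw_date, description, raw_amount = row
    posted = datetime.strptime(f"{raw_date.strip()}/{year}", "%m/%d/%Y").date()
    # Collapse runs of whitespace that extraction often leaves behind.
    return Transaction(posted, " ".join(description.split()),
                       normalize_amount(raw_amount))


def validate(transactions, opening_balance, closing_balance):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    total = sum((t.amount for t in transactions), Decimal("0"))
    if opening_balance + total != closing_balance:
        problems.append(
            f"balance mismatch: {opening_balance} + {total} != {closing_balance}"
        )
    for earlier, later in zip(transactions, transactions[1:]):
        if later.posted < earlier.posted:
            problems.append(f"out-of-order date: {later.posted}")
    return problems


# Hypothetical parser output for two statement lines.
raw_rows = [["01/02", "COFFEE  SHOP", "(4.50)"],
            ["01/03", "PAYROLL", "$1,200.00"]]
transactions = [normalize_row(r, 2025) for r in raw_rows]
problems = validate(transactions, Decimal("100.00"), Decimal("1295.50"))
print(problems or "ok")
```

The balance check is the part that matters most with a stochastic parser: a single misread digit or dropped row changes the sum, so it gets flagged instead of passing through quietly.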
