by dimitri-vs
332 days ago
Have you tried datalab-to/marker with the "Use LLM" option? They have a playground where you can test it (https://www.datalab.to/playground), but I use their local CLI option: https://github.com/datalab-to/marker

I just tried it on a fairly ugly TD Bank statement PDF I have, and the markdown of the whole PDF (tables and all) is very accurate. Here is the config I use:

  marker_single --format_lines --use_llm --llm_service marker.services.gemini.GoogleGeminiService --gemini_model_name gemini-2.5-flash --disable_image_extraction --output_format markdown --output_dir "$OutDir" `
    "$In"

You might be able to tell the LLM to output the data directly in CSV format (granted, it will still be in a .md file) using `--block_correction_prompt`, which apparently is "useful for custom formatting or logic that you want to apply to the output".
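
For anyone scripting this, here is a rough Python sketch of the same invocation with the CSV idea bolted on. It only uses the flags quoted in the command above plus `--block_correction_prompt`; the function names, the file paths, and the prompt wording are made up for illustration, and I have not verified how well the model follows such a prompt.

```python
import subprocess


def build_marker_command(pdf_path, out_dir, block_correction_prompt=None):
    """Build the marker_single argument list from the flags quoted in the comment.

    block_correction_prompt is optional; its wording is entirely up to the
    caller (the example prompt in the usage note is hypothetical).
    """
    cmd = [
        "marker_single",
        "--format_lines",
        "--use_llm",
        "--llm_service", "marker.services.gemini.GoogleGeminiService",
        "--gemini_model_name", "gemini-2.5-flash",
        "--disable_image_extraction",
        "--output_format", "markdown",
        "--output_dir", str(out_dir),
    ]
    if block_correction_prompt:
        cmd += ["--block_correction_prompt", block_correction_prompt]
    # The input PDF goes last, mirroring the original command.
    cmd.append(str(pdf_path))
    return cmd


def run_marker(pdf_path, out_dir, block_correction_prompt=None):
    """Run marker_single; assumes it is installed and on PATH."""
    cmd = build_marker_command(pdf_path, out_dir, block_correction_prompt)
    # check=True raises CalledProcessError if marker_single exits non-zero.
    subprocess.run(cmd, check=True)
```

Usage would look something like `run_marker("statement.pdf", "out", "Rewrite every table as CSV rows.")`. Passing the arguments as a list avoids the shell-quoting and line-continuation issues of the one-liner.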
|
If it works with a small model I can run locally, I might consider this approach; otherwise I'll skip it.