by bingdig 1922 days ago
Nice! I have used quite a few tools like this to convert data that government agencies report in PDFs into CSVs. The biggest challenge that existing tools fail to adequately address is table formats that vary (e.g., increasing levels of indentation). Perhaps formatting those as JSON first would be easier.
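To make the JSON idea concrete, here is a minimal sketch, assuming the extraction step already yields (label, value) rows where nesting shows up as leading spaces in the label. The function name `rows_to_tree`, the `indent_width` parameter, and the sample figures are all made up for illustration and do not come from any particular tool.

```python
import json

def rows_to_tree(rows, indent_width=2):
    """Nest rows by the leading-space indentation of their labels.

    Assumes indentation is a consistent multiple of indent_width spaces.
    """
    root = {"label": "root", "value": None, "children": []}
    stack = [(-1, root)]  # (indent level, node); root sits below level 0
    for label, value in rows:
        stripped = label.lstrip(" ")
        level = (len(label) - len(stripped)) // indent_width
        node = {"label": stripped.strip(), "value": value, "children": []}
        # Pop back up until the top of the stack is a shallower row (the parent).
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((level, node))
    return root["children"]

# Hypothetical sample rows, not taken from a real filing.
rows = [
    ("Revenue", None),
    ("  Product", 120),
    ("  Services", 80),
    ("    Consulting", 30),
    ("Expenses", 150),
]
tree = rows_to_tree(rows)
print(json.dumps(tree, indent=2))
```

Once the hierarchy is explicit like this, flattening to CSV (for example, one column per nesting level) becomes a separate and simpler step.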

When you say increasing the level of indentation ... do you have an example handy? I'm working on a PDF / data tool (Word, Excel, docx, CSV) at the moment, and I think it's pretty robust to things like this.
Accounting tables often do this. It's not a perfect example, but the last page of this PDF gives a flavor of it: https://s23.q4cdn.com/574569502/files/doc_financials/2021/q4...
Yep, understandable. Right now it kicks out pdfs that don't fit the rules, but I think there are a few sensitivity variables / configs I can incorporate to make that seamless.