by m3nu 2372 days ago
Automatically finding specific boxes/fields is quite interesting. I maintain a Python package[1] that processes invoices using a template/regex-based approach. It works alright, but eventually runs into some limitations. The box model from the article could push it further.

1: https://github.com/invoice-x/invoice2data
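
For readers unfamiliar with the template/regex idea, here is a minimal sketch of what such an approach can look like. It is not invoice2data's actual template format or API: the `TEMPLATE` dict, the `extract_fields` helper, the patterns, and the sample text are all hypothetical and made up for illustration.

```python
import re

# Hypothetical template: field name -> regex with exactly one capture group.
# Illustration only; this is NOT invoice2data's real template format.
TEMPLATE = {
    "invoice_number": r"Invoice\s+No\.?\s*[:#]?\s*(\w+)",
    "date": r"Date\s*:\s*(\d{4}-\d{2}-\d{2})",
    "amount": r"Total\s*:\s*\$?\s*([\d,]+\.\d{2})",
}


def extract_fields(text, template):
    """Return the first capture group for each field, or None if no match."""
    result = {}
    for field, pattern in template.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        result[field] = match.group(1) if match else None
    return result


# Made-up sample text standing in for OCR/PDF-to-text output.
sample = "Invoice No: A1234\nDate: 2019-05-01\nTotal: $1,250.00"
fields = extract_fields(sample, TEMPLATE)
```

The limitation shows up quickly: each new layout needs its own patterns, and the regexes only see a flat text stream with no knowledge of where a value sits on the page.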

1 comment

Hey, this is great. I made something ad hoc to do this for a client and might borrow some ideas to improve it.

I leaned heavily on AWS Textract for the bounding boxes though, as the kind of data I had to extract didn't have very well-defined fields. I used some of the techniques described in this link [0], particularly around table extraction.

I really like how you define the fields in YAML though. I defined mine in code and it ended up being a bit messy.

[0]: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-p...
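
As a rough sketch of the bounding-box side: Textract returns a list of blocks, each with a `BlockType`, a `Text`, and a `Geometry.BoundingBox` whose `Left`, `Top`, `Width`, and `Height` are ratios of the page size. Something like the following could pull the words out of a region of the page. The `words_in_region` helper, the region coordinates, and the sample blocks are hypothetical and made up for illustration, not what I actually ran for the client, and it assumes the field fits on a single line.

```python
def words_in_region(blocks, left, top, right, bottom):
    """Join the WORD blocks whose box center falls inside the given region.

    Coordinates are page ratios (0.0 to 1.0), as in Textract's BoundingBox.
    Assumes a single-line field, so words are ordered left to right only.
    """
    hits = []
    for block in blocks:
        if block.get("BlockType") != "WORD":
            continue  # skip PAGE, LINE, and other block types
        box = block["Geometry"]["BoundingBox"]
        cx = box["Left"] + box["Width"] / 2
        cy = box["Top"] + box["Height"] / 2
        if left <= cx <= right and top <= cy <= bottom:
            hits.append((cx, block["Text"]))
    hits.sort()  # left-to-right by horizontal center
    return " ".join(text for _, text in hits)


def _block(block_type, text, left, top, width, height):
    """Build a block dict in the shape Textract uses (sample data helper)."""
    box = {"Left": left, "Top": top, "Width": width, "Height": height}
    return {"BlockType": block_type, "Text": text, "Geometry": {"BoundingBox": box}}


# Made-up sample blocks, deliberately out of reading order.
SAMPLE_BLOCKS = [
    _block("WORD", "1,250.00", 0.70, 0.80, 0.10, 0.02),
    _block("LINE", "Total: 1,250.00", 0.60, 0.80, 0.20, 0.02),
    _block("WORD", "Invoice", 0.10, 0.05, 0.10, 0.02),
    _block("WORD", "Total:", 0.60, 0.80, 0.08, 0.02),
]

total_text = words_in_region(SAMPLE_BLOCKS, 0.5, 0.75, 1.0, 0.9)
```

Keeping the region coordinates as data rather than code is roughly where a YAML definition per field would help.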