by m3nu
2372 days ago
Automatically finding specific boxes/fields is quite interesting. I maintain a Python package [1] that processes invoices using a template/regex-based approach. It works alright, but eventually runs into some limitations. The box model from the article could push it further.
[1]: https://github.com/invoice-x/invoice2data
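For anyone unfamiliar with the idea, here is a rough sketch of what a template/regex-based approach can look like. It is not invoice2data's actual API or template format; `TEMPLATE`, `extract_fields`, and `SAMPLE` are hypothetical names made up for illustration, and a plain dict stands in for what would normally be a YAML file.

```python
import re

# Hypothetical template: one regex per field, each with a single capture group.
# Illustrative only; this is not invoice2data's real template schema.
TEMPLATE = {
    "issuer": "ACME Corp",
    "keywords": ["ACME Corp"],  # every keyword must appear for the template to apply
    "fields": {
        "invoice_number": r"Invoice\s+No\.?\s*:?\s*(\w+)",
        "date": r"Date\s*:?\s*(\d{4}-\d{2}-\d{2})",
        "amount": r"Total\s*:?\s*\$?(\d+\.\d{2})",
    },
}


def extract_fields(text, template):
    """Return field -> captured string, or None if the template does not apply."""
    if not all(keyword in text for keyword in template["keywords"]):
        return None
    result = {"issuer": template["issuer"]}
    for name, pattern in template["fields"].items():
        match = re.search(pattern, text)
        # A field whose regex finds nothing is reported as None rather than guessed.
        result[name] = match.group(1) if match else None
    return result


# Made-up text standing in for the OCR/PDF-to-text output of an invoice.
SAMPLE = """ACME Corp
Invoice No: A1234
Date: 2019-03-01
Total: $99.50
"""

print(extract_fields(SAMPLE, TEMPLATE))
```

The limitation shows up quickly: any issuer or layout without a matching template returns nothing, and a field that moves or changes its label needs a new regex.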
|
I leaned heavily on AWS Textract for the bounding boxes though, as the kind of data I had to extract didn't have very well-defined fields. I used some of the techniques described in this link [0], particularly around table extraction.
I really like how you define the fields in YAML though. I defined mine in code and it ended up being a bit messy.
[0]: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-p...