by mpeg 2372 days ago
Hey, this is great. I made something ad hoc to do this for a client and might borrow some ideas to improve it.

I leaned heavily on AWS Textract for the bounding boxes, though, as the data I had to extract didn't have well-defined fields. I used some of the techniques described in this link [0], particularly around table extraction.
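
To illustrate the bounding-box step, here is a minimal sketch, assuming a response dict shaped like Textract's documented output: a `Blocks` list whose entries carry `BlockType`, `Text`, and a `Geometry.BoundingBox` given as ratios of the page size. The function name `line_boxes` and the page dimensions are hypothetical, and this is an illustration rather than the code from the project mentioned above.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def line_boxes(response: Dict, page_width: int, page_height: int) -> List[Tuple[str, Box]]:
    """Return (text, pixel box) for every LINE block in a Textract-style response.

    Assumes bounding boxes are ratios of the page size (Left, Top, Width, Height).
    """
    boxes = []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE, WORD, TABLE, CELL, etc.
        bb = block["Geometry"]["BoundingBox"]
        left = round(bb["Left"] * page_width)
        top = round(bb["Top"] * page_height)
        right = round((bb["Left"] + bb["Width"]) * page_width)
        bottom = round((bb["Top"] + bb["Height"]) * page_height)
        boxes.append((block.get("Text", ""), (left, top, right, bottom)))
    return boxes
```

Once the boxes are in pixel coordinates, they can be sorted or clustered by position, which is roughly what the table-extraction techniques in [0] build on.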

I really like how you define the fields in YAML, though. I defined mine in code and it ended up being a bit messy.

[0]: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-p...