by mpeg
2372 days ago
Hey, this is great. I made something ad hoc to do this for a client and might borrow some ideas to improve it. I leaned heavily on AWS Textract for the bounding boxes, though, as the kind of data I had to extract didn't have very well-defined fields. I used some of the techniques described in this link [0], particularly around table extraction. I really like how you define the fields in YAML; I defined mine in code and it ended up being a bit messy.

[0]: https://datascience.blog.wzb.eu/2017/02/16/data-mining-ocr-p...
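
To illustrate the bounding-box part: a minimal sketch, not the client code. It assumes a response dict shaped like Textract's documented output, where each entry in `Blocks` has a `BlockType` and a `Geometry.BoundingBox` whose `Left`, `Top`, `Width` and `Height` are ratios of the page size. The helper name `line_boxes`, the sample response and the page dimensions are all made up for the example.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def line_boxes(response: Dict, page_width: int, page_height: int) -> List[Tuple[str, Box]]:
    """Return (text, pixel box) for every LINE block in a Textract-style response."""
    boxes = []
    for block in response.get("Blocks", []):
        # Skip PAGE, WORD, TABLE, etc.; only whole text lines are kept here.
        if block.get("BlockType") != "LINE":
            continue
        bb = block["Geometry"]["BoundingBox"]
        # Coordinates are relative (0..1), so scale them by the page size.
        left = round(bb["Left"] * page_width)
        top = round(bb["Top"] * page_height)
        right = round((bb["Left"] + bb["Width"]) * page_width)
        bottom = round((bb["Top"] + bb["Height"]) * page_height)
        boxes.append((block["Text"], (left, top, right, bottom)))
    return boxes


# Hand-written sample in the same shape as a real response (hypothetical data).
sample = {
    "Blocks": [
        {
            "BlockType": "PAGE",
            "Geometry": {"BoundingBox": {"Left": 0.0, "Top": 0.0, "Width": 1.0, "Height": 1.0}},
        },
        {
            "BlockType": "LINE",
            "Text": "Invoice No: 1234",
            "Geometry": {"BoundingBox": {"Left": 0.25, "Top": 0.5, "Width": 0.5, "Height": 0.125}},
        },
    ]
}

print(line_boxes(sample, page_width=1000, page_height=800))
# → [('Invoice No: 1234', (250, 400, 750, 500))]
```

Once the boxes are in pixel coordinates, matching them against field regions (whether those are defined in YAML or in code) is a plain rectangle-overlap check.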