by technicolorwhat 1931 days ago
Is there also a solution for automatic border detection? Last year I tried reading bank statements, which were scanned slips. Unfortunately, they didn't have any borders, which made it super difficult to extract content. Would be cool if someone could make something for this :) I thought it would be easy, but I broke my mind on it for several days until I gave up.
3 comments

https://github.com/eihli/image-table-ocr seems to automatically find tables within larger images, IDK if it works without borders though.
The logic for detecting a table is to get rid of everything but vertical lines over a certain length, save that in one image, then get rid of everything but horizontal lines of a certain length, save that image. Then overlay the two and take the bounding rectangle. So you don't need the table to have a border as long as you have vertical and horizontal lines and they extend far enough to encompass all the data you need.
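To make the steps above concrete, here is a minimal pure-Python sketch of that logic. It is not the repository's code: the binary-grid representation (1 = ink, 0 = background), the function names, and the `min_len` parameter are all assumptions made for illustration. A real implementation would more likely run morphological operations on the scanned image with an image library.

```python
from typing import List, Optional, Tuple

Grid = List[List[int]]  # assumed representation: 1 = ink, 0 = background


def keep_horizontal_runs(img: Grid, min_len: int) -> Grid:
    """Keep only ink pixels that sit in a horizontal run of at least min_len."""
    out = [[0] * len(row) for row in img]
    for y, row in enumerate(img):
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start >= min_len:  # long enough to count as a line
                    for i in range(start, x):
                        out[y][i] = 1
            else:
                x += 1
    return out


def transpose(img: Grid) -> Grid:
    """Swap rows and columns so vertical runs become horizontal ones."""
    return [list(col) for col in zip(*img)]


def keep_vertical_runs(img: Grid, min_len: int) -> Grid:
    """Keep only ink pixels that sit in a vertical run of at least min_len."""
    return transpose(keep_horizontal_runs(transpose(img), min_len))


def table_bounding_box(img: Grid, min_len: int) -> Optional[Tuple[int, int, int, int]]:
    """Overlay the two line images and return (left, top, right, bottom), inclusive."""
    horizontal = keep_horizontal_runs(img, min_len)
    vertical = keep_vertical_runs(img, min_len)
    overlay = [[h | v for h, v in zip(h_row, v_row)]
               for h_row, v_row in zip(horizontal, vertical)]
    ys = [y for y, row in enumerate(overlay) if any(row)]
    xs = [x for x, col in enumerate(transpose(overlay)) if any(col)]
    if not xs or not ys:
        return None  # no line long enough was found
    return (min(xs), min(ys), max(xs), max(ys))
```

Text and speckles form only short runs, so they drop out of both line images. The bounding rectangle then depends only on how far the long lines extend, which is why an outer border isn't required.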
Yep, reach out to the email in bio. It’s Mac-based right now; I’m working on Windows and Linux versions.
Azure FormRecognizer API
I am not sure if this works, since they are not forms but statements, i.e. there is no defined structure: only the columns are fixed width, while the rows are different sizes and have no borders. Would be cool if it worked though. I'll give it a go.
The name "form recognizer" is perhaps poorly chosen, considering it can detect much more than forms (e.g. invoices, receipts). You can create your own custom models as well.

Disclaimer: I work for the creator of said service.

Can confirm this is the best out there