by danso 2757 days ago
Given how popular the "simple" conversion of regular PDF forms/tables continues to be -- even among the technically sophisticated HN audience [0] -- if Amazon can deliver on OCR-to-data, that feels like a huge achievement. Not as sexy (or creepy) as Rekognition, perhaps, but almost certainly more useful day-to-day to the many, many professionals who work with documents and legacy data entry systems.

[0] https://hn.algolia.com/?query=pdf%20convert&sort=byPopularit...

- https://news.ycombinator.com/item?id=18199708

- https://news.ycombinator.com/item?id=5487530

2 comments

Agreed. Anything that can lighten the load of having to write custom scripts to handle PDF-to-data conversions will be helpful.

I do maintain some level of skepticism though. It is OCR :D

Even if AWS goes the cynical route of making Textract an upsell to MTurk -- e.g. the Textract output is not reliable enough on its own, but is structured for easy piping to an MTurk job -- that's got to be useful for the many folks who send entire pages to MTurk when they just need a couple of boxes proofread.
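
To make the "just proofread a couple of boxes" idea concrete, here is a rough sketch of the routing step. It assumes the OCR step gives you a per-field confidence score; the `ExtractedField` shape, the `split_for_review` helper, the 90.0 threshold, and the sample values are all made up for illustration and are not Textract's or MTurk's actual API.

```python
from dataclasses import dataclass


@dataclass
class ExtractedField:
    """One OCR'd form field (hypothetical shape, not a real Textract type)."""
    name: str
    value: str
    confidence: float  # assumed to be a 0-100 score from the OCR step


def split_for_review(fields, threshold=90.0):
    """Split fields into those accepted as-is and those needing a human check.

    The threshold is an arbitrary assumption; tune it against real data.
    """
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Made-up sample output for a single scanned form.
fields = [
    ExtractedField("station", "WXYZ-TV", 99.1),
    ExtractedField("amount", "$1,2O0", 62.5),  # letter O misread as zero
    ExtractedField("date", "2012-10-01", 97.3),
]

accepted, needs_review = split_for_review(fields)

# Only the low-confidence boxes would be sent on to a human proofreading job.
for field in needs_review:
    print(f"review: {field.name} = {field.value!r} ({field.confidence})")
```

The point is just that only the `needs_review` list goes to a crowdsourced job, rather than the entire page.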

As an example of a more scripted/structured job, ProPublica built out a crowdsourcing framework in Rails to extract data from FCC filings. But even that was quite difficult, because every state/TV station has its own kind of form: https://projects.propublica.org/free-the-files/

Google Cloud Vision and Microsoft Cognitive Services compete with Amazon Rekognition, but AFAIK there's no offering from a FAANG that competes with AWS Textract.

It looks like it's competing with ABBYY (FlexiCapture) and Kofax.