by just_myles 2763 days ago
Agreed. Anything that lightens the load of writing custom scripts to handle PDF-to-data conversions will be helpful.

I do maintain some level of skepticism though. It is OCR :D

1 comment

Even if AWS goes the cynical route of making Textract an upsell to MTurk -- e.g. the Textract output is not reliable enough on its own, but is structured for easy piping to an MTurk job -- that's got to be useful for the many folks who send entire pages to MTurk when they just need a couple of boxes proofread.
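
For what it's worth, here is a rough sketch of that piping in Python. It assumes the OCR response follows Textract's documented `Blocks` layout, where each block carries `BlockType`, `Confidence`, `Text` and `Geometry`. The function name `split_by_confidence` and the 90.0 threshold are made up for illustration, and actually submitting the leftovers to MTurk is left out.

```python
def split_by_confidence(response, threshold=90.0):
    """Split LINE blocks into (trusted, needs_review) by OCR confidence.

    `response` is assumed to be a dict shaped like a Textract result;
    `threshold` is an arbitrary cutoff, not a recommended value.
    """
    trusted, needs_review = [], []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "LINE":
            continue  # skip PAGE/WORD blocks in this sketch
        item = {
            "id": block.get("Id"),
            "text": block.get("Text", ""),
            # The bounding box lets a human task show only this snippet
            "bbox": block.get("Geometry", {}).get("BoundingBox"),
        }
        if block.get("Confidence", 0.0) >= threshold:
            trusted.append(item)
        else:
            needs_review.append(item)
    return trusted, needs_review


# Hand-written sample data, not real Textract output
sample = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1"},
        {"BlockType": "LINE", "Id": "a", "Text": "Station: WXYZ", "Confidence": 99.1},
        {"BlockType": "LINE", "Id": "b", "Text": "Am0unt: $l,200", "Confidence": 61.4},
    ]
}

trusted, needs_review = split_by_confidence(sample)
```

Only the items in `needs_review` (with their bounding boxes) would go to a human, rather than the whole page.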

As an example of a more scripted/structured job, ProPublica built out a crowdsourcing framework in Rails to extract data from FCC filings. But even that was quite difficult, because every state/TV station has its own kind of form: https://projects.propublica.org/free-the-files/