Hacker News
by cmroanirgo 2764 days ago
Found some interesting tidbits in their FAQ [0]:

"Q: What type of text can Amazon Textract detect and extract?

A: Amazon Textract can detect Latin-script characters from the standard English alphabet and ASCII symbols."

So, English only. More worrying is that they're going to keep your company's documents:

"Q. Are document and image inputs processed by Amazon Textract stored, and how are they used by AWS?

A: Amazon Textract may store and use document and image inputs processed by the service solely to provide and maintain the service and to improve and develop the quality of Amazon Textract..."

"Q. Can I delete images and documents stored by Amazon Textract?

A: Yes. You can request deletion of document and image inputs associated with your account by contacting AWS Support. Deleting image and document inputs may degrade your Amazon Textract experience."

That said, I'm still baffled as to what value-add they're providing. From the name alone, I expected it to generate other documents of common types: .txt (without images), .doc, .html (zip). That is, a large part of extracting text is the ability to reflow the text across page boundaries & columns. However, this product states that:

"All extracted data is returned with bounding box coordinates" [1]

...which is how PDF documents lay things out in the first place. Have I missed something?
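
To make the gap concrete, here is a minimal sketch of what "reflowing" from bounding boxes might involve. The `Word` record and the `reflow` helper are hypothetical names made up for illustration, not anything from the Textract API; coordinates are assumed to be fractions of the page, with `top` growing downwards. It only handles a single column, which is exactly the easy part:

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical record: one recognized word plus its bounding box,
    # with coordinates assumed to be fractions of the page size.
    text: str
    left: float
    top: float
    width: float
    height: float


def reflow(words, line_tolerance=0.5):
    """Group words into visual lines by vertical position, then read each line left to right."""
    lines = []  # each entry is the list of words on one visual line
    for word in sorted(words, key=lambda w: (w.top, w.left)):
        # Treat the word as part of the current line if its top edge is within
        # a fraction of its own height of that line's first word.
        if lines and abs(word.top - lines[-1][0].top) < line_tolerance * word.height:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.left))
        for line in lines
    )


words = [
    Word("world", 0.30, 0.101, 0.20, 0.02),
    Word("Hello", 0.10, 0.100, 0.20, 0.02),
    Word("Second", 0.10, 0.150, 0.20, 0.02),
    Word("line", 0.35, 0.150, 0.10, 0.02),
]
print(reflow(words))
```

Multi-column pages, text that continues across page boundaries, and hyphenation would all need extra logic on top of this, which is the work I expected the service to do for me.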

[0] https://aws.amazon.com/textract/faqs/

[1] https://aws.amazon.com/textract/features/

4 comments

The point of this service is to train their own OCR models for use in other products like Kindle / their e-book store. There doesn't really need to be a value add: if people use it, it's a win for them; if people don't, it's not really a big loss.
But in order to train something you have to have the input of what is actually there, I don’t see how that is provided here.
There might be a way for users to rate the quality of the result, or at least report it if it is very wrong.
Think less about books, and more about automating input from forms filled out by hand. In working with this tech, I can say that none of it is great and it would be very nice to be able to ditch what's available for stuff that would work better.

For my employer's use case, the data storage and privacy implications are a non-starter.

Wonder if they will offer a local solution.
Shameless plug: I work on custom solutions that do this locally, shoot me an email if interested.
As tracker1 mentioned, don't think of this as a tool for reflowing text for different devices, but as a data capture and document processing solution.

Example: You are dealing with a lot of PDF documents that contain unstructured information (e.g. a filled form) and you need to extract bits of information (e.g. name, address) and output it in a structured format (e.g. JSON/XLS).
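
A rough sketch of that last step, assuming the recognition stage has already produced plain text lines of the form "Label: value". The `fields_to_json` helper and the sample lines are made up for illustration and do not come from any particular product:

```python
import json


def fields_to_json(lines, wanted):
    """Keep 'Label: value' lines whose label is in `wanted` and emit them as JSON."""
    record = {}
    for line in lines:
        label, sep, value = line.partition(":")
        key = label.strip().lower()
        # Skip lines without a colon and labels we were not asked for.
        if sep and key in wanted:
            record[key] = value.strip()
    return json.dumps(record, sort_keys=True)


# Hypothetical text lines recognized from a filled form.
lines = ["Name: Jane Doe", "Address: 1 Example St", "Signature"]
print(fields_to_json(lines, {"name", "address"}))
```

Real forms are messier (labels above or beside the values, handwriting, checkboxes), so in practice the pairing of labels and values is where most of the effort goes.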

Vendors keeping your documents and analyzing your business is nothing new, and I'm afraid it will not keep people from using this in their companies. At least, it doesn't stop people from using Windows and other M$ products.