Hacker News
by hnlmorg 638 days ago
AWS Textract does use ML and I’ve personally used it to parse tables for automated invoice processing.

You wouldn’t get a Markdown document automatically generated (or at least you couldn’t when I last used it a few years ago), but you did get an XML document.

That XML document was actually better for our purposes because it gave you a confidence score and was properly structured, so floating frames, tables and columns were preserved as such in the output document. This reduces the risk of hallucinations.
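To illustrate why a per-element confidence score is useful, here is a minimal sketch of routing low-confidence lines to manual review. It assumes the response has already been parsed into a list of dicts. The field names (BlockType, Text, Confidence) are modeled on Textract's block objects, while the sample data, the 90.0 threshold and the split_by_confidence helper are made up for illustration.

```python
# Hypothetical parsed output: one dict per detected line.
# Field names are modeled on Textract's block objects; the values are invented.
blocks = [
    {"BlockType": "LINE", "Text": "Invoice #1042", "Confidence": 99.4},
    {"BlockType": "LINE", "Text": "Tota1: 1,250.00", "Confidence": 71.8},
    {"BlockType": "LINE", "Text": "Due: 2024-03-01", "Confidence": 96.1},
]


def split_by_confidence(blocks, threshold=90.0):
    """Split blocks into (accepted, needs_review) using an assumed threshold."""
    accepted, needs_review = [], []
    for block in blocks:
        target = accepted if block["Confidence"] >= threshold else needs_review
        target.append(block)
    return accepted, needs_review


accepted, needs_review = split_by_confidence(blocks)
print("accepted:", [b["Text"] for b in accepted])
print("needs review:", [b["Text"] for b in needs_review])
```

The right threshold depends on the documents and on how costly a wrong value is, so treat 90.0 as a placeholder rather than a recommendation.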

It’s less of an out-of-the-box solution but that’s to be expected with AWS APIs.

2 comments

For a similar use case I’m using Azure Document AI. At least you can ask for Markdown/HTML output directly from it instead of parsing the output structure from Textract.

And it’s cheaper too.
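For a sense of what "parsing the output structure" involves, here is a rough sketch that rebuilds a Markdown table from a flat list of cells. It assumes each cell carries a 1-based RowIndex and ColumnIndex, as Textract's table cells do, and that the cell text has already been resolved into a Text field. That flattening step, the sample data and the cells_to_markdown helper are assumptions for illustration, not part of any service's API.

```python
# Hypothetical flattened cells; RowIndex/ColumnIndex are 1-based.
cells = [
    {"RowIndex": 1, "ColumnIndex": 1, "Text": "Item"},
    {"RowIndex": 1, "ColumnIndex": 2, "Text": "Amount"},
    {"RowIndex": 2, "ColumnIndex": 1, "Text": "Widget"},
    {"RowIndex": 2, "ColumnIndex": 2, "Text": "12.50"},
]


def cells_to_markdown(cells):
    """Render cells as a Markdown table, treating the first row as the header."""
    n_rows = max(c["RowIndex"] for c in cells)
    n_cols = max(c["ColumnIndex"] for c in cells)
    # Start from an empty grid so missing cells become blank entries.
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for c in cells:
        # Escape pipes so cell text cannot break the table layout.
        text = c["Text"].replace("|", "\\|")
        grid[c["RowIndex"] - 1][c["ColumnIndex"] - 1] = text
    lines = ["| " + " | ".join(grid[0]) + " |"]
    lines.append("| " + " | ".join("---" for _ in range(n_cols)) + " |")
    for row in grid[1:]:
        lines.append("| " + " | ".join(row) + " |")
    return "\n".join(lines)


markdown = cells_to_markdown(cells)
print(markdown)
```

Merged cells and multi-row headers are ignored here, which is part of why direct Markdown output from the service is convenient.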

You can get Markdown nowadays too, at least using this Python wrapper:

https://aws-samples.github.io/amazon-textract-textractor/not...

It's very consistent, though pricey.