by llm_trw 638 days ago
I'm currently solving this problem for work and thinking of a spin out, what's a ballpark figure you'd be willing to pay per 1000 pages for 99.999% character level accuracy?
3 comments

At least for my use case, which is layout processing (i.e. tables must come out in some kind of table format), the OCR part (Azure Document AI or AWS Textract) dominates the cost.

Running OCR on a document is twice as expensive as processing the output with the most expensive GPT offering. Intuitively, this was kind of unexpected for me. I only realized it when I ran the numbers in Excel.

If you’re able to halve the pricing for Layout output then you’re unblocking lots of use cases out there.
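The comparison above can be reproduced with a few lines of arithmetic. This is only a sketch: every price and the tokens-per-page figure below are hypothetical placeholders chosen to give the 2x ratio described, not quoted prices from Azure, AWS, or OpenAI. Plug in your own numbers.

```python
def ocr_cost_per_page(price_per_1000_pages: float) -> float:
    """OCR services are typically billed per page."""
    return price_per_1000_pages / 1000


def llm_cost_per_page(tokens_per_page: int, price_per_million_tokens: float) -> float:
    """LLM APIs are typically billed per token."""
    return tokens_per_page * price_per_million_tokens / 1_000_000


# All three values are HYPOTHETICAL placeholders, not real price-list entries.
OCR_PRICE_PER_1000_PAGES = 10.0      # dollars, layout/table extraction tier
LLM_PRICE_PER_MILLION_TOKENS = 10.0  # dollars, input tokens
TOKENS_PER_PAGE = 500                # rough guess for a text-heavy page

ocr = ocr_cost_per_page(OCR_PRICE_PER_1000_PAGES)
llm = llm_cost_per_page(TOKENS_PER_PAGE, LLM_PRICE_PER_MILLION_TOKENS)
ratio = ocr / llm

print(f"OCR: ${ocr:.4f}/page, LLM: ${llm:.4f}/page, ratio: {ratio:.1f}x")
print(f"OCR at half price: ${ocr / 2:.4f}/page")
```

With these placeholder inputs, halving the OCR price would bring it level with the LLM step, which is the threshold the comment is pointing at.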

> I'm currently solving this problem for work and thinking of a spin out, what's a ballpark figure you'd be willing to pay per 1000 pages for 99.999% character level accuracy?

I guess anything up to 5 ¢ per page would be acceptable. But I'm afraid my company wouldn't be a customer. We are in Germany and we deal with particularly protected private data, so there is no chance that we would exfiltrate this data to a cloud service.
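For reference, the question was asked per 1000 pages and the answer given per page, so here is a quick conversion, plus what 99.999% character-level accuracy would mean in absolute terms. The characters-per-page figure is an assumption for illustration, not something stated in the thread.

```python
def dollars_per_1000_pages(cents_per_page: int) -> float:
    """Convert a per-page price in cents to dollars per 1000 pages."""
    return cents_per_page * 1000 / 100


def expected_char_errors(pages: int, chars_per_page: int, one_error_in: int) -> float:
    """Expected wrong characters when, on average, one in `one_error_in` is wrong."""
    return pages * chars_per_page / one_error_in


price = dollars_per_1000_pages(5)  # the 5 cents per page from the comment

# 99.999% accuracy means one wrong character in 100,000.
# CHARS_PER_PAGE is a HYPOTHETICAL figure; dense pages can be much higher.
CHARS_PER_PAGE = 2000
errors = expected_char_errors(pages=1000, chars_per_page=CHARS_PER_PAGE, one_error_in=100_000)

print(f"${price:.2f} per 1000 pages")
print(f"about {errors:.0f} wrong characters per 1000 pages")
```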

What's the total spend per quarter? For a margin that fat I'd be willing to jump through a lot of hoops if you're doing enough pages.

The models (currently) fit in 24 GB of VRAM when run sequentially with small enough batch sizes, so a local server with consumer-grade GPUs wouldn't be impossible.
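To illustrate why running sequentially matters for a 24 GB card, here is a rough sizing sketch. It assumes that "sequentially" means only one model is resident in VRAM at a time, so the peak is the largest single stage rather than the sum. The model names and all memory figures are made up for the example and say nothing about the actual pipeline.

```python
from dataclasses import dataclass


@dataclass
class Model:
    name: str
    weights_gb: float          # memory for the weights
    act_gb_per_item: float     # extra memory per item in a batch


def peak_vram_gb(models: list[Model], batch_size: int, sequential: bool) -> float:
    """Peak VRAM: the largest stage if run one at a time, the sum if all are resident."""
    per_model = [m.weights_gb + batch_size * m.act_gb_per_item for m in models]
    return max(per_model) if sequential else sum(per_model)


def largest_batch(models: list[Model], budget_gb: float, sequential: bool, limit: int = 1024) -> int:
    """Largest batch size that stays within the budget (0 if even batch size 1 does not fit)."""
    best = 0
    for batch_size in range(1, limit + 1):
        if peak_vram_gb(models, batch_size, sequential) > budget_gb:
            break
        best = batch_size
    return best


# HYPOTHETICAL pipeline stages and sizes, for illustration only.
pipeline = [
    Model("layout", weights_gb=6.0, act_gb_per_item=1.0),
    Model("recognizer", weights_gb=14.0, act_gb_per_item=2.0),
    Model("tables", weights_gb=9.0, act_gb_per_item=1.5),
]

seq_batch = largest_batch(pipeline, budget_gb=24.0, sequential=True)
all_batch = largest_batch(pipeline, budget_gb=24.0, sequential=False)
print(f"sequential: batch size {seq_batch}; all resident: batch size {all_batch}")
```

With these made-up sizes the stages fit one at a time with a small batch, while keeping all three loaded at once would not fit at all.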

I'll check and get back to you. How can I reach you?
Email at omni_vision_ai@proton.me
I guess it depends on the use case, but if its error rate is lower than the error rate that already exists in the source document, then it would be difficult to argue against.

Specific things like evidentiary use would want 100% but that's at a level where any document processing would be suspect.

What is the typical range for error rate in PDF generation in various fields? Even robust technical documents have the occasional typo.

I'm not using generative models to fill in details not present in the original document. If there's a typo there then there will be a typo in the transcript. If you want to fix that then you can run another model on top of it.
I realise that. The point is that a user is implicitly committing to the baseline error rate of whatever process created the document. If any additional loss is insignificant in proportion to that error rate, then it would be unreasonable to reject it on that basis.
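A small sketch of that argument in numbers. The OCR error rate comes from the 99.999% figure in the original question; the source typo rate is a hypothetical value, since nobody in the thread supplied one, and the two error sources are assumed to be independent.

```python
def combined_error_rate(source_rate: float, ocr_rate: float) -> float:
    """Chance a character is wrong after both steps, assuming independent errors."""
    return 1 - (1 - source_rate) * (1 - ocr_rate)


OCR_RATE = 1e-5     # 99.999% character-level accuracy, from the question
SOURCE_RATE = 1e-4  # HYPOTHETICAL: one typo per 10,000 characters in the source

combined = combined_error_rate(SOURCE_RATE, OCR_RATE)
relative_increase = (combined - SOURCE_RATE) / SOURCE_RATE

print(f"source only: {SOURCE_RATE:.6f}, after OCR: {combined:.6f}")
print(f"OCR adds roughly {relative_increase:.0%} on top of the baseline")
```

Under these assumptions the transcript is only slightly worse than the document it came from. If the source were cleaner than the OCR, the picture would reverse and OCR would become the dominant error source.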
You're right. For my API that prepares PDFs for LLMs, fixing typos makes sense. But yeah, keeping original text is crucial for most OCR tasks.