by fl0under 485 days ago
Looks cool!

You may also be interested in olmOCR, the OCR tool Allen AI just released [1][2]. They say "convert a million PDF pages for only $190 USD".

[1] https://github.com/allenai/olmocr
[2] https://arxiv.org/abs/2502.18443

1 comments

The issue with that promise is that anyone can convert PDFs. The question is whether the conversions are correct, or whether you get

Income Expenses 200 100

on one document, and

Income Expenses 20 0100

on others.

There's no shortage of products that tried to solve this problem from scratch (or by piggybacking on other projects) and called it a day without worrying about the huge problem that is quality and parseability.

The most robust players just give you the coordinates of a glyph and you are on your own: Textract, PDFBox.
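
To make "on your own" concrete, here is a minimal sketch of the kind of post-processing you end up writing once you have coordinates. The `Word` record and `assign_columns` helper are hypothetical, made up for illustration, and are not the Textract or PDFBox API; the coordinates are invented too. It assumes each extracted fragment comes with a left and right x position, and assigns every fragment to the header whose horizontal centre is nearest.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical extracted fragment: text plus horizontal extent and row position."""
    text: str
    x0: float  # left edge
    x1: float  # right edge
    y: float   # vertical position of the row


def centre(word: Word) -> float:
    return (word.x0 + word.x1) / 2


def assign_columns(headers: list[Word], values: list[Word]) -> dict[str, str]:
    """Attach each value fragment to the header with the nearest horizontal centre."""
    columns: dict[str, list[str]] = {h.text: [] for h in headers}
    # Left-to-right order keeps split fragments ("20" + "0") in reading order.
    for value in sorted(values, key=lambda w: w.x0):
        nearest = min(headers, key=lambda h: abs(centre(h) - centre(value)))
        columns[nearest.text].append(value.text)
    return {name: "".join(parts) for name, parts in columns.items()}


# Invented coordinates: the extractor split "200" into "20" and "0".
headers = [Word("Income", 50, 100, 10), Word("Expenses", 150, 210, 10)]
values = [Word("20", 60, 78, 30), Word("0", 79, 88, 30), Word("100", 165, 195, 30)]

print(assign_columns(headers, values))
```

Text order alone can't tell you whether the stray "0" belongs to Income or Expenses; the x positions can. Real documents need more than this (multi-line cells, right-aligned numbers, rotated pages), which is exactly the work those tools leave to you.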