| HN Mirror

by ramoz 883 days ago

There's no OCR or AI involved here (other than the standard fallback).

What this library, and something like fitz/PyMuPDF, lets you do is extract the text straight from the PDF, using rules about how to parse and structure it. (For most modern PDFs you can extract the text without OCR.)

It's much cheaper, obviously, but it doesn't scale well across dynamic layouts, so you're likely using this when you can configure around a standard structure. That said, I have found rule-based text extraction to work fairly dynamically for things like scientific PDFs.
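
To make "rules" concrete, here's a rough sketch of one such rule for a two-column scientific paper: read the left column top to bottom, then the right. This isn't how the library in question does it, just an illustration. `reading_order` and `extract_page_blocks` are names I made up; the tuple layout is the one PyMuPDF's `page.get_text("blocks")` returns, and the "split at the page midline" rule is an assumption that will break on full-width titles, figures, or three-column layouts.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): the first five fields of the tuples returned by
# PyMuPDF's page.get_text("blocks").
Block = Tuple[float, float, float, float, str]


def reading_order(blocks: List[Block], page_width: float) -> List[str]:
    """Hypothetical rule: blocks starting left of the page midline form
    column 1, the rest form column 2; each column is read top to bottom."""
    mid = page_width / 2
    left = [b for b in blocks if b[0] < mid]
    right = [b for b in blocks if b[0] >= mid]
    ordered = sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
    # Drop whitespace-only blocks and trim the rest.
    return [b[4].strip() for b in ordered if b[4].strip()]


def extract_page_blocks(pdf_path: str, page_number: int = 0):
    """Pull text blocks straight from the PDF's text layer (no OCR)."""
    import fitz  # PyMuPDF; imported lazily so the rule above runs without it

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        # Each entry is (x0, y0, x1, y1, text, block_no, block_type);
        # block_type 0 is text, 1 is an image.
        blocks = [b[:5] for b in page.get_text("blocks") if b[6] == 0]
        return blocks, page.rect.width
```

Usage would be something like `blocks, width = extract_page_blocks("paper.pdf")` followed by `reading_order(blocks, width)`, where `paper.pdf` is a placeholder. The point is that the rule is a few lines of geometry over text the PDF already contains, which is why it's cheap, and also why it needs reconfiguring when the layout changes.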