by ramoz
883 days ago
There’s no OCR or AI involved here (other than the standard fallback). What this library, and something like fitz/PyMuPDF, lets you do is extract the text straight from the PDF, using rules for how to parse and structure it. (With most modern PDFs you can extract the text without OCR.) It’s much cheaper, obviously, but it doesn’t scale well across dynamic layouts, so you’re likely using this when you can configure around a standard structure. That said, I have found rule-based text extraction to work fairly flexibly for things like scientific PDFs.
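To make "rules about how to parse and structure it" concrete, here is a rough sketch with PyMuPDF, not how the library under discussion actually works. It assumes a two-column page (typical for scientific PDFs) and applies one rule: read the left column top to bottom, then the right. The helper names `order_blocks` and `extract_pages` and the split at the page midpoint are my own assumptions for illustration.

```python
def order_blocks(blocks, page_width):
    """Return block texts in reading order for an assumed two-column page.

    `blocks` uses the tuple layout of PyMuPDF's page.get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type), where block_type 0 is text.
    """
    mid = page_width / 2
    # Keep only non-empty text blocks (drops image blocks).
    text_blocks = [b for b in blocks if b[6] == 0 and b[4].strip()]
    # Rule: a block whose horizontal centre is left of the page middle
    # belongs to the left column; within a column, sort top to bottom.
    text_blocks.sort(key=lambda b: ((b[0] + b[2]) / 2 >= mid, b[1], b[0]))
    return [b[4].strip() for b in text_blocks]


def extract_pages(path):
    """Extract ordered text per page; an empty list means no text layer."""
    import fitz  # PyMuPDF, imported here so order_blocks works without it

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            blocks = page.get_text("blocks")
            pages.append(order_blocks(blocks, page.rect.width))
    return pages
```

A page that comes back as an empty list has no embedded text, which is where you would hand off to an OCR fallback. The midpoint rule is also exactly the kind of thing that breaks on a different layout, which is the scaling problem mentioned above.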