by joakleaf
317 days ago
We do a lot of PDF parsing and basically break the structure into 'letter with font at position (box)', because the "structure" within the PDF is unreliable. We have algorithms that combine the individual letters into words, words into lines, and lines into boxes, all by looking at the geometry; that obviously includes identifying the spaces between words. We handle hidden text and problematic glyph-to-unicode tables. The output is similar to OCR, except we don't do the rasterization, and quality is higher because we don't depend on vision-based text recognition. I made the base implementation of all this in less than a month, 10 years ago, and we rarely, if ever, touch it. We also do machine learning on the structured output afterwards.
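To make the "letters to words, words to lines" idea concrete, here is a minimal sketch of purely geometric grouping. It is not the commenter's implementation: the `Glyph` type, the function names, and the thresholds `tol` and `gap_ratio` are all hypothetical, and it assumes upright, left-to-right text with PDF-style coordinates (y grows upward).

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """One positioned letter (hypothetical type; font info omitted for brevity)."""
    char: str
    x0: float  # left edge
    y0: float  # bottom edge
    x1: float  # right edge
    y1: float  # top edge


def group_lines(glyphs, tol=0.5):
    """Cluster glyphs into lines by vertical centre, top line first.

    A glyph joins the current line if its centre is within tol * its
    height of the centre of the glyph that started that line (assumed heuristic).
    """
    lines = []
    # PDF y grows upward, so sort by descending centre to go top to bottom.
    for g in sorted(glyphs, key=lambda g: -(g.y0 + g.y1) / 2):
        cy = (g.y0 + g.y1) / 2
        height = g.y1 - g.y0
        if lines and abs(lines[-1]["cy"] - cy) <= tol * height:
            lines[-1]["glyphs"].append(g)
        else:
            lines.append({"cy": cy, "glyphs": [g]})
    # Within each line, order glyphs left to right.
    return [sorted(line["glyphs"], key=lambda g: g.x0) for line in lines]


def split_words(line, gap_ratio=0.3):
    """Split one left-to-right line into words.

    A horizontal gap wider than gap_ratio * glyph height is treated as a
    space (assumed heuristic; real documents need per-font tuning).
    """
    words, current = [], [line[0]]
    for prev, g in zip(line, line[1:]):
        gap = g.x0 - prev.x1
        if gap > gap_ratio * (prev.y1 - prev.y0):
            words.append(current)
            current = [g]
        else:
            current.append(g)
    words.append(current)
    return ["".join(g.char for g in w) for w in words]


def extract_text(glyphs):
    """Return one string per detected line, with words joined by a single space."""
    return [" ".join(split_words(line)) for line in group_lines(glyphs)]
```

Feeding `extract_text` a shuffled list of glyph boxes yields lines in reading order regardless of the order in which the PDF emitted them, which is the point of ignoring the file's internal structure. A real system would also need to handle rotated text, multiple columns, and varying font sizes within a line.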
> quality is higher because we don't depend on vision based text recognition
This surprises me a bit; outside of an actual scan leaving the computer I’d expect PDF->image->text in a computer to be essentially lossless.