by petesergeant
319 days ago
> instead of just using the "quality implementation" to actually get structured data out?

I suggest spending a few minutes using a PDF editor with some real-world PDFs, or even just copying and pasting text from a range of different PDFs. These files are made up of cute tricks and hacks that whatever produced them used to make something that visually works. The high-quality implementations just put the pixels where they're told to. The underlying "structured data" is a lie.

EDIT: I see from further down the thread that your experience of PDFs comes from programmatically generated invoice templates, which may explain why you think this way.
We have algorithms that combine individual letters into words, words into lines, and lines into boxes, all by looking at the geometry. That obviously includes identifying the spaces between words.
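To make the geometric idea concrete, here is a minimal sketch of letters-to-lines grouping and space detection. It is not the implementation described above: the `Glyph` type, the function names, and the `y_tol` and `gap_ratio` thresholds are all assumptions made up for illustration, and it assumes glyph boxes in PDF-style coordinates where y grows upward.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record: one character plus its geometry."""
    char: str
    x0: float    # left edge
    x1: float    # right edge
    y: float     # baseline
    size: float  # font size


def group_lines(glyphs, y_tol=0.3):
    """Cluster glyphs into lines by baseline proximity, top line first."""
    lines = []
    for g in sorted(glyphs, key=lambda g: (-g.y, g.x0)):
        # Same line if the baseline is within a fraction of the font size.
        if lines and abs(lines[-1][-1].y - g.y) <= y_tol * g.size:
            lines[-1].append(g)
        else:
            lines.append([g])
    # Restore left-to-right reading order inside each line.
    return [sorted(line, key=lambda g: g.x0) for line in lines]


def line_to_text(line, gap_ratio=0.25):
    """Join a line's glyphs, inserting a space where the horizontal gap is wide."""
    out = [line[0].char]
    for prev, cur in zip(line, line[1:]):
        if cur.x0 - prev.x1 > gap_ratio * cur.size:
            out.append(" ")
        out.append(cur.char)
    return "".join(out)


def extract_text(glyphs):
    """Letters to lines to text, using geometry only."""
    return "\n".join(line_to_text(line) for line in group_lines(glyphs))
```

Fixed thresholds like these are the crude part. Real documents have varying letter spacing, rotated text, and multiple columns, so a production version would presumably need more than this; grouping lines into boxes is left out entirely.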
We handle hidden text and problematic glyph-to-unicode tables.
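As a rough illustration of what such handling could look like, the sketch below separates usable text runs from hidden or suspect ones. Everything here is assumed for the example: the `TextRun` type, the field names, and the heuristics. The one fact it leans on is that PDF text rendering mode 3 means neither fill nor stroke, i.e. invisible text. Note that scanned PDFs often carry their OCR layer in exactly that mode, so whether to drop it depends on the use case.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    """Hypothetical text run as decoded from a page's content stream."""
    text: str
    x: float
    y: float
    render_mode: int  # PDF text rendering mode; 3 = invisible


def is_visible(run, page_width, page_height):
    """Reject invisible-mode text and text positioned outside the page."""
    if run.render_mode == 3:
        return False
    return 0 <= run.x <= page_width and 0 <= run.y <= page_height


def has_suspect_mapping(text):
    """True if text is empty or holds U+FFFD or private-use code points,
    which usually signals a missing or broken glyph-to-Unicode table."""
    return not text or any(
        ch == "\ufffd" or 0xE000 <= ord(ch) <= 0xF8FF for ch in text
    )


def split_runs(runs, page_width, page_height):
    """Return (good, suspect): visible runs with sane text, and visible
    runs whose mapping looks broken and needs some fallback."""
    good, suspect = [], []
    for run in runs:
        if not is_visible(run, page_width, page_height):
            continue  # hidden text is dropped in this sketch
        (suspect if has_suspect_mapping(run.text) else good).append(run)
    return good, suspect
```

What the fallback for the suspect runs should be (for example, recognizing the glyph shapes) is a separate problem that this sketch does not attempt.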
The output is similar to OCR, except that we skip the rasterization, and the quality is higher because we don't depend on vision-based text recognition.
I wrote the base implementation of all this in less than a month, 10 years ago, and we rarely, if ever, touch it.
We also do machine learning afterwards on the structured output.