|
|
|
|
|
by nottorp
315 days ago
|
|
How many different PDF generators have done those millions of PDFs tho? Because you're right if you're paid to evaluate all the formats with the Mark 1 eyeball and do a custom parser for each. It sounds like it's feasible for your application. If you want a generic solution that doesn't rely on a human spending a week figuring out that those 4 absolutely positioned text fields are the invoice number together (and in order 1 4 2 3), maybe you're wrong. Source: I don't parse pdfs for a living, but sometimes I have to select text out of pdf schematics. A lot of times I just give up and type what my Mark 1 eyeball sees in a text editor. |
|
So it is absurd to pretend you can solve the rendering problem by rendering it into an image instead of a structured format. By rendering it into a raster, now you have 3 problems, parsing the PDF, rendering quality raster, then OCR'ing the raster. It is mind numbingly absurd.