by ashishb
335 days ago
I speak from experience that this is a bad idea. There are cases where documents contain text with characters that look the same in many fonts. For example, 0 and O look identical in many fonts. So if you have a doc/xls/PDF/html, you lose information by converting it into an image. For cases like serial numbers, not even humans can distinguish 0 vs O (or l vs I) by looking at them.
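
To make the point concrete, here is a minimal Python sketch using only the standard library. It assumes you already have the document's text as a string (how you extract it depends on the format and is not shown), and the serial number is a made-up example. The characters that look alike on screen are still distinct code points in the text, which is exactly the information an image throws away.

```python
import unicodedata


def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# Hypothetical serial number, as it might appear in a document's text layer.
serial = "A0O1lI"

for ch, code_point, name in describe(serial):
    # Each look-alike glyph prints with its own code point and name.
    print(ch, code_point, name)
```

Once the page is rendered to pixels, telling these characters apart depends on the font and on whatever reads the image, whether that is OCR or a person.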
|
For that reason, IMO rendering a PDF page as an image is a very reasonable way to extract information out of it.
For the other formats you mentioned, I agree that it is probably better to parse the document instead.