


	by yxhuvud 359 days ago
	Well, you clearly haven't parsed a wide variety of PDFs. Because if you had, you would have been exposed to PDFs that contain only images, or that contain embedded text that is utter nonsense and doesn't match what is shown on the page when rendered. And that is before we even get into text structure, because as everyone knows, reading text is easier if things like paragraphs, columns and tables are preserved in the output. And guess what: if you just use the parsing engine for that, what you get out is a garbled mess.
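
	To make the image-only and nonsense-text cases concrete, here is a rough sketch of the kind of check a pipeline might run before trusting a page's embedded text. It assumes the text layer has already been extracted into a string by some parser; the function names and thresholds (`looks_like_real_text`, `choose_strategy`, `min_chars`, `min_ratio`) are hypothetical, not from any library. It would not catch text that is readable-looking but wrong, e.g. from a bad character mapping.

```python
import string


def looks_like_real_text(text: str, min_chars: int = 20, min_ratio: float = 0.7) -> bool:
    """Rough heuristic (assumed thresholds): is the embedded text layer plausibly readable?"""
    # Drop all whitespace so layout spacing does not skew the ratio.
    stripped = "".join(text.split())
    if len(stripped) < min_chars:
        # Image-only page, or a text layer too small to trust.
        return False
    # Count letters, digits and ASCII punctuation as "readable".
    readable = sum(ch.isalnum() or ch in string.punctuation for ch in stripped)
    return readable / len(stripped) >= min_ratio


def choose_strategy(embedded_text: str) -> str:
    """Pick between the text layer and a render-then-OCR fallback."""
    if looks_like_real_text(embedded_text):
        return "use_text_layer"
    return "render_and_ocr"
```

	Even when this picks the text layer, it says nothing about paragraphs, columns or tables; that structure still has to be recovered separately.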

1 comment

throwaway4496 357 days ago

If your rendering engine doesn't output what is shown, your engine is broken, and it can be broken whether you render into a bitmap or into structured data.