| HN Mirror



	by jgresula 5923 days ago
	That's a non-trivial task. There are no objects such as tables, styles, lists, or paragraphs in PDF, so you would need to reconstruct this kind of information. Also, text and vector graphics are positioned absolutely. Tagged PDFs contain some meta-information about the document structure, which could help, but it is still a lot of work. The fundamental problem is that PDF stores the document's presentation, while HTML defines the document and leaves the presentation to the browser. And obviously, restoring a document definition from its presentation is hard, as a lot of information is missing.


dpapathanasiou 5923 days ago

That's a non-trivial task.

Yes, that's true.

I only bring it up b/c if your goal is to turn pdfcrowd into an app that people would pay money for (and I would be one of them), solving that problem would go a long way towards achieving it.


thepsi 5923 days ago

Solving it perfectly is non-trivial (I've known entire PhDs to be spent working on a small subset of the problem). There are a number of products/projects that solve it to some extent (techniques include absolute positioning & making sweeping assumptions about what constitutes a paragraph). Would that be enough for you to consider paying for, given that their assumptions/workarounds might produce HTML files that aren't quite to your 'taste'?
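
To make the "sweeping assumptions" idea concrete, here is a minimal sketch of one such heuristic: treat fragments at roughly the same height as one line, and start a new paragraph whenever the vertical gap between lines is large. The `Fragment` type, the function names, and the thresholds `line_tol` and `para_gap` are all made up for illustration; they stand in for whatever a real PDF parser would report, and are not taken from any of the products mentioned.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Fragment:
    """A piece of text with an absolute position (hypothetical structure)."""
    x: float    # distance from the left edge of the page, in points
    y: float    # distance from the top of the page, in points
    text: str


def group_into_paragraphs(fragments, line_tol=2.0, para_gap=18.0):
    """Group fragments into lines, then lines into paragraphs.

    Assumptions (arbitrary, for illustration only):
    - fragments whose y is within line_tol of a line's first fragment
      belong to that line;
    - a vertical gap larger than para_gap starts a new paragraph.
    """
    lines = []  # list of (y, [fragments on that line])
    for frag in sorted(fragments, key=lambda f: (f.y, f.x)):
        if lines and abs(frag.y - lines[-1][0]) <= line_tol:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))

    paragraphs = []  # list of lists of line strings
    prev_y = None
    for y, frags in lines:
        # Within a line, read fragments left to right.
        text = " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        if prev_y is None or y - prev_y > para_gap:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
        prev_y = y
    return [" ".join(lines_in_par) for lines_in_par in paragraphs]


def to_html(paragraphs):
    """Wrap each reconstructed paragraph in a <p> element."""
    return "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)


sample = [
    Fragment(72, 100, "Hello"),
    Fragment(110, 100.5, "world,"),
    Fragment(72, 114, "second line."),
    Fragment(72, 150, "New paragraph."),
]
print(to_html(group_into_paragraphs(sample)))
```

With the sample above, the first three fragments end up in one `<p>` and the last one in a second. Multi-column layouts, tables, and footnotes all break this kind of guess, which is roughly why the output may not be to everyone's taste.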


latortuga 5923 days ago

There are already many apps and pieces of software that charge for the feature he already has, so I don't see why it is a requirement for him to monetize. It would definitely be an easy feature to charge for, but I think what he already has has potential.


brandnewlow 5923 days ago

Total noob question, couldn't you programmatically capture a browsershot and then convert that into a PDF?

HTML -> png seems to have been figured out. Is .png -> pdf that hard to do?


vibhavs 5923 days ago

No, .png to .pdf is not difficult.

I believe dpapathanasiou's suggestion is not to blindly convert a PDF into an HTML file containing one giant image of the PDF.

Instead, he wants to create an HTML document that maintains the same content and layout as the PDF.
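
For what it's worth, the crudest way to keep both the text and the layout is the absolute-positioning technique mentioned elsewhere in the thread: emit each piece of text as its own absolutely positioned element. The sketch below assumes the text has already been extracted as `(x, y, text)` tuples with `y` measured from the top of the page; the function name and the default page size (US Letter in points) are assumptions for illustration.

```python
from html import escape


def positioned_html(fragments, page_width=612, page_height=792):
    """Render (x, y, text) tuples as absolutely positioned spans.

    Assumes y is measured from the top of the page, in points. PDF's
    default coordinate system has its origin at the bottom-left, so a
    real converter would have to flip y before calling this.
    """
    spans = [
        f'<span style="position:absolute;left:{x}pt;top:{y}pt">'
        f"{escape(text)}</span>"
        for x, y, text in fragments
    ]
    return (
        f'<div style="position:relative;'
        f'width:{page_width}pt;height:{page_height}pt">\n'
        + "\n".join(spans)
        + "\n</div>"
    )


print(positioned_html([(72, 100, "Hello"), (72, 114, "a < b")]))
```

The result keeps the text selectable and in the right place, but it carries no paragraphs, lists, or tables, so it does not reflow or edit like a normal HTML document.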


brandnewlow 5923 days ago

D'Oh! Got myself mixed up there a bit.


dmv 5923 days ago

NitroPDF does a remarkably good job translating PDF to Doc and RTF. I think the application (windows :() is better/has more output options, but they have a free web service: http://www.pdftoword.com/
