	by bawolff 823 days ago
	As an aside, it's amazing how far the web has come, where the best way to make pretty PDF documents is to literally run a web browser on the server. This would have been so unthinkable back in the 90s & 2000s.
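For anyone who hasn't seen the "browser on the server" approach, here is a minimal Python sketch of it. It assumes a Chrome or Chromium binary is on the PATH under one of the guessed names in `CHROME_CANDIDATES`; the helper names `build_pdf_command` and `html_to_pdf` are made up for illustration. Only the `--headless` and `--print-to-pdf` switches are used.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Binary names to try. Which one exists depends on the install (assumption).
CHROME_CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chrome")


def build_pdf_command(chrome: str, source: str, output: str) -> List[str]:
    """Return the argv for one headless print-to-PDF run."""
    return [chrome, "--headless", f"--print-to-pdf={output}", source]


def html_to_pdf(source: str, output: str, timeout: float = 120.0) -> Path:
    """Render a URL or local HTML file to a PDF by shelling out to the browser."""
    chrome = None
    for name in CHROME_CANDIDATES:
        chrome = shutil.which(name)
        if chrome:
            break
    if chrome is None:
        raise FileNotFoundError("no Chrome/Chromium binary found on PATH")
    # check=True raises if the browser exits with a non-zero status.
    subprocess.run(build_pdf_command(chrome, source, output),
                   check=True, timeout=timeout)
    return Path(output)
```

Something like `html_to_pdf("report.html", "report.pdf")` would be the whole job. The timeout is there because a large input can keep the browser busy for a long time.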

4 comments

rajh 823 days ago

I needed to transform a 12MB HTML file into a PDF document and headless Chrome quickly ran out of memory (4GB+).

We are now using a commercial alternative that seems to be using a custom engine that implements the HTML and CSS specs. The result is reduced memory usage (below 512MB during my tests), and the resulting PDF is much smaller: 3.3MB vs 42MB.

nojvek 822 days ago

We use DocRaptor, which is based on the PrinceXML engine, but haven’t tried a huge PDF. We sometimes generate 20-30 page PDFs and it works great.

rajh 822 days ago

We are also using DocRaptor. It takes around 20 seconds to generate the PDF, and we only need to generate it every night. So the costs are also not an issue at the moment.

phonon 823 days ago

Did you try WeasyPrint?

rajh 822 days ago

Yes, I’ve tried all the open source projects I could find, including WeasyPrint and wkhtmltopdf. WeasyPrint was much slower than headless Chrome and also required a lot of memory to process the HTML. wkhtmltopdf is no longer maintained and crashed while processing.

ManBeardPc 823 days ago

Have you tried Typst? It's like a modern version of LaTeX and lets you generate nice-looking documents quickly. It can be called from the console and makes it easy to create templates and import resources like images, fonts and data (csv, json, toml, yaml, raw files, ...). Of course it is its own language instead of HTML/CSS, but so far I've found it quite pleasant to use.
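To make the "called from the console" part concrete, here is a rough Python sketch that writes a small Typst source file and compiles it with the `typst compile` command. It assumes the `typst` binary is installed and that a CSV file sits next to the generated source. The report title, the file names and the helper names (`render_source`, `compile_command`, `build_report`) are all made up for illustration.

```python
import subprocess
from pathlib import Path
from typing import List

# Hypothetical template: loads a CSV and lays it out as a table.
TEMPLATE = """\
#set page(paper: "a4")
= Nightly report

#let rows = csv("{csv_name}")
#table(columns: {columns}, ..rows.flatten())
"""


def render_source(csv_name: str, columns: int) -> str:
    """Fill in the Typst template for a CSV with the given column count."""
    return TEMPLATE.format(csv_name=csv_name, columns=columns)


def compile_command(source: Path, output: Path) -> List[str]:
    """Return the argv for compiling one Typst file to a PDF."""
    return ["typst", "compile", str(source), str(output)]


def build_report(workdir: Path, csv_name: str, columns: int) -> Path:
    """Write report.typ into workdir and compile it to report.pdf."""
    source = workdir / "report.typ"
    output = workdir / "report.pdf"
    source.write_text(render_source(csv_name, columns), encoding="utf-8")
    # check=True raises if the compiler reports an error.
    subprocess.run(compile_command(source, output), check=True)
    return output
```

The CSV path in the template is resolved relative to the `.typ` file, so the data file has to live in `workdir` for this to work.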

sciolistse 823 days ago

Back around 2002, at least, there were some products (ABCpdf is one I used a lot) that ran Internet Explorer on the server to generate PDFs from HTML. Worked pretty well from what I recall.

vanderZwan 823 days ago

I'm fairly certain that using a headless browser on the server is mainly about sandboxing all the security concerns that PDFs have, not aesthetics, but yes.

bawolff 823 days ago

Security of the PDF format is not relevant here. The headless browser outputs a PDF. It is not taking a user-controlled PDF as input.

vanderZwan 823 days ago

Ah of course, my apologies. I misread the original post.

menacingly 823 days ago

It's actually because layout-via-code for arbitrary documents is a humblingly complex problem, so leveraging existing layout engines is preferred.

This impressive effort looks far better than what I'd achieve, but when this approach has been tried before, it is eventually discovered that few organizations have the resources to maintain a rendering engine long-term.

chearon 823 days ago

I do think complexity could be part of why we don't have many options here, but I don't agree that a layout engine is too difficult to maintain. More of the issue is that CSS layout (and maybe layout in general) is not widely understood. I've almost _never_ come across people interested in layout, because generally it's a few properties to get something working and then you move on.

> few organizations have the resources to maintain a rendering engine long-term

I'm curious: are there other instances of this happening besides Edge switching to Blink? That event was one of my main motivators; it felt like further consolidation of obscure knowledge.

Sesse__ 823 days ago

Opera switched from Presto to Blink, too.

Very fun project! Did you ever consider integrating with web-platform-tests? It's shared between all the common browser vendors, and we're always interested in more contributors :-)

chearon 823 days ago

> Opera switched from Presto to Blink, too

True. But I wonder if there are more special-purpose engines similar to Prince that have been abandoned.

> Did you ever consider integrating with web-platform-tests?

I've run some of the WPT tests manually, but I don't yet have <style> support, and some of them use <script> I think? That's a path I'm wary of (eval()?) but I could have a special mode just for tests.

I did discover lots of weird corners that would be great to make some WPT tests for. Definitely something I want to do!

Sesse__ 823 days ago

Yes, a _lot_ of WPT tests depend on <script>. But there's also a bunch of ref-tests, where you just check that A and B match pixel for pixel (where B is typically written in the most obvious, dumb way possible). It lets you test advanced features in terms of simple ones. But yes, you'd need selector support in particular.
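The ref-test idea is simple enough to sketch in a few lines of Python. This is only an illustration of the pixel-for-pixel comparison described above, not how the WPT harness itself is implemented. It assumes both renderings have already been rasterized into rows of RGB tuples, and `first_mismatch` and `reftest_passes` are made-up names.

```python
from typing import Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]          # (red, green, blue)
Bitmap = Sequence[Sequence[Pixel]]    # rows of pixels, top to bottom


def first_mismatch(test: Bitmap, ref: Bitmap) -> Optional[Tuple[int, int]]:
    """Return (x, y) of the first differing pixel, or None if the bitmaps match."""
    if len(test) != len(ref):
        raise ValueError("bitmaps differ in height")
    for y, (test_row, ref_row) in enumerate(zip(test, ref)):
        if len(test_row) != len(ref_row):
            raise ValueError(f"row {y} differs in width")
        for x, (a, b) in enumerate(zip(test_row, ref_row)):
            if a != b:
                return (x, y)
    return None


def reftest_passes(test: Bitmap, ref: Bitmap) -> bool:
    """A ref-test passes only when every pixel is identical."""
    return first_mismatch(test, ref) is None
```

Here `test` would be the rendering that uses the advanced feature and `ref` the one written the obvious, dumb way. Returning the first mismatch coordinate makes a failure easier to track down than a bare pass/fail.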
