HN Mirror



	by Nyandalized 2520 days ago
	With the help of transport compression like gzip, the total size can still be reduced by almost the same amount even if you don't minimize it.
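A quick way to check this claim on your own pages is to measure all four sizes. The sketch below uses a deliberately crude, hypothetical `naive_minify` helper (not a real minifier) and a synthetic sample document, so the numbers it prints are only illustrative.

```python
import gzip
import re


def naive_minify(html: str) -> str:
    """Hypothetical, crude minifier: drops whitespace between tags and
    collapses remaining whitespace runs. Not safe for <pre> or inline layout."""
    html = re.sub(r">\s+<", "><", html)
    return re.sub(r"\s+", " ", html).strip()


def sizes(html: str) -> dict:
    """Return byte counts before and after minification and gzip."""
    raw = html.encode("utf-8")
    small = naive_minify(html).encode("utf-8")
    return {
        "raw": len(raw),
        "minified": len(small),
        "raw_gzip": len(gzip.compress(raw, compresslevel=9)),
        "minified_gzip": len(gzip.compress(small, compresslevel=9)),
    }


# Synthetic, highly repetitive sample markup (an assumption for illustration).
sample = "<ul>\n" + "".join(
    f'    <li class="item">\n        Entry number {i}\n    </li>\n'
    for i in range(200)
) + "</ul>\n"

report = sizes(sample)
print(report)
```

On markup like this, the gap between `raw` and `minified` is much larger than the gap between `raw_gzip` and `minified_gzip`, which is the effect described above. Real pages will vary.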

3 comments

vardump 2520 days ago

I once wrote (maybe 15-20 years ago?) an HTML output processor that tried to make pages more compressible while still producing the exact same output. It removed comments, transformed all tag names to lower case, sorted tag attributes and canonicalized their values, and collapsed whitespace (including line feeds).

And some more tricks I've forgotten (some DOM tree tricks, I think), mainly to introduce more repeated strings for LZ and a more unbalanced symbol distribution (= fewer output bits) for Huffman. In other words, things that help gzip compress even further.

Output was really small; most pages were transformed from gzipped sizes of 10-15 kB to 2-5 kB without graphics.

The pages loaded fast, pretty much instantly, because they could fit in the TCP initial window, avoiding extra roundtrips. Browser sent request and server sent all HTML in the initial window even before the first ACK arrived! I might have tweaked initial window to 10 packets or something (= enough for 14 kB or so), I don't remember these TCP details by heart anymore.
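The "10 packets is about 14 kB" figure can be checked with a little arithmetic. This sketch assumes a 1460-byte MSS (a 1500-byte Ethernet MTU minus 20 bytes each of IP and TCP header) and an idealized slow start that doubles the window every round trip with no loss. It also ignores response headers and TLS overhead, so treat the results as rough upper bounds.

```python
import math

MSS = 1460  # assumed maximum segment size in bytes


def fits_in_initial_window(payload_bytes: int, initcwnd: int = 10, mss: int = MSS) -> bool:
    """True if the payload can be sent before the first ACK arrives."""
    return payload_bytes <= initcwnd * mss


def flights_needed(payload_bytes: int, initcwnd: int = 10, mss: int = MSS) -> int:
    """Number of send bursts under idealized slow start (window doubles per RTT)."""
    segments = math.ceil(payload_bytes / mss)
    cwnd, flights = initcwnd, 0
    while segments > 0:
        segments -= cwnd
        cwnd *= 2
        flights += 1
    return flights


for size_kb in (3, 5, 12, 15, 40):
    size = size_kb * 1000
    print(size_kb, "kB:", fits_in_initial_window(size), flights_needed(size))
```

With these assumptions, a 2-5 kB gzipped page fits in a single flight, while a 15 kB one needs a second round trip.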

I wonder if anyone else is still making this kind of HTML/CSS compressibility optimizer. Other than JavaScript minimizers, that is.

barryvan 2520 days ago

They are! Around five years ago I wrote a CSS minifier (creatively called CSSMin, available on GitHub, and still in use at the company I work for) which rewrote the CSS to optimise gzip compression. Although it never really took off, I think that some of the lessons from it have been rolled into some of the more modern CSS optimisation tools.

vardump 2520 days ago

It's important to understand that minimizing does not necessarily produce the most compressible result. You need to give LZ as many repeating strings as possible, while using as few different ASCII characters as possible, with as unbalanced a frequency distribution as possible.

hinkley 2520 days ago

I wrote (well, expanded) a similar tool for compressing Java Class files. I had a theory that suffix sorting would work slightly better because of the separators between fields, and it turned out to be worth another 1% final size versus prefix sorting.

vbezhenar 2520 days ago

I've found a cheap trick to compress Java software: extract every .jar file (those are zip archives) and compress the whole thing with a proper archiver (e.g. 7-zip). One example from my current project:

original jar files: 18 MB
expanded jar files: 37 MB
compressed with WinRar: 10 MB

And that's just a little project. For big projects there could be hundreds of megabytes of dependencies. Nobody really cares about that...
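The jar repacking trick can be approximated with the Python standard library alone. This is a minimal sketch, not the commenter's workflow: it swaps 7-zip/WinRAR for a solid `.tar.xz`, and `repack_jars` is a hypothetical helper name. The result is a transport archive only; it has to be unpacked and re-jarred before a JVM can use it.

```python
import io
import tarfile
import zipfile
from pathlib import Path


def repack_jars(jar_paths, out_path) -> int:
    """Write the uncompressed contents of several jars into one solid .tar.xz.

    Because all entries share a single LZMA stream, redundancy between class
    files in different jars can be exploited. Returns the archive size in bytes.
    """
    with tarfile.open(out_path, "w:xz") as tar:
        for jar in map(Path, jar_paths):
            with zipfile.ZipFile(jar) as zf:
                for info in zf.infolist():
                    if info.is_dir():
                        continue
                    data = zf.read(info)  # inflated bytes of this entry
                    member = tarfile.TarInfo(name=f"{jar.name}/{info.filename}")
                    member.size = len(data)
                    tar.addfile(member, io.BytesIO(data))
    return Path(out_path).stat().st_size
```

Usage would look like `repack_jars(Path("lib").glob("*.jar"), "lib.tar.xz")`, where the `lib` directory is an assumption. How much it saves depends on how much the jars have in common.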

Cthulhu_ 2520 days ago

It's a tradeoff; in a lot of cases, the size of a .jar doesn't really matter because it ends up on big web containers.

It does matter for e.g. Android apps though. But at the same time, the size of the eventual .jar is something that can be optimized by Google / the Android store as well, using what you just described for starters.

I know Apple's app store will optimize an app and its resources for the device that downloads it. As a developer you have to provide all image resources in three sizes / pixel densities for their classes of devices. They also support modular apps now, that download (and offload) resources on demand (e.g. level 2 and beyond of a game, have people get past level 1 first before downloading the rest).

hinkley 2520 days ago

There’s a lot of redundancy between class files in Java and zlib only has one feature for that and nobody uses it. It would require coordination that doesn’t really exist.

For transport, Sun built a dense archive format that can compress a whole tree of files at once. It normalizes the constant pool (a class file is nearly 50% constants).

Many Java applications run from the Jar file directly. You never decompress them. But you also only see something like 5:1 compression ratios.
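One zlib feature that targets redundancy across separately compressed files is the preset dictionary. Whether that is the feature meant above is an assumption. The sketch below uses Python's `zlib` bindings and made-up constant-pool-like strings to show the effect: both sides must agree on the dictionary in advance, which is exactly the coordination problem mentioned.

```python
import zlib

# Strings assumed to be common to many class files (illustrative only).
# zlib's documentation suggests putting the most common strings near the end.
SHARED = (
    b"LineNumberTable Code <init> ()V "
    b"java/util/ArrayList java/lang/String java/lang/Object "
)


def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress with zlib level 9, optionally primed with a preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict) if zdict else zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


def inflate(blob: bytes, zdict: bytes = b"") -> bytes:
    """Decompress; the same dictionary must be supplied if one was used."""
    decomp = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return decomp.decompress(blob) + decomp.flush()


one_file = b"com/example/Foo java/lang/Object <init> ()V Code LineNumberTable java/lang/String"
plain = deflate(one_file)
primed = deflate(one_file, SHARED)
print(len(one_file), len(plain), len(primed))
```

A small file gains the most, because on its own it contains too little repetition for LZ77 to find.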

Hitton 2520 days ago

That's extremely interesting. Would you happen to still have the code lying around? Or would you recommend some introductory materials on this topic?

vardump 2520 days ago

I might still have it on some hard disk that's been unplugged in storage for ages. But probably long since lost. I wrote it by trying out different things and seeing how it affected gzipped size.

Just use some HTML parser and prune HTML comment nodes and empty elements when safe (removing even an empty div, for example, is not!), collapse whitespace, etc. If the majority of text nodes are in lower case, ensure tags, attribute names, etc. are as well. Ensure all attribute values are written the same way, say attr=5, but not attr='5' or attr="5". Etc. That's all there is to it.
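As a rough illustration of the steps just listed, here is a minimal sketch built on Python's standard `html.parser`. It is not the original tool: the `Canonicalizer` class and `canonicalize` function are hypothetical names, and it always double-quotes attribute values rather than leaving them unquoted. It leaves `pre`, `textarea`, `script` and `style` content untouched, and it skips the "prune empty elements" step because that needs per-element judgment.

```python
import re
from html import escape
from html.parser import HTMLParser

PRESERVE = {"pre", "textarea", "script", "style"}  # whitespace matters here


class Canonicalizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.out = []
        self.preserve = 0

    def _attrs(self, attrs):
        # Sort by name and use one quoting style so repeated tags look identical.
        parts = []
        for name, value in sorted(attrs, key=lambda kv: kv[0]):
            if value is None:
                parts.append(f" {name}")
            else:
                parts.append(f' {name}="{escape(value, quote=True)}"')
        return "".join(parts)

    def handle_starttag(self, tag, attrs):  # the parser already lowercases names
        if tag in PRESERVE:
            self.preserve += 1
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}>")

    def handle_endtag(self, tag):
        if tag in PRESERVE:
            self.preserve = max(0, self.preserve - 1)
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data if self.preserve else re.sub(r"\s+", " ", data))

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_comment(self, data):
        pass  # comments are dropped (this also drops conditional comments)


def canonicalize(html: str) -> str:
    parser = Canonicalizer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

The way to tune something like this is the one described above: change one rule, gzip the result, and keep the rule only if the compressed size goes down.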

It saved a lot already as a result of whitespace collapsing, which also removes high-frequency chars like linefeeds, etc., leaving shorter Huffman codes for the data that actually matters.

Study how LZ77 and Huffman work.

okaleniuk 2520 days ago

Wow! That sounds fascinating.

masklinn 2520 days ago

If your page is static, it's even worth trying something like zopfli or advancecomp to maximise compression ratio in ways too expensive to do "online".

mr__y 2520 days ago

That's obviously true; however, a minimized version will require less memory and slightly fewer CPU cycles* to compress, and on the client side it requires slightly fewer resources as well.

I do realize how insignificant that difference would be.

* Then again, not much of a difference, since the DOM tree itself would consume orders of magnitude more memory.

hinkley 2520 days ago

Probably not less memory. zlib is based on a design that dates back to an era when you might only have 250-350 kilobytes (not a typo) of RAM to work with, and it was never really extended beyond that. It has a window it keeps in memory, and if your file is longer than that window, you hit peak memory and stay there. (You might actually hit that window immediately; I've forgotten how that part works, but some chunks of memory are pre-allocated.)

masklinn 2520 days ago

That's really DEFLATE; the sliding window of standard deflate is 32KB. Both compression and decompression have some overhead (compression more so, as you might want to have index tables and whatnot to make finding matches faster), but even with the worst possible intentions there's only so much overhead you can add.
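The overhead can be put into numbers with the memory estimate documented in zlib's `zconf.h`: deflate needs roughly `(1 << (windowBits + 2)) + (1 << (memLevel + 9))` bytes, and inflate needs roughly `1 << windowBits` bytes plus a few kilobytes of bookkeeping. The sketch below just evaluates that formula; the function names are made up for illustration.

```python
def deflate_memory(window_bits: int = 15, mem_level: int = 8) -> int:
    """Approximate deflate state size in bytes, per the zconf.h estimate."""
    return (1 << (window_bits + 2)) + (1 << (mem_level + 9))


def inflate_window(window_bits: int = 15) -> int:
    """Inflate's window in bytes (zconf.h adds a few kilobytes for small objects)."""
    return 1 << window_bits


print(deflate_memory() // 1024, "KiB to compress with default settings")
print(inflate_window() // 1024, "KiB window to decompress")
print(deflate_memory(window_bits=9, mem_level=1) // 1024, "KiB at the smallest settings")
```

With the defaults this gives 128 KiB for the window-related buffers plus 128 KiB for the `memLevel` buffers, 256 KiB in total. The compression level does not appear in the formula, and the size of the input does not either.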

hinkley 2520 days ago

Level 9 uses 128k, if memory serves.

We’re talking about HTTP here, and gzip is the only reliably available compressed transport encoding.

On the plus side, because it is so resource constrained, you have had it on your phone for ages, and might even see it on IoT devices.

masklinn 2519 days ago

> Level 9 uses 128k, if memory serves.

That's probably a misunderstanding or misremembering: the DEFLATE format can only encode distances of 32K (the proprietary DEFLATE64 allows 64K distances but not everything supports it).
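The 32K distance limit is easy to observe from Python's `zlib` bindings. The sketch below compresses a pseudo-random block followed by an exact copy of itself: once when the copy starts 20,000 bytes back (inside the window) and once when it starts 40,000 bytes back (outside it). The block sizes and seed are arbitrary choices for the demonstration.

```python
import random
import zlib


def repeated_block_size(block_len: int, seed: int = 0) -> int:
    """Compressed size of an incompressible block followed by an identical copy."""
    rng = random.Random(seed)
    block = bytes(rng.getrandbits(8) for _ in range(block_len))
    return len(zlib.compress(block + block, 9))


near = repeated_block_size(20_000)  # copy lies within the 32K window
far = repeated_block_size(40_000)   # copy lies beyond the 32K window

print("20,000-byte block twice:", near, "bytes")
print("40,000-byte block twice:", far, "bytes")
```

In the first case the second copy costs almost nothing, so the output stays close to the size of one block. In the second case level 9 cannot reference the earlier copy at all, and the output is about as large as the full 80,000-byte input.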