I once wrote (maybe 15-20 years ago?) an HTML output processor that tried to make pages more compressible while still rendering exactly the same. It did things like remove comments, transform all tag names to lower case, sort tag attributes and canonicalize their values, and collapse whitespace (including line feeds).
And some more tricks I've forgotten (some DOM tree tricks, I think), mainly to introduce more repeated strings for LZ and unbalanced distribution (=less output bits) for Huffman. In other words, things that help gzip to compress even further.
Output was really small, most pages were transformed from gzipped sizes of 10-15 kB to 2-5 kB without graphics.
The pages loaded fast, pretty much instantly, because they could fit in the TCP initial window, avoiding extra roundtrips. The browser sent its request and the server sent all the HTML in the initial window, even before the first ACK arrived! I might have tweaked the initial window to 10 packets or something (= enough for 14 kB or so); I don't remember these TCP details by heart anymore.
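As a rough check of that arithmetic, here is a small sketch. The 1460-byte segment size is an assumption (a 1500-byte Ethernet MTU minus 20-byte IPv4 and 20-byte TCP headers, no TCP options), and the function name is made up for illustration. Response headers and any TLS overhead count against the same budget.

```python
# Assumed payload per segment: 1500-byte MTU minus IPv4 and TCP headers, no options.
MSS = 1460


def fits_in_initial_window(response_bytes, segments=10, mss=MSS):
    """Return True if the whole response can be sent before the first ACK.

    `segments` is the initial congestion window in packets (10 in the
    comment above); `response_bytes` should include the HTTP headers.
    """
    return response_bytes <= segments * mss


print(10 * MSS)                         # bytes that fit in a 10-packet window
print(fits_in_initial_window(5_000))    # a 2-5 kB page fits easily
print(fits_in_initial_window(15_000))   # a 15 kB page needs another roundtrip
```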
I wonder if anyone else is making these kinds of HTML/CSS compressibility optimizers anymore. Other than JavaScript minimizers.
They are! Around five years ago I wrote a CSS minifier (creatively called CSSMin, available on GitHub, and still in use at the company I work for) which rewrote the CSS to optimise gzip compression. Although it never really took off, I think that some of the lessons from it have been rolled into some of the more modern CSS optimisation tools.
It's important to understand that minimizing does not necessarily produce the most compressible result. You need to give LZ as many repeating strings as possible, while using as few different ASCII characters as possible, with as unbalanced a frequency distribution as possible.
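A minimal sketch of that point, using made-up CSS rules: both stylesheets below are equally "minimized" (same byte length), but the one that always writes its declarations in the same order gives LZ longer repeats. The property list and function names are illustrative assumptions, not taken from any real tool.

```python
import random
import zlib

# Hypothetical declarations shared by every rule.
PROPS = ["color:red", "margin:0", "padding:0", "display:block", "font-weight:bold"]


def stylesheet(shuffle, rules=200, seed=0):
    """Build a minified stylesheet; optionally shuffle declaration order per rule."""
    rng = random.Random(seed)
    out = []
    for i in range(rules):
        decls = PROPS[:]
        if shuffle:
            rng.shuffle(decls)
        out.append(".c%d{%s}" % (i, ";".join(decls)))
    return "".join(out).encode("ascii")


def deflated_size(data):
    """Size in bytes after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


consistent = stylesheet(shuffle=False)
shuffled = stylesheet(shuffle=True)
print(len(consistent), len(shuffled))                      # identical raw sizes
print(deflated_size(consistent), deflated_size(shuffled))  # consistent order is smaller
```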
I wrote (well, expanded) a similar tool for compressing Java Class files. I had a theory that suffix sorting would work slightly better because of the separators between fields, and it turned out to be worth another 1% final size versus prefix sorting.
I've found a cheap trick to compress Java software: extract every .jar file (those are zip archives) and compress the whole thing with a proper archiver (e.g. 7-zip).
One example from my current project:
original jar files: 18 MB
expanded jar files: 37 MB
compressed with WinRAR: 10 MB
And that's just a little project. For big projects there could be hundreds of megabytes of dependencies. Nobody really cares about that...
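The trick above can be sketched with the standard library alone. This version uses a tar stream compressed with xz as a stand-in for 7-zip or WinRAR (an assumption, chosen only because it ships with Python); `solid_size` is a hypothetical helper name. It unpacks every jar in memory and compresses all the members as one stream, so redundancy between files and between jars becomes visible to the compressor.

```python
import io
import tarfile
import zipfile


def solid_size(jars):
    """Return the size of one xz-compressed tar holding every file of every jar.

    `jars` maps a jar name to the raw bytes of that .jar (a zip archive).
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:xz") as tar:
        for jar_name, raw in sorted(jars.items()):
            with zipfile.ZipFile(io.BytesIO(raw)) as zf:
                for info in zf.infolist():
                    if info.is_dir():
                        continue
                    data = zf.read(info)  # decompressed member contents
                    member = tarfile.TarInfo(name=jar_name + "/" + info.filename)
                    member.size = len(data)
                    tar.addfile(member, io.BytesIO(data))
    return len(buf.getvalue())
```

To try it on real files, read each jar with `open(path, "rb").read()`, pass the resulting mapping to `solid_size`, and compare against the sum of the original file sizes.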
It's a tradeoff; in a lot of cases, the size of a .jar doesn't really matter because it ends up on big web containers.
It does matter for e.g. Android apps though. But at the same time, the size of the eventual .jar is something that can be optimized by Google / the Android store as well, using what you just described for starters.
I know Apple's app store will optimize an app and its resources for the device that downloads it. As a developer you have to provide all image resources in three sizes / pixel densities for their classes of devices. They also support modular apps now, that download (and offload) resources on demand (e.g. level 2 and beyond of a game, have people get past level 1 first before downloading the rest).
There’s a lot of redundancy between class files in Java and zlib only has one feature for that and nobody uses it. It would require coordination that doesn’t really exist.
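The comment does not name the zlib feature. Presumably it is the preset dictionary, which lets compressor and decompressor agree in advance on a blob of likely strings. A minimal sketch under that assumption follows; the dictionary contents are invented examples of strings that class files tend to share, and the coordination problem is exactly that both sides must hold the same dictionary.

```python
import zlib

# Hypothetical shared dictionary: strings common to many constant pools.
SHARED = (b"java/lang/Object java/lang/String java/lang/StringBuilder "
          b"<init> ()V Code LineNumberTable SourceFile append toString")


def compress_with_dict(data, zdict=SHARED):
    """Deflate `data`, letting matches point into the preset dictionary."""
    comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(data) + comp.flush()


def decompress_with_dict(blob, zdict=SHARED):
    """Inverse of compress_with_dict; needs the identical dictionary."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


sample = (b"<init> ()V Code LineNumberTable java/lang/Object "
          b"java/lang/StringBuilder append toString SourceFile Foo.java")
print(len(sample), len(zlib.compress(sample, 9)), len(compress_with_dict(sample)))
```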
For transport, Sun built a dense archive format (Pack200) that can compress a whole tree of files at once. It normalizes the constant pool (a class file is nearly 50% constants).
Many Java applications run from the Jar file directly. You never decompress them. But you also only see something like 5:1 compression ratios.
I might still have it on some hard disk that's been unplugged in storage for ages. But probably long since lost. I wrote it by trying out different things and seeing how it affected gzipped size.
Just use some HTML parser and prune HTML comment nodes and empty elements when safe (removing even an empty div, for example, is not!), collapse whitespace, etc. If the majority of text nodes are in lower case, make sure tags, attribute names, etc. are as well. Make sure all attribute values are written the same way, say attr=5, but not attr='5' or attr="5". Etc. That's all there is to it.
It saved a lot already as a result of whitespace collapsing, which also removes high-frequency chars like linefeeds, etc., leaving shorter Huffman codes for the data that actually matters.
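A rough sketch of such a pass, using Python's built-in `html.parser`. This is not the original tool: the class and function names are made up, it always double-quotes attribute values instead of leaving them unquoted (consistency is what matters), it does not prune empty elements, and it assumes whitespace only matters inside pre, textarea, script and style. CSS such as `white-space: pre` would break that assumption.

```python
import html
import re
from html.parser import HTMLParser

RAW_TEXT = {"script", "style"}               # content reaches handle_data unescaped
KEEP_SPACE = RAW_TEXT | {"pre", "textarea"}  # whitespace is significant here
SPACE_RUN = re.compile(r"[ \t\r\n\f]+")      # ASCII whitespace only, so U+00A0 survives


class Normalizer(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_sensitive = []  # stack of open whitespace-sensitive elements

    def _start(self, tag, attrs):
        # The parser already lower-cases tag and attribute names.
        parts = [tag]
        for name, value in sorted(attrs, key=lambda pair: pair[0]):
            if value is None:
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, html.escape(value, quote=True)))
        self.out.append("<%s>" % " ".join(parts))

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs)
        if tag in KEEP_SPACE:
            self.open_sensitive.append(tag)

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs)  # "<br/>" becomes "<br>"

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)
        if self.open_sensitive and self.open_sensitive[-1] == tag:
            self.open_sensitive.pop()

    def handle_data(self, data):
        if self.open_sensitive:
            raw = self.open_sensitive[-1] in RAW_TEXT
            self.out.append(data if raw else html.escape(data, quote=False))
        else:
            self.out.append(html.escape(SPACE_RUN.sub(" ", data), quote=False))

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    # handle_comment is deliberately not overridden, so comments are dropped.


def normalize(markup):
    parser = Normalizer()
    parser.feed(markup)
    parser.close()  # flush any trailing text
    return "".join(parser.out)
```

To evaluate it the way described above, compare `len(gzip.compress(page.encode()))` before and after `normalize(page)` on real pages, and keep only the transformations that actually shrink the gzipped size.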
If your page is static, it's even worth trying something like zopfli or advancecomp to maximise compression ratio in ways too expensive to do "online".
That's obviously true; however, a minimized version will require less memory and slightly fewer CPU cycles* to compress, and on the client side it requires slightly fewer resources as well.
I do realize how insignificant that difference would be.
* Then again, not much of a difference, since the DOM tree itself would consume orders of magnitude more memory.
Probably not less memory. zlib is based on a design that dates back to an era where you might only have 250-350 kilobytes (not a typo) of RAM to work with, and it was never really extended beyond that. It has a window it keeps in memory and if your file is longer than that window, you hit peak memory and stay there (you might actually hit that window immediately. I've forgotten how that part works, but some chunks of memory are pre-allocated).
That's really DEFLATE; the sliding window of standard DEFLATE is 32 KB. Both compression and decompression have some overhead (compression more so, as you might want to have index tables and whatnot to make finding matches faster), but even with the worst possible intention there's only so much overhead you can add.
That's probably a misunderstanding or misremembering: the DEFLATE format can only encode distances of 32K (the proprietary DEFLATE64 allows 64K distances but not everything supports it).
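The 32K distance limit is easy to observe with zlib. The sketch below (function name and sizes are arbitrary choices) measures how many extra compressed bytes a second copy of a random block costs, depending on how far behind the first copy sits. Random data is used so that nothing else in the input is compressible.

```python
import random
import zlib


def repeat_cost(gap, block_len=8192, seed=0):
    """Extra compressed bytes needed to append a second copy of a block.

    The second copy starts `block_len + gap` bytes after the first, so it
    is only reachable by an LZ77 match if that distance is within 32 KB.
    """
    rng = random.Random(seed)
    block = bytes(rng.randrange(256) for _ in range(block_len))
    filler = bytes(rng.randrange(256) for _ in range(gap))
    without = len(zlib.compress(block + filler, 9))
    with_repeat = len(zlib.compress(block + filler + block, 9))
    return with_repeat - without


print(repeat_cost(1024))    # distance 9 KB: the copy is nearly free
print(repeat_cost(40960))   # distance 48 KB: the copy costs about its full size
```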