Hacker News new | ask | show | jobs
by hinkley 1578 days ago
In the more general sense, there is an almost universally overlooked property of compressed data that some of the data is unordered, and you are free to rearrange it any way you want in order to improve the compression ratio.

Years ago I worked on a J2ME (Java2 Mobile Edition) application that had no business being attempted given the very small archive files allowed. We did it anyway and it actually worked pretty well. We very quickly hit the max file size however, and every feature request meant first shrinking the existing code base to make space. First we had an intern fixing bugs in the code minifier we were using, especially around deleting unused (usually debug) methods. For some reason they rejected on archive size, not payload size, so while I started out doing 'honest' work with shrinking the binary, I had spent a lot of time in college noodling with compression algorithms so my eye was eventually drawn there.

Those were in the days when I could still read JVM assembly code, and shortly after I started thinking about the compression, I realized that the constant pool entries start with the type and then the size of the entry. So while our minifier made the reasonable assumption of sorting the constant pool by type and then alphabetically within it, because most of the constant pool was strings, and strings are variable length, it was hit or miss whether the header would be treated as a run or just Huffman encoded (the fallback). If I suffix sorted, then all but the last string in the pool would be followed immediately by the header for the next string, increasing the average run length.

This ended up knocking almost a kilobyte off of the archive size. Depending on your perspective that sounds like a little or a lot, but in our case each feature cost about 500 bytes, so that change pushed the cliff I was walking toward out almost a month (and slowing the growth rate), just by changing a sort algorithm.

I filed a ticket with Sun about this, but as it turns out they already had the dense archive format in flight, and within a couple months my observation was moot because the dense format can compress constant pools across and entire archive, not just a singe file. That was at least an order of magnitude better than what I had.

It's quite likely a lot of the files we use have similar problems in them. Off the top of my head, JSON compression probably would be much higher if we treated it as unordered, and did more aggressive minification particularly for JSON-at-rest. Sorting sibling keys by value instead of by name for instance.