Hacker News new | ask | show | jobs
by adamcanady 4228 days ago
Wow. I can make sense of a little bit of the original source, but I wonder more how RegPack [0] works to compress it into that final submission!

As a side note: I'm in college now and looking to propose an independent study on compression. Any suggested readings or algorithms I should look into?

[0] https://github.com/Siorki/RegPack

2 comments

For lossless compression, start out by reading up on run length encoding, which is about as simple as it gets. Then there's DEFLATE, which is what's behind gzip. You'll also want to read up on Huffman coding and arithmetic coding. One of the newer ideas in lossless compression is "context mixing", which is definitely worth reading up on. The Burrows–Wheeler transform is pretty fascinating, albeit slow, so you might want to read about that as well.

Lossy compression (which RegPack is actually an example of), typically involves lossless compression plus a preprocessing step to discard information so that the lossless compressor can do a better job.

The techniques for discarding information vary based on application. For images and audio, the most successful techniques tend to involve working in frequency space, so you'll want to read up on the FFT and related transforms.

For code compression, you have different constraints when discarding information because you have to produce a program with identical output to the original. Simple techniques involve removing comments and renaming variables to have shorter names. More advanced techniques involve more complex transformations of the AST, including language-specific tricks like wrapping javascript in a "with(Math){}" block.

Hope some of that helps.

Principal component analysis and then ignoring the least important components is the only compression technique I know. Hope it helps.
Well that is a lossy compression, can work nicely to some data (e.g. images), but totally irrelevant in the context of source code compression.
Lossy compression might be just fine for source code, as long as the lossy part is still functionally equivalent.

Think 1000.0 vs 1E3, printf("foo") vs puts("foo"), etc.

And in fact, RegPack is lossy compression (as are all minifiers). The transformation is totally irreversible.
Well, he didn't say he only wanted non-lossy compression or that he was working on source code compression.