| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by vitus 1971 days ago

So, from a quick read, I think there are a few things at play here that allow for "compression of random data".

One, and probably the biggest one, _this isn't lossless compression_. As other commenters mentioned, this aggregates groups of points into line segments and stores their slopes (allowing for a pre-specified error of up to epsilon).

Two, while the sample input data is randomly generated, it then needs to be sorted before it can be used here. This completely changes the distributional qualities (see: order statistics sampled from a uniform distribution [0]). Just as a toy example, suppose this was a million randomly-generated binary digits. Sure, you could store the million digits in sorted order, or you could just use run-length encoding and say "I have 499,968 zeroes and 500,032 ones" [1].

[0] https://en.wikipedia.org/wiki/Order_statistic#Order_statisti...

[1] I know, this is a dense sampling on the input space. But that's the sort of intuition that allows you to compress sorted data better than you'd be able to compress the unsorted data. The provided C++ code provides a sparse sampling.