Hacker News new | ask | show | jobs
by recon517 4986 days ago
Store each number as a delta from a previous one, first number as a delta from 0. Each number starts with a code describing its length: 0 (duplicate), 7, 9 or 27 bits. For storing a code use values '1', '01', '001' and '0001'. Calculate statistics for different code types and when space is tight replace most common codes with shortest ones.
1 comments

Bit late to this as I've been ill.

If you try that encoding scheme with some random data you'll find it won't always fit.

I took 1,000,000 random numbers between 0 and 99,999,999 and sorted them and calculated the deltas. Here's the distribution of the sizes of the deltas:-

   diffbits: 1 nos 5009
   diffbits: 2 nos 19937
   diffbits: 3 nos 38275
   diffbits: 4 nos 72319
   diffbits: 5 nos 128146
   diffbits: 6 nos 202000
   diffbits: 7 nos 252698
   diffbits: 8 nos 202681
   diffbits: 9 nos 72750
   diffbits: 10 nos 6148
   diffbits: 11 nos 37
   totbits = (1*5009)+(2*19937)+(3*38275)... = 6,408,685
That's the total number of bits required to store just the deltas, but doesn't include the length encoding.

810241024 - 6408685 = 1979923 spare bits.

Each of the 999,999 deltas will take at least one bit to determine the size of the delta, leaving you with 979,924 bits for extra length encoding and wastage (if you pack multiple lengths under one encoding).

There's no way I can see (after trying lots of permutations) to be able to encode all of those bunches of diffs using those few remaining bits.

What's even more difficult is that you don't have the space to 'calculate statistics for different code types' because you can't fit anywhere near all of the deltas into that 1MB of memory without having them encoded perfectly anyway. Calculating them on partial data is all that you can do, and that's not going to be accurate enough because you've no idea what the unseen data has to hold (it could all be duplicates or 11 bit differences...).

I'm keeping going on looking at this (and trying not to look at any of the posted solutions).

You are right, my solution is not really working. But I still believe that there might be a working solution of encoding deltas.

Thank you for your reply!