Hacker News new | ask | show | jobs
by mota7 1492 days ago
Yes. The formulation in the blog post is a bit messy, but possibly more general way to think about it is something like:

0. set y = 1. set x = 0.

1. How many possible choices are there _given what we've picked so far_? Call it N.

2. Pick one of them. Call it 'a'.

3. set x = x + a * y. set y = y * N.

4. Repeat steps 1-3 until the number of choices drops to be less than 2.

Decoding is the same thing in reverse. Start with x. mod it with the first N. that's the first choice. divide by N, mod with the number of remaining choices. that's the second choice. etc.

For the case of n choose k, we get the rule that we can't pick anything lower than what has been picked so far. This means that first choice is one of (n-k+1) items. (We can't pick higher than that as we'd run out of choices before we reached k).

With this construction, we can see your case is easy. You just need to be able to signal that some choice was the last choice. So the each choice is from a set of N+1, being the N possibilities plus "picked nothing". Picking nothing means there's no remaining choices so it terminates.

2 comments

This is an easy formulation but not necessarily optimal. In particular the step 1 is not well defined, and any further definition results in a non-optimal encoding.

Say we have n = 6 and k = 3 and we've already seen 4; what are possible choices at this point? If you allow 3 here it would be equivalent to picking 3 first then 4 and there would be a redundant encoding. You can "fix" this by fixing the order, so without a loss of generality let's assume a strictly increasing order. Then the worst case would be (1, 2, 3), where N is respectively 6, 5 and 4, resulting in log_2 120 = 6.90 bits of information entropy instead of the optimal log_2 24 = 4.32 bits.

The real fix would involve counting each subcases. There are 10 strictly increasing sequences starting with 1, 6 starting with 2, 3 starting with 3 and one starting with 4. The step 1 should account for this non-uniform distribution at the first place. The combinatorial number system described in the OP does this automatically.

For the first case, you can't pick a number lower than any existing number. So if 4 is already picked, then only 5 or 6 can be picked, but if we pick 6, then there's no third number available to pick, so there only 1 possibility (picking 5). So it terminates.

The worst case of (1,2,3) is respectively 4, 3 and 2, resulting in log_2 24 = 4.32 bits.

The first choice isn't picking from 6, it's picking from 4. (because if 5 or 6 are chosen, then there's not enough following numbers to be able to choose 2 more numbers). That's the piece about '(n-k+1) choices for the first number'. 5 or 6 are not possible choices given the ordering requirement.

> The worst case of (1,2,3) is respectively 4, 3 and 2, resulting in log_2 24 = 4.32 bits. The first choice isn't picking from 6, it's picking from 4.

Oh, you are correct that the first number has only 4 possible choices. But what about the second number? It can't be 1 (already picked) and 6 (ordering requirement), but other numbers are open to pick. The third number is no different, it can't be 1 and 2 (already picked) but otherwise all other numbers are allowed. The resulting N would be 4, 4 and 4, i.e. 6 bits.

Maybe it helps to see the fuller picture:

  1st  2nd  3rd         2^(expected # of bits)
  
  1    2    3, 4, 5, 6  64
  1    3    4, 5, 6     48
  1    4    5, 6        32
  1    5    6           16
  
  2    3    4, 5, 6     36
  2    4    5, 6        24
  2    5    6           12
  
  3    4    5, 6        16
  3    5    6           8
  
  4    5    6           4
Ideally (and necessarily) the last column should be 24 for all possible cases, but your algorithm can go anywhere from 4 (2 bits) to 64 (6 bits).
Nuts, you are correct that it isn't optimal.

It's not quite as simple as just 2 bits per digit, because the last digit can only be large if the earlier digits are small.

This makes the worst case is (1,2,6) which encodes as 48, for 5.58 bits.

Great! Very clear. This should be the TL;DR... or the dense formulation of the dense encoding ;-)