
by tromp 756 days ago

> To make it unambiguous we must make sure that no code word is a prefix of another code word.

Technically, this is not quite correct. The class of so-called uniquely decodable codes is unambiguous, and a superset of the prefix codes. One simple example of a uniquely decodable code is the reverse of a prefix code. For the example in the article that would be

    a 1
    b 00
    c 10

While the code for a is a prefix of the code of c, one can still unambiguously decode any code sequence by processing it in reverse order. It would be interesting to see a uniquely decodable code that is neither a prefix code nor one in reverse.
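To illustrate the reverse-order decoding described above, here is a minimal Python sketch. The function names `encode` and `decode_suffix_code` are made up for this example; the only thing taken from the comment is the code table itself.

```python
# Reversed prefix code from the comment above: a=1, b=00, c=10.
CODE = {"a": "1", "b": "00", "c": "10"}


def encode(text, code=CODE):
    """Concatenate the codeword of each symbol."""
    return "".join(code[ch] for ch in text)


def decode_suffix_code(bits, code=CODE):
    """Decode by scanning right to left.

    Read backward, the codewords become 1, 00, 01, which is a prefix
    code, so a greedy match is unambiguous in that direction.
    """
    reversed_words = {word[::-1]: sym for sym, word in code.items()}
    out, buf = [], ""
    for bit in reversed(bits):
        buf += bit
        if buf in reversed_words:
            out.append(reversed_words[buf])
            buf = ""
    if buf:
        raise ValueError("input is not a valid code sequence")
    # Symbols were recovered last to first, so flip them back.
    return "".join(reversed(out))


print(decode_suffix_code(encode("cab")))  # cab
```

The cost is that decoding cannot start until the end of the message is known.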

4 comments

a1369209993 756 days ago

> It would be interesting to see a [not gratuitously inefficient] uniquely decodable code that is neither a prefix code nor one in reverse.

This can be done by composing a prefix code with a suffix code:

    A   0
    B  01
    C  11
  a A  0
  b BA 010
  c BB 0101
  d BC 0111
  e C  11
  {a=0,b=010,c=0101,d=0111,e=11}

This is trivially uniquely decodable by uniquely decoding 0->A/etc backward, then uniquely decoding A->a/etc forward. It's equivalent in lengths to the optimal prefix code {a=0,b=110,c=1110,d=1111,e=10}, so it is one of several optimal codes for the same probability distributions.
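A rough Python sketch of that two-pass decoding, assuming the tables above (the helper names `greedy_decode`, `encode` and `decode` are invented for illustration):

```python
# Inner code (bits -> A/B/C) is suffix-free; outer code (A/B/C -> a..e)
# is prefix-free. Both tables are copied from the comment above.
INNER = {"A": "0", "B": "01", "C": "11"}
OUTER = {"a": "A", "b": "BA", "c": "BB", "d": "BC", "e": "C"}


def greedy_decode(seq, table):
    """Decode seq left to right, assuming table is prefix-free."""
    lookup = {word: sym for sym, word in table.items()}
    out, buf = [], ""
    for ch in seq:
        buf += ch
        if buf in lookup:
            out.append(lookup[buf])
            buf = ""
    if buf:
        raise ValueError("trailing symbols do not form a codeword")
    return "".join(out)


def encode(text):
    """Map each symbol to A/B/C letters, then each letter to bits."""
    return "".join(INNER[x] for ch in text for x in OUTER[ch])


def decode(bits):
    # Pass 1: bits -> A/B/C, scanning backward. Reversing both the
    # input and the codewords turns the suffix code into a prefix code.
    inner_reversed = {sym: word[::-1] for sym, word in INNER.items()}
    letters = greedy_decode(bits[::-1], inner_reversed)[::-1]
    # Pass 2: A/B/C -> a..e, scanning forward.
    return greedy_decode(letters, OUTER)


print(decode("010111110"))   # ceea
print(decode("0101111110"))  # beeea
```

The two sample inputs differ by a single trailing-side bit yet change the first decoded symbol, which is the lookahead problem in miniature.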

And it's neither prefix nor suffix itself, since a=0 and b=010. In fact, it can't in general be decoded incrementally at all, in either direction, since "cee...ee?" vs "bee...ee?" and "?cc...cca" vs "?cc...ccb" both depend on unbounded lookahead to distinguish a single symbol.

I'm not sure the optimality holds for any composition of an in-isolation-optimal prefix code with an in-isolation-optimal suffix code, but it did work for the most trivial cases (other than the degenerate 1-to-1 code) I could come up with.

Nicely done; thanks.

> It would be interesting to see a uniquely decodable code that is neither a prefix code nor one in reverse.

More interesting than I thought. First the adversarial answer; sure (edit: ah, I see someone else posted exactly the same!):

    a 101
    b 1

But it's a bad code, because we'd always be better with a=1 and b=0.

The Kraft inequality gives the sets of code lengths that can be made uniquely decodable, and any of those sets can be achieved with a prefix code (Huffman coding finds an optimal one). So there's never a reason to use a non-prefix code (assuming we are doing symbol coding, and not swapping to something else like ANS or arithmetic coding).
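As a sketch of the Kraft argument: the sum of 2^-length over all codewords must not exceed 1, and any set of lengths that passes can be turned into a prefix code. The functions `kraft_sum` and `prefix_code_from_lengths` below are illustrative names, and the construction shown is the usual canonical one, not something taken from the thread. Lengths are assumed to be positive integers.

```python
from fractions import Fraction


def kraft_sum(lengths):
    """Sum of 2**-n over the codeword lengths, as an exact fraction."""
    return sum(Fraction(1, 2 ** n) for n in lengths)


def prefix_code_from_lengths(lengths):
    """Build a binary prefix code with the given (positive) lengths.

    Canonical construction: hand out codewords in order of increasing
    length, counting upward and shifting left when the length grows.
    """
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate the Kraft inequality")
    words, value, prev = [], 0, 0
    for n in sorted(lengths):
        value <<= n - prev
        words.append(format(value, "0{}b".format(n)))
        value += 1
        prev = n
    return words


print(kraft_sum([3, 1]))                          # 5/8 for {a=101, b=1}
print(kraft_sum([1, 1]))                          # 1 for {a=1, b=0}
print(prefix_code_from_lengths([1, 3, 4, 4, 2]))
# ['0', '10', '110', '1110', '1111']
```

A sum strictly below 1, as for {a=101, b=1}, means some codeword could be shortened without losing unique decodability.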

But hmmmm, I don't know if there exists a uniquely-decodable code with the same set of lengths as an optimal Huffman code that is neither a prefix code nor one in reverse (a suffix code).

If I was going to spend time on it, I'd look at https://en.wikipedia.org/wiki/Sardinas-Patterson_algorithm -- either to brute force a counter-example, or to see if a proof is inspired by how it works.
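For anyone who wants to try the brute-force route, here is a small sketch of the Sardinas-Patterson test in Python. It follows the textbook description (track "dangling suffixes" and fail if one of them is itself a codeword); the names `dangling` and `is_uniquely_decodable` are made up here, and the worklist formulation is one of several equivalent ways to write it.

```python
def dangling(left, right):
    """All w with u + w == v for some u in left and v in right."""
    return {v[len(u):] for u in left for v in right if v.startswith(u)}


def is_uniquely_decodable(codewords):
    """Sardinas-Patterson test for a finite code given as strings."""
    words = list(codewords)
    code = set(words)
    if "" in code or len(code) != len(words):
        return False  # empty or duplicate codewords are ambiguous
    # First round: suffixes left over when one codeword prefixes another.
    pending = dangling(code, code) - {""}
    seen = set()
    while pending:
        if pending & code:
            return False  # a dangling suffix is a codeword: ambiguity
        seen |= pending
        step = dangling(code, pending) | dangling(pending, code)
        pending = step - seen  # only suffixes not examined before
    return True


print(is_uniquely_decodable(["1", "00", "10"]))                     # True
print(is_uniquely_decodable(["0", "010", "0101", "0111", "11"]))    # True
print(is_uniquely_decodable(["0", "01", "10"]))                     # False
```

The loop terminates because every dangling suffix is a suffix of some codeword, so only finitely many can ever appear.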


n4r9 756 days ago

It's a weird example, but what about

  a 1
  b 101

It is neither prefix-free nor suffix-free. Yet every occurrence of 0 corresponds to an occurrence of b.

However, this is obviously inefficient. So I guess the question is whether there's an optimal code which is neither prefix-free nor suffix-free.

--------------

EDIT

I did some googling and found this webpage https://blog.plover.com/CS/udcodes.html where the author gives the following example of a uniquely decodable code:

I guess this is "almost" prefix-free since the only prefix is c of d. If a message starts with 1, you could find the first 0 and then look at whether there's an odd or even number of 1's. So I think I can see how it's uniquely decodable. However, my crypto knowledge is too rusty to remember how to show whether this is an optimal code for some probability distribution.


imurray 756 days ago

That code in the EDIT is suboptimal. It doesn't saturate the Kraft inequality. You could make every codeword two bits and still encode 4 symbols, so that would be strictly better.


n4r9 756 days ago

Ah of course. Thanks for the insight. About 15 years since I studied this stuff!


lazamar 756 days ago

That’s interesting. I guess this is not usually used because you may have a long string of bits that is ambiguous till you get to a disambiguating bit.

Something like

`100000000000000001`

In this case, to know whether the first code was an `a` or a `c`, you have to read all the way to where the zeroes end.
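A quick way to see this, assuming the comment refers to the code a=1, b=00, c=10 from the top of the thread: enumerate every way to split a bit string into codewords. The helper `parses` is a hypothetical name, and the backtracking search is only meant for short inputs.

```python
# Assumed to be the code under discussion: a=1, b=00, c=10.
CODE = {"a": "1", "b": "00", "c": "10"}


def parses(bits, code=CODE):
    """Return every way to split bits into codewords, as symbol strings."""
    results = []

    def walk(pos, acc):
        if pos == len(bits):
            results.append("".join(acc))
            return
        for sym, word in code.items():
            if bits.startswith(word, pos):
                walk(pos + len(word), acc + [sym])

    walk(0, [])
    return results


print(parses("1" + "0" * 16 + "1"))  # ['abbbbbbbba']
print(parses("1" + "0" * 15 + "1"))  # ['cbbbbbbba']
```

Each string has exactly one parse, but the first symbol flips between `a` and `c` depending on whether the run of zeroes has even or odd length, so a left-to-right decoder has to buffer the whole run.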
