Hacker News new | ask | show | jobs
by peter-ebert 719 days ago
This arithmetic compression example simply stores the 4 byte values of the frequencies (I used 8 bytes) in a flat table https://github.com/nayuki/Reference-arithmetic-coding/blob/m...

For input3 they used 20 additional bytes while I used 12. Though I know they said their implementation isn't perfect.

2 comments

That isn't updating the counts after every symbol. Why don't you write one that does so you can compare them fairly?
Sure thing! Does this meet your requirements? https://github.com/nayuki/Reference-arithmetic-coding/blob/m...

If not could you link one that does? If I implement my own I expect someone to say I did it wrong.

Sorry, I (and others) have explained several times what an equivalent arithmetic coding scheme would have to do to make the same assumptions (and have the same limitations) as your scheme.

It would have to assume there's _exactly_ N ones and N zeros in the bit string. Start with those counts:

remaining ones = N;

remaining zeros = N;

It would encode bit for bit. After every zero, it decrements the remaining number of zeros. After every one, it decrements the remaining number of ones. The probability of the next bit being zero or one is derived from the known remaining number of each, P(one) = (remaining ones / (remaining ones + remaining zeros)). When one of the counts reaches zero, the probability of the other becomes 100% and no more information need to be encoded.

Does the linked code do that? No.

I don't have a ready-made example because this is not an assumption about the input that is useful in the majority of cases. The linked example can encode _any sequence_, not just ones that have an equal number of ones and zeros.

If you spent time implementing this scheme you would learn a lot about this field.

Thanks I'll work on that. One thing though > The linked example can encode _any sequence_, not just ones that have an equal number of ones and zeros.

The compression I made can also encode any sequence of bytes, see the testinput folder (caveats in the FAQ are because of the implementation specifics not the algorithm), it just also has to store the freq table which I'm working on compressing too. It has no requirement for equal number of 1s and 0s, that was just an example in the description that I guess did more bad than good.

I mean without storing the freq table, which the last example you linked doesn't. Having to store a frequency table limits the encoder to do no better than the 0-th order entropy. Arithmetic coding can do arbitrarily complex prediction per symbol, see e.g. http://mattmahoney.net/dc/text.html#1072 which reaches the world record by training a neural network to predict the next symbol while it's encoding a file.

Not to say a 0-th order entropy coder could not be useful for certain applications if it was e.g. faster than, say, a range coder while maintaining similar ratios. But I doubt that is possible with this method.

I've added basic bit packing to the frequencies table, so long as the dictionary doesn't dominate like in a pangram it's slightly smaller than two arithmetic (static and adaptive) examples I could find online. I understand this isn't the best that arithmetic can do, next I'll look into that.

Never was trying to compare to NNs or PPM (or even simple LZ), since all this algorithm does is look at the frequency counts.

Agree it's not very practical and very expensive calculations, just fun to work on.

*I mean input2, sorry long day.