Hacker News new | ask | show | jobs
by peter-ebert 720 days ago
I appreciate the input. I did not mean to imply I could encode a random stream of symbols/characters, that is absolutely valid. I was approaching this as a compression technique for something like a text file, where the symbol counts are known at compression time.
1 comments

First issue is that if you do comparisons you have to do them apples-to-apples, you can't assume that method A has that information available for free but method B does not or that method A assumes a given source model but method B assumes a different source model.

Second, I really don't understand how you intend to use a table of symbol counts: If you do it over the entire file the table might be a reasonable size but the number of permutations becomes infeasible. Conversely if you do it in small windows (like 8 or so in your examples) you have to store a separate symbol count table for each window which would explode the symbol count table. I really doubt you are gaining anything from doing this. You are going to create an enormous per-file symbol frequency table and then not count it against the compressed size, that isn't compression it's just misdirection.

When I get time may compress the table as well then, in looking at example arithmetic encoders they seem to have some flat table implementations as well (https://github.com/nayuki/Reference-arithmetic-coding/blob/m...). I don't see how that's misdirection, I left the table uncompressed specifically for transparency reasons. Was just trying to keep that part simple.