Hacker News new | ask | show | jobs
by 7373737373 2 days ago
What does it compress the full 1GB file to? http://prize.hutter1.net/
3 comments

I tried it on a enwik9 100 mb slice and was able to compress it to 20 mb + 900kb transformer so 21mb.

I know the top submission was able to get it to 13 mb.

Still trying some ideas to get better compression.

Since you know the size of the file beforehand you may be able to overfit some kind of text diffusion model instead of a transformer? May allow you to partially correct the model output using some other method and then fill in the blanks that were wrong from previous generations.
Oh, sounds interesting. I hadn't considered using a diffusion model for this. My current approach generates the document byte by byte with an autoregressive transformer, so I'm curious how a diffusion model would improve memorization or reconstruction quality.

Can you point me to something that i can read? I really wanna try this approach , diffusion model does sounds interesting for compression.

Thanks for the link!
Maybe everyone should compress the 1st 100MB worth of digits of pi, for an apples-to-apples comparison?

Edit: oh wait that's too easy. Need to generate /publish random digits so everyone can use it.

random digits aren't compressible though?
Random digits are compressible though.

Random data does not mean it does not match a pattern in your dictionary for example.

No.. they're not. Do you understand random (the apparent or actual lack of definite patterns or predictability[0]) or compression (reduces bits by identifying and eliminating statistical redundancy[1])?

[0]: https://en.wikipedia.org/wiki/Randomness

[1]: https://en.wikipedia.org/wiki/Data_compression

Over infinite runs, you can't compress random data, but that doesn't mean any finite string of random digits is incompressible
by this definition, a random dataset could apparently present no patterns, while presenting non apparent patterns.
Sounds like presenting no patterns, apparently or otherwise, would be a pattern in itself.
Compressor: Output an empty file.

Decompressor: Take any old algorithm for finding digets of pi, find first 100M of them, print them.

Compression ratio of 0! :0