| HN Mirror


by Ari_Rahikkala 1269 days ago

> Models like ChatGPT aren’t eligible for the Hutter Prize for a variety of reasons, one of which is that they don’t reconstruct the original text precisely—i.e., they don’t perform lossless compression.

Small nit: The lossiness is not a problem at all. Entropy coding turns an imperfect, lossy predictor into a lossless data compressor, and the better the predictor, the better the compression ratio. All Hutter Prize contestants anywhere near the top use it. The connection at a mathematical level is direct and straightforward enough that "bits per byte" is a common number used in benchmarking language models, despite the fact that they are generally not intended to be used for data compression.
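To make the claim above concrete, here is a minimal sketch of the idea, not the code of any actual Hutter Prize entry. It pairs a deliberately weak predictor (a hypothetical `AdaptiveByteModel` that just counts bytes seen so far) with an arithmetic coder that uses exact rational arithmetic. Every name in it is made up for illustration. The predictor is never "right" about the next byte, yet the round trip is exact, and the output size is roughly the sum of -log2(p) over the probabilities the predictor assigned to the bytes that actually occurred.

```python
import math
from fractions import Fraction


class AdaptiveByteModel:
    """Hypothetical toy predictor: order-0 byte counts with add-one smoothing."""

    def __init__(self):
        self.counts = [1] * 256
        self.total = 256

    def span(self, symbol):
        """Cumulative count range [lo, hi) assigned to `symbol`."""
        lo = sum(self.counts[:symbol])
        return lo, lo + self.counts[symbol]

    def symbol_at(self, target):
        """Symbol whose cumulative range contains the integer `target`."""
        acc = 0
        for symbol, count in enumerate(self.counts):
            acc += count
            if target < acc:
                return symbol
        raise ValueError("target outside cumulative range")

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def encode(data):
    """Return (n, k): the k-bit integer n identifies `data`, given its length."""
    model = AdaptiveByteModel()
    low, width = Fraction(0), Fraction(1)
    for byte in data:
        lo, hi = model.span(byte)
        # Narrow the interval in proportion to the predicted probability.
        low += width * Fraction(lo, model.total)
        width *= Fraction(hi - lo, model.total)
        model.update(byte)
    # Smallest k with 2**-k <= width, so some k-bit fraction lands inside.
    k = 0
    while Fraction(1, 2 ** k) > width:
        k += 1
    n = math.ceil(low * 2 ** k)
    return n, k


def decode(n, k, length):
    """Replay the same predictions to recover the original bytes exactly."""
    model = AdaptiveByteModel()
    value = Fraction(n, 2 ** k)
    low, width = Fraction(0), Fraction(1)
    out = bytearray()
    for _ in range(length):
        target = math.floor((value - low) / width * model.total)
        byte = model.symbol_at(target)
        lo, hi = model.span(byte)
        low += width * Fraction(lo, model.total)
        width *= Fraction(hi - lo, model.total)
        model.update(byte)
        out.append(byte)
    return bytes(out)


data = b"abracadabra" * 20
n, k = encode(data)
assert decode(n, k, len(data)) == data  # lossless despite an imperfect predictor
print(f"{len(data) * 8} raw bits -> {k} coded bits ({k / len(data):.2f} bits per byte)")
```

Swapping in a better predictor (one that assigns higher probability to the bytes that actually come next) makes the interval shrink more slowly and `k` smaller, which is the sense in which "bits per byte" measures model quality. Real compressors use fixed-precision integer arithmetic rather than `Fraction`; exact rationals are used here only to keep the sketch short and obviously correct.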

The practical reason why a ChatGPT-based system won't be competing for the Hutter Prize is simply that it's a contest about compressing a 1GB file, and GPT-3's weights are both proprietary and take up hundreds of times more space than that.
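A back-of-the-envelope check of "hundreds of times", as a rough sketch. The 175 billion parameter count is the publicly reported figure for GPT-3; the 2 bytes per parameter is my assumption (16-bit weights), not something stated above.

```python
PARAMS = 175_000_000_000   # GPT-3 parameter count, as publicly reported
BYTES_PER_PARAM = 2        # assumption: 16-bit weights
ENWIK9_BYTES = 10 ** 9     # the 1 GB file being compressed

weights_bytes = PARAMS * BYTES_PER_PARAM
ratio = weights_bytes / ENWIK9_BYTES
print(f"weights ~ {weights_bytes / 1e9:.0f} GB, about {ratio:.0f}x the input file")
```

As I understand the rules, the size of the decompressor counts toward the result, so weights this large would dwarf any savings on the compressed file itself.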

1 comments

hnfong 1269 days ago

Fabrice Bellard has a project that does precisely this, and apparently does it extremely well. Previously on HN: https://news.ycombinator.com/item?id=27244004

Apparently it leads the enwik9 compression ranking ( http://www.mattmahoney.net/dc/text.html ). I'm not sure why it isn't eligible for the Hutter Prize; there is some speculation in the previous discussion, but I don't know whether it's true.


Der_Einzige 1269 days ago

Thank you! Turns out that GPT does in fact perform lossless compression if you want it to, like in this demo.


hnfong 1268 days ago

The main issue is that most ML frameworks aren't reliably reproducible, and are not designed for such use cases.

Bellard's solution was to code up his own neural network library in C.
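Why reproducibility matters here, as I read it: the decoder has to recompute exactly the same probabilities the encoder used, bit for bit, or the decoded stream diverges. A tiny illustration of how easily floating-point results can differ is that addition is not associative, so merely changing the order of a sum (which parallel hardware or a different library build may do) can change the result. This is a generic sketch, not a claim about any particular framework.

```python
# Floating-point addition is not associative: summation order changes the result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)

print(a == b)  # False
print(a, b)
```

A difference in the last bit is harmless for most ML uses, but an entropy decoder driven by slightly different probabilities than the encoder will eventually pick a wrong symbol, and everything after that point is garbage.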


kragen 1269 days ago

it takes too long to run
