|
|
|
|
|
by Ari_Rahikkala
1223 days ago
|
|
> Models like ChatGPT aren’t eligible for the Hutter Prize for a variety of reasons, one of which is that they don’t reconstruct the original text precisely—i.e., they don’t perform lossless compression. Small nit: The lossiness is not a problem at all. Entropy coding turns an imperfect, lossy predictor into a lossless data compressor, and the better the predictor, the better the compression ratio. All Hutter Prize contestants anywhere near the top use it. The connection at a mathematical level is direct and straightforward enough that "bits per byte" is a common number used in benchmarking language models, despite the fact that they are generally not intended to be used for data compression. The practical reason why a ChatGPT-based system won't be competing for the Hutter Prize is simply that it's a contest about compressing a 1GB file, and GPT-3's weights are both proprietary and take up hundreds of times more space than that. |
|
Apparently it leads the compression of enwik9 ( http://www.mattmahoney.net/dc/text.html ) . Not sure why it isn't eligible for the Hutter Prize, there's some speculations in the previous discussion but I don't know whether they're true.