| People in the HPC/classical supercomputing space have done this sort of thing for a while. There's a fair amount of literature on lossless floating point compression, such as Martin Burtscher's work or stuff out of LLNL (fpzip): https://userweb.cs.txstate.edu/~burtscher/
https://computing.llnl.gov/projects/floating-point-compressi... but it tends to be very application specific, relying on high correlation / small deltas between neighboring values in a 2d/3d/4d/etc floating point array (e.g., you are compressing neighboring temperature grid points in a PDE weather simulation model; temperatures in neighboring cells won't differ by much). In a lot of other cases (e.g., machine learning) the floating point significand bits (and sometimes the sign bit) tend to be incompressible noise. The exponent is the only thing that is really compressible, and the xor trick does not help you as much because neighboring values can still vary a fair bit in their exponents. An entropy encoder works well for that instead (it encodes closer to the actual underlying data distribution/entropy), and it doesn't depend on neighboring floats having similar exponents. In 2022, I created dietgpu, a library to losslessly compress/decompress floating point data at up to 400 GB/s on an A100. It uses a general-purpose asymmetric numeral system (ANS) encoder/decoder on GPU (the first such implementation of general ANS on GPU, predating nvCOMP) for exponent compression. We have used this to losslessly compress floating point data sent between GPUs (e.g., over InfiniBand/NVLink/Ethernet/etc) when training massive ML models, speeding up overall wall clock training time across 100s/1000s of GPUs without changing anything about how the training works (it's lossless compression, so it computes the same thing that it did before). https://github.com/facebookresearch/dietgpu |
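To make the exponent-vs-significand point concrete, here is a rough sketch that measures the empirical entropy of each float32 field. It is not dietgpu code and does no ANS coding; it only estimates how many bits per value an ideal entropy coder would need for each field. The Gaussian samples with standard deviation 0.02 are an assumption standing in for ML weights or gradients, and the helper names (`float32_bits`, `entropy_bits`, `field_entropies`) are made up for illustration.

```python
import math
import random
import struct
from collections import Counter


def float32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def entropy_bits(symbols) -> float:
    """Empirical Shannon entropy of a symbol sequence, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def field_entropies(values) -> dict:
    """Split each float32 into sign / exponent / top significand byte
    and measure the entropy of each field independently."""
    bits = [float32_bits(v) for v in values]
    signs = [b >> 31 for b in bits]                    # 1 sign bit
    exponents = [(b >> 23) & 0xFF for b in bits]       # 8 exponent bits
    significand_top8 = [(b >> 15) & 0xFF for b in bits]  # top 8 of 23 significand bits
    return {
        "sign": entropy_bits(signs),
        "exponent": entropy_bits(exponents),
        "significand_top8": entropy_bits(significand_top8),
    }


# Assumed stand-in for ML data: zero-mean Gaussian noise, fixed seed.
rng = random.Random(0)
values = [rng.gauss(0.0, 0.02) for _ in range(100_000)]
result = field_entropies(values)

for name, h in result.items():
    print(f"{name:>16}: {h:.2f} bits")
```

On data like this the exponent byte should come out well under its 8 stored bits, while the sign and the significand byte sit close to their full width, which is why only the exponent is worth entropy coding. Note that nothing here looks at neighboring values, so the estimate does not depend on adjacent floats being similar.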
One of the key innovations in the AMBER MD engine that made it work OK on cheaper systems was lossless floating point compression. It still impresses me that you can compress floats, send them over MPI, and decompress them, all faster/lower latency than the transport can send the uncompressed data.