by xaa 3910 days ago
It's too bad that this is for sparse data only. ML datasets have differing degrees of sparsity, and when the sparsity gets low enough, it's more efficient to use dense matrices, even when there are still missing values.

Also if you have dense data, you can use mmap, which isn't very space efficient but is very fast. I guess it could also be made to be space efficient if you use a filesystem with transparent compression.
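For illustration, here is a minimal sketch of the mmap approach for dense data, using only the Python standard library. The matrix shape, the row-major float64 layout, and the `read_cell` helper are assumptions made up for this example, not anything from the project under discussion.

```python
import mmap
import os
import struct
import tempfile

ROWS, COLS = 4, 3  # hypothetical matrix shape, chosen for the example

# Write a small dense matrix to a temporary file as row-major
# little-endian float64 values; cell (r, c) holds r * COLS + c.
values = [float(r * COLS + c) for r in range(ROWS) for c in range(COLS)]
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(struct.pack(f"<{ROWS * COLS}d", *values))

def read_cell(mm, row, col):
    """Read one float64 straight out of the mapping (hypothetical helper)."""
    offset = (row * COLS + col) * 8  # 8 bytes per float64
    return struct.unpack_from("<d", mm, offset)[0]

# Map the file read-only; the OS pages data in on demand, so only
# the parts of the file that are actually touched get read.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        cell = read_cell(mm, 2, 1)

os.remove(path)
print(cell)
```

Every cell costs a fixed 8 bytes here, zero or not, which is the space inefficiency mentioned above. In exchange, finding a cell is a single multiplication.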

1 comment

If you combine mmap with a "filesystem with transparent compression" and still want the efficiency of mmap, mmap will only see the compressed data.

If you want your mmap to magically see the uncompressed data, your file system will have to decompress the data, and that doesn't come for free.

I would try to aim for compression in the application, as data size will likely be the bottleneck in reading and writing such files. If your data isn't very sparse, you could delta encode the indices of the non-zero columns, and use some variable-length encoding for the deltas. Compressing each row of deltas may help after the delta encoding (especially if the row is reasonably dense, because then you expect the deltas to be small).
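A rough sketch of what delta plus variable-length encoding could look like, assuming each row's non-zero column indices are sorted and strictly increasing. The function names are made up for this example, and the varint scheme (7 payload bits per byte, high bit meaning "more bytes follow") is just one common choice.

```python
def encode_varint(n):
    """Encode a non-negative int using 7 payload bits per byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_row(indices):
    """Delta-encode sorted column indices, then varint-encode each delta."""
    out = bytearray()
    prev = 0
    for idx in indices:
        out += encode_varint(idx - prev)  # small gaps become single bytes
        prev = idx
    return bytes(out)

def decode_row(data):
    """Invert encode_row: read the varints back and undo the deltas."""
    indices = []
    value = shift = prev = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            indices.append(prev)
            value = shift = 0
    return indices

row = [3, 4, 7, 300, 301, 100000]  # non-zero column indices of one row
packed = encode_row(row)
print(len(packed), decode_row(packed) == row)
```

The six indices pack into 9 bytes, versus 24 bytes as fixed 32-bit integers. The denser the row, the more deltas fit in a single byte, and a general-purpose compressor run over the packed bytes may shrink them further.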

Once you go down that route, you have sacrificed simplicity, so you might just as well encode your floats, too.