|
|
|
|
|
by xaa
3910 days ago
|
|
It's too bad that this is for sparse data only. ML datasets have differing degrees of sparsity, and when the sparsity gets low enough, it's more efficient to use dense matrices, even when there are still missing values. Also if you have dense data, you can use mmap, which isn't very space efficient but is very fast. I guess it could also be made to be space efficient if you use a filesystem with transparent compression. |
|
If you want your mmap to magically see the uncompressed data, your file system will have to do decompress the data, and that doesn't come for free.
I would try and aim for compression in the application, as data size likely will be the bottleneck in reading and writing such files. If your data isn't very sparse, you could delta encode the indices of the non-zero columns, and use some variable-length encoding for it. Compressing each row of deltas may help after the delta encoding (especially if it is reasonably dense, because you expect the deltas to be small).
Once you go down that route, you have sacrificed simplicity, so you might just as well encode your floats, too.