by kingsleyopara
677 days ago
I struggle to see the utility of projects like this. For tabular data in active use, decompression will still require the same peak memory, so optimizing data types (e.g., reducing float and integer precision, using categorical columns) is more effective. For storage or unused data, a more portable and supported solution like Apache Parquet, which offers native compression, or simply gzipping a CSV, seems more practical.
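To make the data-type point concrete, here is a rough sketch of what I mean, assuming pandas and NumPy are available. The frame and its column names (`city`, `visits`, `price`) are made up for illustration, and note that the `float32` cast is lossy, so it only applies where reduced precision is acceptable.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the column names and values are invented for this sketch.
n = 100_000
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "city": rng.choice(["Lagos", "Paris", "Lima"], size=n),  # repeated strings
    "visits": np.arange(n) % 100,                            # small non-negative ints
    "price": np.linspace(0.0, 1.0, n),                       # float64 by default
})

before = df.memory_usage(deep=True).sum()

slim = df.assign(
    # Few distinct values: store integer codes plus a small lookup table.
    city=df["city"].astype("category"),
    # Values fit in 0..255, so pandas can pick the smallest unsigned type.
    visits=pd.to_numeric(df["visits"], downcast="unsigned"),
    # Lossy: halves the storage at the cost of float precision.
    price=df["price"].astype("float32"),
)

after = slim.memory_usage(deep=True).sum()
print(f"{before:,} bytes -> {after:,} bytes")
```

How much this saves depends entirely on the data, so I would measure with `memory_usage(deep=True)` rather than assume.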
> The library's main goal is to compress data frames, excel and csv files so that they consume less space to overcome memory errors. Also to enable dealing with large files that can cause memory errors when reading them in python or that cause slow operations. With lzhw, we can read compressed files and do operations column by column and on specific rows only on chunks that we are interested in.
If I understand it correctly, the goal is to keep a losslessly compressed copy of the data in memory and to provide ways to work with it column by column, or even in chunks, to reduce the amount of memory needed to complete an operation. It also handles the data generally (not necessarily categorical data) and losslessly (you cannot impose lossiness arbitrarily).
But this library seems to be designed for a very niche purpose. The docs mention laptops here and there, and the use case is a dataset just larger than your laptop's memory whose losslessly compressed form still fits. That makes it hard to write production code with, as the advantage of compression is unpredictable. Even if it is just for exploratory data analysis, it burdens your mental model, as you really need to be in just the right spot for this to be useful. (There are techniques that can stream data from a file to handle data bigger than available memory.)
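As a minimal sketch of the streaming approach mentioned above, using only the Python standard library: the file is read one row at a time from a gzipped CSV, so memory use stays roughly constant regardless of file size. The helper name `column_mean` and the demo file are my own inventions for illustration, not anything from lzhw.

```python
import csv
import gzip
import os
import tempfile


def column_mean(path, column):
    """Stream a gzipped CSV row by row and return the mean of one numeric column."""
    total = 0.0
    rows = 0
    with gzip.open(path, "rt", newline="") as fh:
        for record in csv.DictReader(fh):
            total += float(record[column])
            rows += 1
    return total / rows if rows else float("nan")


# Write a small demo file; a real one could be far larger than available memory.
path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(path, "wt", newline="") as fh:
    writer = csv.writer(fh)
    writer.writerow(["id", "value"])
    for i in range(1000):
        writer.writerow([i, i * 0.5])

mean_value = column_mean(path, "value")
print(mean_value)
```

This only works for operations that can be computed incrementally (sums, counts, filters); anything needing random access or a full sort takes more care.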