|
|
|
|
|
by makmanalp
3360 days ago
|
|
So stuff like this or marshmallow is more for cases when you have some database / ORM objects and you want to serialize them out to a json object, or you want to process form/POST data into a well-structured json or database object. For your use case, it's more about large amounts of tabular data and efficient (binary / columnar / compressed) serialization and queryability. I'd say that the defacto standard for that is the HDF5 standard, which PyTables supports (http://www.pytables.org/). This is what pandas uses under the hood and I've been using this with hundreds of millions of rows with no problem. Arrow is slightly more different - it's a specification for the in-memory layout of data that enables faster computation. This is more about what happens if you have data in memory and you want to use it with another tool - serializing / deserializing, munging formats is a waste of time if tools can standardize how they store dataframes in memory and can work on each other's tables. As far as I understand, Feather is not an implementation of arrow (that would be up to the processing tools like pandas), but supports a way of saving and loading that in-memory format to and from disk efficiently and in an interoperable way. (https://github.com/wesm/feather) Also of note is parquet, which has similar goals to HDF and feather, but the continuum / dask people have been working on a wrapper for that called fastparquet (https://github.com/dask/fastparquet). In my experience it has a few hitches right now but works darn well, and gives me better performance than HDF. This is also one of the hadoop ecosystem defacto standards for storage formats, which again is good for interop. |
|
Googling turned up very little for me.
TIA
Edit: libraries mentioned in thread:
PMML, Arrow, Dill, marshmallow, pytables, parquet/fastparquet (and pickle, obviously)