by makmanalp 3360 days ago
So stuff like this or marshmallow is more for cases where you have some database / ORM objects and you want to serialize them out to JSON, or you want to process form/POST data into a well-structured JSON or database object.
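To make that concrete, here's a rough sketch of the two directions using only the standard library (dataclasses + json) rather than marshmallow itself. The `User` class, its fields, and the `dump_user` / `load_user` helpers are made up for illustration; a real schema library does far more validation than this.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class User:
    # Hypothetical stand-in for an ORM / database object.
    id: int
    name: str
    email: str


def dump_user(user):
    """Serialize an object out to a JSON string."""
    return json.dumps(asdict(user))


def load_user(form):
    """Turn loosely typed form/POST data into a well-structured object."""
    email = str(form["email"]).strip()
    if "@" not in email:
        raise ValueError("invalid email: %r" % email)
    return User(id=int(form["id"]), name=str(form["name"]).strip(), email=email)


user = load_user({"id": "7", "name": " Ada ", "email": "ada@example.com"})
payload = dump_user(user)
```

The point is just that this kind of tool works one record at a time, which is a different problem from storing millions of rows.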

For your use case, it's more about large amounts of tabular data and efficient (binary / columnar / compressed) serialization and queryability. I'd say the de facto standard for that is HDF5, which PyTables supports (http://www.pytables.org/). pandas uses PyTables under the hood for its HDF5 support, and I've been using it with hundreds of millions of rows with no problem.

Arrow is a bit different: it's a specification for the in-memory layout of data that enables faster computation. It matters when you have data in memory and want to use it with another tool. Serializing / deserializing and munging formats is a waste of time if tools can standardize how they store dataframes in memory and work on each other's tables. As far as I understand, Feather is not an implementation of Arrow (that would be up to the processing tools like pandas); rather, it provides a way of saving and loading that in-memory format to and from disk efficiently and interoperably. (https://github.com/wesm/feather)

Also of note is Parquet, which has similar goals to HDF5 and Feather. The Continuum / dask people have been working on a wrapper for it called fastparquet (https://github.com/dask/fastparquet). In my experience it has a few hitches right now but works darn well, and gives me better performance than HDF5. Parquet is also one of the Hadoop ecosystem's de facto standard storage formats, which again is good for interop.

1 comments

Do you know of a source that compares these different libraries in terms of capabilities, focus/use cases, size limits, performance, format support, etc.?

Googling turned up very little for me.

TIA

Edit: libraries mentioned in thread:

PMML, Arrow, Dill, marshmallow, pytables, parquet/fastparquet (and pickle, obviously)

No, I don't, but some of these are apples and oranges; that was part of my point. You're conflating many different types of things.

Specifically, the ones I talked about are for storing large tabular datasets on disk. Stuff that lays out data on disk so that it's easy and efficient to query only a part of the dataset, e.g. only certain columns, or only rows that match a predicate or fall within a range of indexes. These can store hundreds of GB, no problem. They often have some sort of compression, like LZ, snappy or blosc, that has relatively low CPU overhead while giving decent compression. I tried to separate the file formats (which are readable from other languages) from the Python libraries that write them. For this, I'd default to PyTables / HDF5, barring some specific use case where you'd already know which other one you need.
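As a rough sketch of what "query only a part of the dataset" looks like with pandas on top of PyTables (this assumes the `tables` package is installed; the file name, key, and column names are invented, and the data is random filler):

```python
import os
import tempfile

import numpy as np
import pandas as pd  # HDF5 support requires the PyTables package ("tables")

# Made-up data standing in for a much larger table.
n = 100_000
df = pd.DataFrame({
    "user_id": np.arange(n),
    "score": np.random.rand(n),
    "clicks": np.random.randint(0, 50, size=n),
})

path = os.path.join(tempfile.mkdtemp(), "events.h5")

# format="table" writes a queryable layout; data_columns lists the columns
# that may appear in a `where` predicate. blosc compression ships with PyTables.
df.to_hdf(path, key="events", format="table",
          data_columns=["user_id"], complib="blosc", complevel=5)

# Read back only two columns, and only the rows matching the predicate,
# without loading the whole table into memory.
subset = pd.read_hdf(path, key="events",
                     where="user_id < 1000",
                     columns=["user_id", "score"])
```

With the default fixed format you can't pass `where`, so the table format is the one to use when you care about partial reads.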

Dill / pickle are for serializing generic Python objects. I wouldn't really use them to store anything big, but they're very convenient for complicated data structures, like hierarchies of objects and classes, e.g. to save the current running state of your program. You don't have to think about storage formats, layouts and serialization routines: if you have a list of Python objects, you can pickle it. Pickle is built in, while dill is an external library that nicely handles a bunch more edge cases.
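A minimal sketch of that "just pickle it" workflow, with a made-up `state` dict standing in for a program's running state (the keys and file name are only for illustration):

```python
import os
import pickle
import tempfile
from collections import deque
from datetime import datetime

# Nested mix of built-in containers and class instances, no schema needed.
state = {
    "started": datetime(2016, 1, 1, 12, 0),
    "queue": deque(["job-3", "job-4"]),
    "results": {"job-1": [1, 2, 3], "job-2": {"ok": True}},
}

path = os.path.join(tempfile.mkdtemp(), "state.pkl")

# Write the whole object graph in one call.
with open(path, "wb") as f:
    pickle.dump(state, f, protocol=pickle.HIGHEST_PROTOCOL)

# Later (or in another process), load it back as equivalent Python objects.
with open(path, "rb") as f:
    restored = pickle.load(f)
```

Two caveats: the classes involved have to be importable when you load, and you should only unpickle data you trust, since loading a pickle can execute arbitrary code.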

PMML seems like an XML-based format specifically for trained machine learning models. I don't really know much about it.