| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by makmanalp 3357 days ago

No, I don't, but some of these are apples and oranges, that was part of my point. You're conflating many different types of things.

Specifically, the ones I talked about are for storing large tabular datasets on disk. Stuff that lays out data on disk so that it's easy and efficient to query only a part of the dataset, e.g. only certain columns or only certain rows that match a predicate or within a range of indexes. These can store hundreds of gb, no problem. They often have some sort of compression, like LZ, snappy or blosc that has relatively low CPU overhead while giving decent compression. I tried to separate the file formats (which are readable from other languages) from the python libraries that write them. For this, I'd default to pytables / HDF5, barring some specific use case where you'd already know what other one you need.

Dill / pickle are for serializing generic python objects. I wouldn't really use them to store anything big, but it's very convenient for complicated data structures, like hierarchies of objects and classes. E.g. to save the current running state of your program. You don't have to think about storage formats and layouts and serialization routines, if you have a list of python objects you can pickle it. Pickle is built in, while dill is an external library that nicely handles a bunch more edge cases.

PMML seems like an XML based format specifically for trained machine learning models. Don't really know much about this.