by mhneu 3358 days ago
Python's data infrastructure has a huge problem: serialization, and with it, saving the results of data analysis.

A good serialization library should serialize:

  - classes/objects (best practice: objects for holding data)
  - pandas/numpy objects (must have: minimizing space)
  - namedtuples (currently: a mess, factory implementation)
  - dicts and lists of dicts (must have: space efficiency)

Compare to Matlab: save(f, 'anyobject'); anyobject=load(f)

Python is terrible at this, which limits its use in real data analysis environments and its ability to compete with Matlab.

4 comments

> Compare to Matlab: save(f, 'anyobject'); anyobject=load(f)

If you want to read Matlab files in Python, you can use `scipy.io.loadmat('file.mat')`. PyTables (built on HDF5) is a better solution, since the HDF5 format is a lot more flexible than Matlab's (ime). But Parquet looks like the best solution going forward: it's gaining a lot of mindshare as the go-to flexible format for data and is (or will be) used in Arrow.

But really, Matlab is on par with pickles when it comes to serialisation. It's a trap solution.

Actually, since Matlab v7.3, .mat files are HDF5 files.

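
One way to check this from Python without extra libraries is to look for the HDF5 signature bytes. This is a rough sketch, assuming the signature sits at offset 0 or at 512, 1024, 2048, ... (v7.3 .mat files are understood to carry a 512-byte text header before the HDF5 data). The helper `looks_like_hdf5` and the fake file are made up for illustration.

```python
import os
import tempfile

# The 8-byte signature that starts an HDF5 superblock.
HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def looks_like_hdf5(path, max_offset=4096):
    """Hypothetical helper: True if the signature is at 0, 512, 1024, ..."""
    with open(path, "rb") as fh:
        offset = 0
        while offset <= max_offset:
            fh.seek(offset)
            if fh.read(len(HDF5_SIGNATURE)) == HDF5_SIGNATURE:
                return True
            # Candidate offsets: 0, then 512 doubled each time.
            offset = 512 if offset == 0 else offset * 2
    return False


# Fake file for demonstration only: a 512-byte text header plus the signature.
fake = os.path.join(tempfile.mkdtemp(), "fake_v73.mat")
with open(fake, "wb") as fh:
    fh.write(b"MATLAB 7.3 MAT-file".ljust(512, b" ") + HDF5_SIGNATURE)

is_hdf5 = looks_like_hdf5(fake)
```

For actually reading the contents of such a file, an HDF5 library such as PyTables is the practical route; the check above only sniffs the container format.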
To expand on fnord: to my knowledge, pickle handles all of these things. It's still a bad solution, but it does everything you want.

    import pickle
    pickle.dump(anyobject, f)  # object first, then the open binary file
    anyobject = pickle.load(f)
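
For reference, a minimal sketch of a Matlab-style save/load pair built on pickle. The helper names `save` and `load` and the sample data are assumptions for illustration, not part of any library; only `pickle.dump` and `pickle.load` are real API.

```python
import datetime
import os
import pickle
import tempfile


def save(path, obj):
    """Write any picklable object to `path` (hypothetical helper)."""
    with open(path, "wb") as fh:  # pickle needs a binary file handle
        pickle.dump(obj, fh)      # object first, then the file


def load(path):
    """Read back whatever `save` wrote. Only unpickle files you trust."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Made-up sample data: a list of dicts holding mixed Python objects.
results = [
    {"run": 1, "score": 0.91, "date": datetime.date(2016, 1, 1)},
    {"run": 2, "score": 0.87, "tags": ("a", "b")},
]

path = os.path.join(tempfile.mkdtemp(), "results.pkl")
save(path, results)
restored = load(path)
```

Classes and namedtuples round-trip the same way, as long as they are defined at module level in code that can be imported when the file is loaded.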
Pickle had size constraints that made it unsuitable for certain ML applications.

Indeed, but I expect that's also true for Matlab's vanilla solution.

Does using protocol version 4 help with this?
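
On the protocol question: protocol 4 was added in Python 3.4 and, per the pickle documentation, adds support for very large objects. A minimal sketch of selecting it explicitly follows; the payload is a small stand-in, not a multi-gigabyte object, so this only shows how the protocol is chosen.

```python
import pickle

payload = {"weights": bytes(1024)}  # stand-in for a large model

data = pickle.dumps(payload, protocol=4)  # request protocol 4 explicitly

# Pickles from protocol 2 onward start with the PROTO opcode (0x80)
# followed by the protocol number.
protocol_used = data[1]

restored = pickle.loads(data)  # the protocol is detected automatically on load
```

Whether this removes the constraint for a given workload still depends on available memory, since pickle builds the whole object in memory on load.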

Thanks for expanding on that, mhneu. Our primary focus with Kim has certainly been serializing/marshaling JSON, though we've used it for plenty of other use cases.

It's great to get a view of other problems people are experiencing.

Now that we've finished wrapping up 1.0.0, we're going to spend some time on the roadmap of new features. I personally feel that variation in use cases from our own is only going to help make Kim better, so we'll defo look into this problem some more in the near future. Right now, though, I couldn't say for sure what Kim would have to offer when working with Pandas etc., as we've simply never tried.

defo?
definitely.
I believe the benchmark is set by R and its RData format. It saves everything in the R domain: ML models, dataframes, everything.

Works pretty well. I know of large financial firms that use this in production to load large trained models that are hundreds of GB in size.