| HN Mirror


by mhneu 3358 days ago

Python's data infrastructure has a huge problem: serialization, and with it saving the results of data analysis.

A good serialization library should serialize:

  - classes/objects (best practice: objects for holding data)
  - pandas/numpy objects (must have: minimizing space)
  - namedtuples (currently: a mess, factory implementation)
  - dicts and lists of dicts (must have: space efficiency)
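
One way the namedtuple item above can show up in practice, sketched here under the assumption that JSON is the target format (the post does not name one). `Point` is a hypothetical record type used only for illustration. The standard `json` module treats a namedtuple as a plain tuple, so the field names and the type are lost on the way out:

```python
import json
from collections import namedtuple

# Hypothetical record type, used only for illustration.
Point = namedtuple("Point", ["x", "y"])

p = Point(x=1, y=2)

# json sees a namedtuple as a plain tuple, so the field names are dropped.
encoded = json.dumps(p)
decoded = json.loads(encoded)  # comes back as a list, not a Point

# One possible workaround: write the fields as a dict and rebuild explicitly.
encoded_dict = json.dumps(p._asdict())
restored = Point(**json.loads(encoded_dict))
```

The workaround only round-trips because the reader already knows the data is a `Point`. Nothing in the file records that, which is arguably the gap being described.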

Compare to Matlab: save(f, 'anyobject'); anyobject=load(f)

Python is terrible at this, which limits its use in real data analysis environments and its ability to compete with Matlab.

4 comments

fnord123 3358 days ago

> Compare to Matlab: save(f, 'anyobject'); anyobject=load(f)

If you want matlab files in Python you can use `scipy.io.loadmat('file.mat')`. PyTables (built on hdf5) is a better solution since the hdf5 format is a lot more flexible than matlab's (ime). But Parquet is looking to be the best solution moving forward as it's gaining a lot of mindshare as the go-to flexible format for data and will be / is used in Arrow.

But really, Matlab is on par with pickles when it comes to serialisation. It's a trap solution.
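
A minimal sketch of the `scipy.io` route mentioned above, assuming NumPy and SciPy are installed. The variable name `samples` and the data are made up for illustration. `savemat` is the writing counterpart of `loadmat`:

```python
import os
import tempfile

import numpy as np
from scipy.io import loadmat, savemat

# Made-up data standing in for a real result.
samples = np.arange(6, dtype=np.float64).reshape(2, 3)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file.mat")
    savemat(path, {"samples": samples})  # variables are passed as a dict
    contents = loadmat(path)             # returns a dict keyed by variable name

loaded = contents["samples"]
```

The returned dict also carries a few metadata keys such as `__header__`. This covers arrays; it is not a general "save any object" mechanism.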


auxym 3358 days ago

Since Matlab v7.3, .mat files saved in the v7.3 format are actually HDF5 files.


joshuamorton 3358 days ago

To expand on fnord, to my knowledge, pickle handles all of these things. It's still a bad solution, but it does everything you want.

    import pickle
    pickle.dump(anyobject, f)   # object first, then a file opened in "wb" mode
    anyobject = pickle.load(f)  # f reopened in "rb" mode
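
A fuller round trip, as a sketch of the claim that pickle copes with classes, namedtuples, and lists of dicts. `Experiment`, `Reading`, `save`, and `load` are hypothetical names chosen to mirror the Matlab one-liner; they are not part of any library:

```python
import pickle
import tempfile
from collections import namedtuple
from pathlib import Path

# Hypothetical types; defined at module level so pickle can find them by name.
Reading = namedtuple("Reading", ["sensor", "value"])


class Experiment:
    def __init__(self, name, readings):
        self.name = name
        self.readings = readings


def save(path, obj):
    """Write any picklable object to path."""
    with open(path, "wb") as f:
        pickle.dump(obj, f)


def load(path):
    """Read back whatever object was pickled to path."""
    with open(path, "rb") as f:
        return pickle.load(f)


original = {
    "experiment": Experiment("run-1", [Reading("a", 0.5), Reading("b", 1.5)]),
    "rows": [{"id": 1, "ok": True}, {"id": 2, "ok": False}],
}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "anyobject.pkl"
    save(path, original)
    restored = load(path)
```

Part of why this remains a "bad solution": pickle stores classes by reference to their module and name, so the loading code must be able to import the same definitions, and unpickling data from an untrusted source can execute arbitrary code.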


sidlls 3358 days ago

Pickle has size constraints that make it unsuitable in certain ML applications.


joshuamorton 3358 days ago

Indeed, but I expect that's also true for matlab's vanilla solution.


vosper 3358 days ago

Does using protocol version 4 help with this?
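
For context on the question, protocol 4 was added in Python 3.4 (PEP 3154) and lists support for very large objects among its features. The thread does not say whether that covers the specific constraints mentioned above. A minimal sketch of requesting it explicitly, with `data` as a small stand-in for a real model object:

```python
import pickle

# Small stand-in for a large model object.
data = {"weights": list(range(1000))}

# Request protocol 4 explicitly; it is not the default on every Python version.
blob = pickle.dumps(data, protocol=4)

# From protocol 2 onward the stream starts with the PROTO opcode (0x80),
# followed by one byte holding the protocol number.
header = blob[:2]

# No protocol argument is needed on load; it is read from the stream.
restored = pickle.loads(blob)
```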


mikeywaites 3358 days ago

Thanks for expanding on that, mhneu. Our primary focus with Kim has certainly been serializing/marshaling JSON, though we've used it for plenty of other use cases.

It's great to get a view of other problems people are experiencing.

Now that we've finished wrapping up 1.0.0, we're going to spend some time on the roadmap of new features. I personally feel that variation in use cases from our own is only going to help make Kim better, so we'll defo look into this problem some more in the near future. Right now, though, I couldn't say for sure what Kim would have to offer when working with Pandas etc., as we've simply never tried.

defo?

definitely.

sandGorgon 3358 days ago

I believe the benchmark is set by R and its RData format. It saves everything in the R domain: ML models, dataframes, everything.

Works pretty well. I know of large financial firms that use this in production to load large trained models, hundreds of GB in size.
