by albertzeyer 792 days ago
https://xkcd.com/927/ ?

We still use HDF (https://en.wikipedia.org/wiki/Hierarchical_Data_Format).

But I wonder: if I were choosing a new file format today, what should I pick? Nimble is maybe too new, and there is too little experience with it (outside Meta).

Is there a good overview of all the available options anywhere, with a fair comparison? Here are some I found, though they are older:

https://www.hopsworks.ai/post/guide-to-file-formats-for-mach...

https://iopscience.iop.org/article/10.1088/1742-6596/1085/3/...

https://github.com/pangeo-data/pangeo/issues/285

1 comment

Well, Parquet is so widely supported that it's my default pick, unless you can explain why it's not the right fit.

Though I'll say that if your primary use case is "higher-dimensional arrays", neither Parquet nor the similar formats are likely to be a good fit. These are columnar formats, where each column has its own name, datatype, and so on, not formats for multi-dimensional arrays of numbers. That's a different problem. A Parquet column can hold lists (even nested lists), but there's no special handling of matrices.
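
To make the difference concrete, here is a rough pure-Python sketch. It does not use any Parquet or HDF library; `list_column`, `array_store`, and both helper functions are made-up names, and the two dicts are only simplified models of the layouts, not the real on-disk formats. It stores the same 2 x 3 matrix once as a column of lists and once as a flat buffer with an explicit shape, then reads one matrix column from each.

```python
# Pure-Python sketch; no Parquet or HDF5 library is used here.
# Both layouts are simplified, hypothetical models of the idea.

matrix = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

# Columnar-style layout: one named column whose values are lists.
# Nothing in the column records that the lists form a 2 x 3 matrix.
list_column = {
    "name": "values",
    "type": "list<float>",
    "rows": [list(row) for row in matrix],
}

# Array-style layout: a flat row-major buffer plus an explicit shape.
array_store = {
    "shape": (2, 3),
    "data": [x for row in matrix for x in row],
}


def column_from_list_column(col, j):
    """Read matrix[:, j] by visiting every list value in the column."""
    return [row[j] for row in col["rows"]]


def column_from_array_store(store, j):
    """Read matrix[:, j] with stride arithmetic on the flat buffer."""
    _, n_cols = store["shape"]
    return store["data"][j::n_cols]


print(column_from_list_column(list_column, 1))
print(column_from_array_store(array_store, 1))
```

Both calls return the same values, but only the second layout knows the shape up front, which is the sort of thing a format needs before it can do chunking or slicing along an arbitrary axis.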