|
"Otherwise, HDF5 offers every single advantage that zarray has and is much more mature, stable, better documented, and has better support." Absolutely not. HDF5 is an awful format with terrible implementations. For example, try writing a Python program with multiple threads where each thread writes to a different HDF5 file. This should just work, since there's no concurrent access to any one file. And yet it doesn't, because HDF5 implementations are piles of ancient C code that use lots of global state. There's no technical reason for this; one could easily store all the state needed in a per-file object. But back in the day, software engineering standards were lower (especially for scientists), and HDF5 changes at a glacial pace. I've been bitten by this particular bug, but you really have to wonder: given how poorly it speaks of the software engineering behind HDF5 implementations, what else is broken in the code or specifications? If you're working in a situation where it makes sense to have things on disk or on some sort of NFS share, use HDF5. If you're working with objects in a cloud bucket, you'll incur additional overhead with HDF5, as you'll have to read its table of indices, then make range requests to each chunk. Zarr is optimized for the cloud use case. When last I looked, there were no open source HDF5 implementations that were smart enough to do range requests to cloud-hosted HDF5 files. Has this changed? |
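
To make the cloud-bucket point above concrete, here is a rough toy sketch of the two access patterns, not real HDF5 or Zarr code. Everything in it is an assumption made up for illustration: `FakeBucket` is a hypothetical in-memory stand-in for an object store that counts GET requests, and the "single file" layout uses a simple JSON index, whereas real HDF5 uses its own chunk index structures. Real Zarr also reads a small metadata object once per array, and a smart client could cache the index, so treat the request counts as a simplified picture of the overhead, not a benchmark.

```python
import json
import struct


class FakeBucket:
    """Hypothetical in-memory object store that counts GET requests."""

    def __init__(self):
        self.objects = {}
        self.requests = 0

    def put(self, key, data):
        self.objects[key] = bytes(data)

    def get(self, key, start=0, end=None):
        """Return bytes [start, end) of an object, like an HTTP range GET."""
        self.requests += 1
        return self.objects[key][start:end]


def write_single_file(bucket, key, chunks):
    """Pack all chunks into one object behind an index of (offset, length)."""
    index, body, offset = [], b"", 0
    for chunk in chunks:
        index.append((offset, len(chunk)))
        body += chunk
        offset += len(chunk)
    header = json.dumps(index).encode()
    # Layout: 8-byte little-endian index length, the index, then chunk data.
    bucket.put(key, struct.pack("<Q", len(header)) + header + body)


def read_chunk_single_file(bucket, key, i):
    """Read chunk i: fetch the index first, then range-request the chunk."""
    (header_len,) = struct.unpack("<Q", bucket.get(key, 0, 8))
    index = json.loads(bucket.get(key, 8, 8 + header_len))
    offset, length = index[i]
    base = 8 + header_len
    return bucket.get(key, base + offset, base + offset + length)


def write_chunk_objects(bucket, prefix, chunks):
    """Store each chunk as its own object under a predictable key."""
    for i, chunk in enumerate(chunks):
        bucket.put(f"{prefix}/{i}", chunk)


def read_chunk_object(bucket, prefix, i):
    """Read chunk i directly: the key is computable, so no index lookup."""
    return bucket.get(f"{prefix}/{i}")


chunks = [bytes([i]) * 4 for i in range(3)]
bucket = FakeBucket()
write_single_file(bucket, "data.bin", chunks)
write_chunk_objects(bucket, "data.chunked", chunks)

bucket.requests = 0
assert read_chunk_single_file(bucket, "data.bin", 2) == chunks[2]
single_file_requests = bucket.requests

bucket.requests = 0
assert read_chunk_object(bucket, "data.chunked", 2) == chunks[2]
per_chunk_requests = bucket.requests

print(single_file_requests, per_chunk_requests)
```

In this toy setup the single-file read needs extra round trips to locate the chunk before fetching it, while the one-object-per-chunk layout goes straight to the data. On a local disk or NFS share those extra reads are cheap; against a bucket, each one is a separate network request.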