by MichaelSalib 2310 days ago

Otherwise, HDF5 offers every single advantage that zarray has and is much more mature, stable, better documented, and has better support.

Absolutely not. HDF5 is an awful format with terrible implementations. For example, try writing a Python program with multiple threads where each thread writes to a different HDF5 file. This should just work -- there's no concurrent access. And yet it doesn't, because HDF5 implementations are piles of ancient C code that use lots of global state. There's no technical reason for this; one could easily store all the state needed in a per-file object. But back in the day, software eng standards were lower (especially for scientists) and HDF5 changes at a glacial pace.

I've been bitten by this particular bug, but you really have to wonder: given how poorly it speaks to the software engineering behind HDF5 implementations, what else is broken in the code or specifications?

If you're working in a situation where it makes sense to have things on disk or some sort of NFS share, use HDF5. If you're working with objects in a cloud bucket, you'll incur additional overhead with HDF5, as you'll have to read its table of indices, then make range requests to each chunk. Zarr is optimized for the cloud use case.
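As a rough illustration of the object-per-chunk layout: in Zarr's v2 default layout each chunk is stored as its own object, keyed by the chunk's grid indices joined with dots. Once the chunk shape is known from the array's small metadata object, a reader can work out which keys a slice needs without consulting an index inside the file. A minimal sketch; the function name `chunk_keys` and the shapes are made up for illustration, and the selection is assumed to be non-empty.

```python
from itertools import product

def chunk_keys(selection, chunk_shape):
    """Return Zarr-v2-style keys of the chunks overlapping a selection.

    selection   -- one (start, stop) pair per dimension, stop exclusive
    chunk_shape -- chunk length along each dimension
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first = start // size          # chunk holding the first element
        last = (stop - 1) // size      # chunk holding the last element
        per_dim.append(range(first, last + 1))
    # One key per combination of chunk indices, e.g. "1.0".
    return [".".join(map(str, idx)) for idx in product(*per_dim)]

# Rows 150..249 and columns 0..99 of an array stored in 100x100 chunks.
keys = chunk_keys([(150, 250), (0, 100)], (100, 100))
print(keys)
```

Each key maps to one GET against the bucket, and the requests can be issued in parallel.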

When last I looked, there were no open source HDF5 implementations that were smart enough to do range requests to cloud hosted hdf files. Has this changed?

3 comments

hcrisp 2310 days ago

Have you looked at pyfive [0], h5s3 [1], or Kita [2]? What about version 2.9 of h5py [3] which supports file-like object access?

  [0] https://github.com/jjhelmus/pyfive
  [1] https://h5s3.github.io/h5s3/python.html
  [2] https://www.hdfgroup.org/solutions/hdf-kita/
  [3] http://docs.h5py.org/en/stable/high/file.html#python-file-like-objects

xscott 2310 days ago

Clickable:

https://github.com/jjhelmus/pyfive

https://h5s3.github.io/h5s3/python.html

https://www.hdfgroup.org/solutions/hdf-kita/

http://docs.h5py.org/en/stable/high/file.html#python-file-li...

MichaelSalib 2310 days ago

Ah, thanks for these! But I see nothing has changed.

* pyfive is interesting but immature and doesn't seem to have any cloud bucket support
* h5s3 is an abandoned experiment that hasn't been touched in two years
* h5py is fine but again, no cloud support
* kita is a commercial offering from the HDF Group and -- I cannot stress this enough -- these people are shockingly incompetent; plus when I last looked at their system architecture diagram I thought it was a joke (well, I thought it was an intentional joke)

Efficient access to scientific datasets hosted on S3/GCP is a full blown crisis in the scientific computing community. People aren't switching to zarr for the fun of it, but because zarr is here, today, and isn't a joke, and is actually open.

hcrisp 2310 days ago

It's been a while since I worked on it, but I did get pyfive to work reading from S3 objects, using either io.BytesIO around the entire byte array read into memory or a custom class that implemented peek, seek, etc. against an S3 object (the first method was better if you need to read a majority of a large file, the second was better for a small subset of it). Note that it supports read-only access, not writes. Later I heard that I wouldn't have to use pyfive since h5py now supports file-like objects. So your comments about no cloud bucket support are not exactly true.
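The second approach (a custom seekable class) might look roughly like the sketch below. This is not the original code: `RangeReader` and its `fetch(start, end)` callable are hypothetical names, and `fetch` stands in for whatever issues a ranged GET against the bucket. Here it is backed by an in-memory blob so the example runs without network access.

```python
import io

class RangeReader(io.RawIOBase):
    """Read-only, seekable file-like view over a remote object.

    fetch(start, end) is a hypothetical callable returning bytes
    [start, end) of the object, e.g. via an HTTP Range request.
    """

    def __init__(self, fetch, size):
        super().__init__()
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return pos

    def readinto(self, buf):
        # Request only the bytes the caller asked for, clipped to the object.
        end = min(self._pos + len(buf), self._size)
        if end <= self._pos:
            return 0  # at or past the end of the object
        data = self._fetch(self._pos, end)
        buf[:len(data)] = data
        self._pos += len(data)
        return len(data)


# Stand-in for a bucket object; records which ranges were requested.
blob = bytes(range(256)) * 4
calls = []

def fake_fetch(start, end):
    calls.append((start, end))
    return blob[start:end]

f = RangeReader(fake_fetch, len(blob))
f.seek(100)
chunk = f.read(16)   # triggers exactly one ranged fetch
```

An object like this is the kind of thing the h5py file-like support linked above is meant to accept. Wrapping it in `io.BufferedReader` would likely reduce the number of tiny requests, at the cost of fetching bytes that may never be used.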

MichaelSalib 2310 days ago

To be clear, our experience using gcsfuse and friends to do basically the same things was extremely painful and a performance nightmare. The HDF format was designed for a world where seeks are free which makes cloud access very high latency and very low throughput.

kortex 2310 days ago

This is good info. I've been wary of hdf5 for some time. Nothing concrete (until this bug) but from my research it just consistently smelled fishy. The main turnoff for me was the possibility of data corruption bricking the entire dataset.

Pity, as it has on paper a lot of great concepts and features. Maybe it'll be mature enough someday, though my money is on something better from the ground up coming along.

Honestly, most of the portability advantage is moot nowadays. Chunk s3-like storage, smb, and ability to copy files from ext to ntfs (at least on *nix) means that sharing your data across platforms isn't the struggle it used to be. Windows is rapidly becoming/already is a second class citizen in science-data heavy workflows.

I ended up going with a NAS and just file system primitives for my computer vision image workflow, works great.

https://stackoverflow.com/questions/35837243/hdf5-possible-d...

https://cyrille.rossant.net/moving-away-hdf5/

xen0 2310 days ago

The main turnoff for me was the possibility of data corruption bricking the entire dataset.

A glib high-level overview of my last job, which I held for 6 years, is "write out HDF5 files". In that time, I don't recall seeing a true data corruption problem with HDF5.

Now, I ran into many other problems with HDF5, typically surrounding the newer features that came along in 1.10, and its threading limitations. The older folks at that job would mention historical issues with data corruption (often from reading files as they're being written to), but I never saw it myself.

benibela 2310 days ago

I thought you could not write parallel anything in Python because of the GIL

dekhn 2310 days ago

It's... complicated. Certainly you can write parallel code in Python despite the GIL; there are several scenarios. The shortest answer is "the multiprocessing library, when used carefully, can speed up the runtime of your CPU-intensive, multi-process Python program by spreading work across multiple processors/cores". The longer answer is: many IO-bound Python programs can be sped up using multithreading within a single Python process (because the application is mostly waiting for IO), and many CPU-intensive Python programs can be sped up using multithreading where work is done in C functions that release the GIL.

Many python programs I write end up using 8+ cores on a single machine using either multiprocessing or C functions with released GIL.
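A small sketch of the two threaded cases, using only the standard library. `slow_fetch` is a made-up stand-in for an IO-bound call (it just sleeps), and `hashlib` serves as an example of a C function that releases the GIL on large inputs. The multiprocessing case is not shown.

```python
import hashlib
import time
from concurrent.futures import ThreadPoolExecutor

def slow_fetch(i):
    """Stand-in for an IO-bound call; sleeping releases the GIL."""
    time.sleep(0.2)
    return i * i

def digest(buf):
    """CPU-bound work done in C; hashlib releases the GIL for large inputs."""
    return hashlib.sha256(buf).hexdigest()

with ThreadPoolExecutor(max_workers=4) as pool:
    # IO-bound: the four waits overlap instead of running back to back.
    t0 = time.perf_counter()
    squares = list(pool.map(slow_fetch, range(4)))
    io_elapsed = time.perf_counter() - t0

    # CPU-bound in C: the hashes can run on several cores at once.
    buffers = [bytes([i]) * (8 * 1024 * 1024) for i in range(4)]
    digests = list(pool.map(digest, buffers))

print(f"4 simulated IO calls took {io_elapsed:.2f}s")
```

On an otherwise idle machine the IO part should take roughly one sleep rather than four. Pure-Python CPU work would see no such gain from threads.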

MichaelSalib 2310 days ago

No, you can certainly write in parallel, despite the GIL. The GIL makes this inefficient if your work is CPU-bound, but for IO-bound workloads it can be fine.

xen0 2310 days ago

But the HDF5 library does not really support multi-threading at all. Compiling the library with the "threading" option just locks around every API call, so you're back to a single thread whenever you enter (compiling without it will just crash your program).

And the library does quite a lot of work when you call into it; chunk lookup, decompression, and type conversions all happen behind that lock. You can use the "direct chunk access" functions (H5Dread_chunk?) to bypass a lot of that work and do it yourself, so you get back to using multiple threads again, and that can be a big win, but having to do it sucks, and I don't think h5py exposes this functionality at all.
