by MichaelSalib 2304 days ago
Ah, thanks for these! But I see nothing has changed.

* pyfive is interesting but immature and doesn't seem to have any cloud bucket support
* h5s3 is an abandoned experiment that hasn't been touched in two years
* h5py is fine but again, no cloud support
* kita is a commercial offering from the HDF Group and -- I cannot stress this enough -- these people are shockingly incompetent; plus when I last looked at their system architecture diagram I thought it was a joke (well, I thought it was an intentional joke)

Efficient access to scientific datasets hosted on S3/GCP is a full-blown crisis in the scientific computing community. People aren't switching to zarr for the fun of it, but because zarr is here today, isn't a joke, and is actually open.

1 comment

It's been a while since I worked on it, but I did get pyfive to read from S3 objects, using either io.BytesIO wrapped around the entire object read into memory, or a custom class that implemented peek, seek, etc. against the S3 object. The first method was better if you needed to read most of a large file; the second was better for a small subset of it. Note that it supports reading only, not writing. Later I heard that I wouldn't have had to use pyfive, since h5py now supports file-like objects. So your comments about no cloud bucket support are not exactly true.
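
For anyone curious what the second approach looks like, here is a minimal sketch, not the code I actually used. The names `RangeReader` and `fetch_range` are made up for illustration, and the "remote object" here is an in-memory bytes blob so the example runs on its own. Against S3 you would presumably replace the lambda with a function that issues a ranged GET, and hand the resulting object to whatever reader accepts file-like objects.

```python
import io


class RangeReader(io.RawIOBase):
    """Read-only, seekable file-like object backed by a byte-range fetcher.

    fetch_range(start, stop) is a hypothetical callable returning the bytes
    in the half-open range [start, stop) of the remote object.
    """

    def __init__(self, fetch_range, size):
        self._fetch = fetch_range
        self._size = size
        self._pos = 0
        self.requests = 0  # number of remote fetches issued so far

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        # Seeking only moves a local cursor; no remote request is made.
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError(f"invalid whence: {whence}")
        if pos < 0:
            raise ValueError("negative seek position")
        self._pos = pos
        return self._pos

    def readinto(self, buf):
        # Every non-empty read turns into exactly one ranged fetch.
        stop = min(self._pos + len(buf), self._size)
        if stop <= self._pos:
            return 0  # at or past end of object
        data = self._fetch(self._pos, stop)
        self.requests += 1
        n = len(data)
        buf[:n] = data
        self._pos += n
        return n


# Stand-in for a remote object: slicing a local blob instead of a ranged GET.
blob = bytes(range(256)) * 4
f = RangeReader(lambda start, stop: blob[start:stop], len(blob))
f.seek(10)
chunk = f.read(4)
```

Each small read becomes its own round trip, which is why this only pays off when you touch a small subset of the file; the `requests` counter makes it easy to see how many fetches a given access pattern costs.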
To be clear, our experience using gcsfuse and friends to do basically the same thing was extremely painful and a performance nightmare. The HDF format was designed for a world where seeks are free, which makes cloud access very high latency and very low throughput.