
TL;DR: In climatology, I know people are using zarr. However, I think columnar storage as in parquet also merits consideration.

My thinking goes as follows: I'm trying to read chunks from n-dimensional data with a minimum of skips/random reads. For user-facing analytics and drilling down into the data, these chunks tend to be relatively few, and I'd like to have them close to one another. For high-level statistics, however, I only care that the data for each chunk of work is contiguous, since I'm going to read all chunks eventually anyway.
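To make the "minimum of skips" point concrete, here is a rough sketch that counts how many contiguous runs a selection touches in a row-major array. The helper `contiguous_runs` and the (time, lat, lon) shape are made up for illustration, and it ignores chunking and compression entirely:

```python
import numpy as np

def contiguous_runs(shape, selection):
    """Count contiguous runs of flat offsets touched by `selection`
    in a row-major (C-order) array of the given shape."""
    # Flat offset of every element, arranged in the array's own shape.
    offsets = np.arange(int(np.prod(shape))).reshape(shape)
    touched = np.sort(np.ravel(offsets[selection]))
    # Every gap between consecutive offsets starts a new run (a "skip").
    return 1 + int(np.count_nonzero(np.diff(touched) != 1))

shape = (10, 4, 5)  # hypothetical (time, lat, lon)

# One full time step: a single contiguous read.
one_time_step = contiguous_runs(shape, (3, slice(None), slice(None)))

# A time series at one grid point: one skip per time step.
one_grid_point = contiguous_runs(shape, (slice(None), 1, 2))
```

With this layout the time step is one run and the point time series is ten, which is the kind of difference a partitioning strategy is supposed to trade off.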

You can reach these goals with a partitioning strategy in HDF, zarr, or parquet, but you could also reach them with blob fields in a more traditional DB, be it relational, document-based, or whatever. Since all storage and memory is ultimately linear, I don't care whether a row-major or column-major array is populated from a 1d vector in columnar storage plus dimensionality metadata, or from an explicitly array-based storage format; I just trust that a table with good columnar compression doesn't waste too much storage on the coordinates that are implicit in (dense) array storage.
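As a minimal sketch of the "1d vector plus dimensionality metadata" idea, using only NumPy: the `metadata` dict here is a stand-in I made up for whatever the storage format would actually carry (parquet key-value metadata, a sidecar file, etc.), not any particular format's API.

```python
import numpy as np

# Hypothetical dense field: 2 time steps x 3 latitudes x 4 longitudes.
shape = (2, 3, 4)
field = np.arange(24, dtype=np.float32).reshape(shape)

# "Columnar" form: one flat value column plus the shape as metadata.
values = field.ravel(order="C")
metadata = {"shape": shape, "order": "C"}  # assumed metadata layout

# Reader side: rebuild the nd-array from the flat column and the metadata.
restored = values.reshape(metadata["shape"], order=metadata["order"])

# The same works for column-major, as long as writer and reader agree.
col_major = field.ravel(order="F").reshape(shape, order="F")
```

Both round trips give back the original array; the only thing that matters is that the order is recorded somewhere.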

Often, I've found that even climatological data _as it pertains to a specific analytic scenario_ is actually a sparse subset of an originally dense nd-array, e.g. only looking at data over land. This has led me to advocate for more tabular approaches, but this is very domain specific.
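A toy version of the land-only case, again just NumPy with a made-up grid and land mask: the dense array becomes a table with one row per land cell and explicit index columns, and can be scattered back into a dense grid when needed.

```python
import numpy as np

# Hypothetical 3 x 4 lat/lon grid with an invented land mask.
temps = np.arange(12, dtype=np.float32).reshape(3, 4)
land = np.array([[True, False, False, True],
                 [False, False, True, True],
                 [False, False, False, False]])

# Tabular form: one row per land cell, coordinates stored explicitly.
lat_idx, lon_idx = np.nonzero(land)
table = {"lat_idx": lat_idx, "lon_idx": lon_idx, "temp": temps[land]}

# Scatter back into a dense grid, with NaN over the ocean.
dense = np.full(temps.shape, np.nan, dtype=np.float32)
dense[table["lat_idx"], table["lon_idx"]] = table["temp"]
```

Here the table holds 4 rows instead of 12 cells; whether the explicit index columns pay off depends on how sparse the subset is and how well those columns compress.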