by rabernat 1261 days ago
The Zarr format is used in some genomics workflows (see https://github.com/zarr-developers/community/issues/19) and supports a wide range of modern compressors (e.g. Zstd, Zlib, BZ2, LZMA, ZFPY, Blosc) as well as many filters.
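For a rough feel of how codec choice matters, here is a minimal sketch that compares three of the codecs named above (Zlib, BZ2, LZMA) using only the Python standard library. It does not use Zarr itself, and the sample data and the `compressed_sizes` helper are made up for illustration.

```python
import bz2
import lzma
import zlib
from array import array


def compressed_sizes(raw: bytes) -> dict:
    """Return the compressed size in bytes for each stdlib codec."""
    codecs = {
        "zlib": zlib.compress,
        "bz2": bz2.compress,
        "lzma": lzma.compress,
    }
    return {name: len(compress(raw)) for name, compress in codecs.items()}


# Hypothetical data: a slowly increasing ramp of unsigned 16-bit integers,
# standing in for a smooth, dense array.
raw = array("H", (i // 64 for i in range(100_000))).tobytes()

sizes = compressed_sizes(raw)
for name, size in sizes.items():
    print(f"{name}: {len(raw)} -> {size} bytes")
```

Real ratios depend heavily on the data, so treat the numbers this prints as illustrative only.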

I would use zarr for dense matrices, mostly (I use them with microscope images). I've also seen it used for frequency/spatial observations in genomic imaging. But I prefer parquet for most direct analysis of sequence data, since it's the format best integrated with big data analytics. I care much less about total compressed size than about being able to quickly decompress the data I need (say, to ETL it into a featurization pipeline).
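The "decompress only what I need" point can be sketched with independently compressed chunks, which is the general idea behind chunked formats. This is a stdlib-only illustration, not the Zarr or Parquet API; `CHUNK`, `compress_chunks` and `read_range` are hypothetical names.

```python
import zlib

CHUNK = 1024  # hypothetical chunk size in bytes


def compress_chunks(raw: bytes, chunk: int = CHUNK) -> list:
    """Compress each fixed-size chunk independently."""
    return [zlib.compress(raw[i:i + chunk]) for i in range(0, len(raw), chunk)]


def read_range(chunks: list, start: int, stop: int, chunk: int = CHUNK) -> bytes:
    """Return raw[start:stop], decompressing only the overlapping chunks."""
    if start >= stop:
        return b""
    first = start // chunk
    last = (stop - 1) // chunk
    data = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    offset = first * chunk  # position of data[0] in the original buffer
    return data[start - offset:stop - offset]


# Hypothetical payload: 10,000 bytes split into 10 chunks.
raw = bytes(i % 251 for i in range(10_000))
chunks = compress_chunks(raw)

# Only chunks 1 through 3 are decompressed to serve this request.
piece = read_range(chunks, 1500, 3100)
```

Smaller chunks make selective reads cheaper but usually compress worse, which is the size versus access-speed tradeoff described above.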
I grep'd for parquet and yours is the only comment that mentions it. parquet -> arrow -> AI
Many thanks. Q: what's the story around versioning, provenance, reproducibility (etc. ops) in your domain? I've seen various bolt-on/alongside-git variants. Wondering if it's worth the effort to make something to address that.