by dekhn 1262 days ago
I would use zarr for dense matrices, mostly (I use them with microscope images). I've also seen it used for frequency/spatial observations in genomic imaging. But I prefer parquet for most direct analysis of sequence data, since it's the format best integrated with big data analytics. I care much less about total compressed size than I do about the ability to quickly decompress the data I need (say, to ETL it into a featurization pipeline).
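To make the "decompress only what I need" point concrete, here is a toy sketch of the general idea behind chunked formats: each chunk is compressed on its own, so a range read only touches the chunks that overlap it. This is stdlib-only and is not the zarr or parquet API; the names `write_chunks`, `read_range`, and the chunk size are made up for illustration.

```python
import zlib
from array import array

CHUNK = 1000  # values per chunk (hypothetical size, chosen for the demo)

def write_chunks(values, chunk=CHUNK):
    """Compress each fixed-size run of values independently."""
    chunks = []
    for start in range(0, len(values), chunk):
        raw = array("d", values[start:start + chunk]).tobytes()
        chunks.append(zlib.compress(raw))
    return chunks

def read_range(chunks, lo, hi, chunk=CHUNK):
    """Return (values[lo:hi], chunks_touched), decompressing only overlapping chunks."""
    out = array("d")
    touched = 0
    if lo >= hi:
        return [], touched
    for idx in range(lo // chunk, (hi - 1) // chunk + 1):
        block = array("d")
        block.frombytes(zlib.decompress(chunks[idx]))
        touched += 1
        base = idx * chunk  # global index of the first value in this chunk
        out.extend(block[max(lo - base, 0):min(hi - base, len(block))])
    return list(out), touched

data = [float(i) for i in range(10_000)]
chunks = write_chunks(data)
values, touched = read_range(chunks, 2500, 3200)
print(len(chunks), "chunks stored;", touched, "decompressed for the read")
```

The tradeoff is the usual one: smaller chunks mean less wasted decompression per read but typically a worse overall compression ratio, which matches caring more about access speed than total size.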

I grep'd for parquet and yours is the only comment that mentions it. parquet -> arrow -> AI
Many thanks. Q: what's the story around versioning, provenance, reproducibility (etc. ops) in your domain? I've seen various bolt-on/alongside-git variants. Wondering if it's worth the effort to make something to address that.