by vamin 3005 days ago
An RNA sequencing run generates on the order of 10 GB of data, a typical study requires many runs (treatments, controls, replication of results, etc.), and most biology journals require that the raw data be posted. I'm not surprised that there is over 1 PB of data available to curate.
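As a rough back-of-the-envelope check on that figure, here is a minimal sketch of the arithmetic. It assumes decimal units and treats every run as exactly 10 GB, which is a simplification of "on the order of 10 GB"; the helper name `runs_to_reach` is made up for illustration.

```python
# Back-of-the-envelope: how many ~10 GB runs add up to 1 PB?
# Assumptions: decimal units, and every run is exactly 10 GB.
GB = 10**9   # decimal gigabyte, in bytes
PB = 10**15  # decimal petabyte, in bytes


def runs_to_reach(total_bytes: int, bytes_per_run: int) -> int:
    """Smallest number of runs whose combined size is at least total_bytes."""
    return -(-total_bytes // bytes_per_run)  # ceiling division on integers


runs_needed = runs_to_reach(1 * PB, 10 * GB)
print(runs_needed)  # 100000
```

So roughly a hundred thousand runs across all published studies would be enough to reach a petabyte under these assumptions.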

Oh, you mean BAM files? Get yourself a retention policy; you don't need to keep RNA BAM files that long.

I thought you meant derived data.

I'm talking about the raw reads, which are important if you want to try a different alignment or base-calling method. You can debate how important it is to be able to do that, but I'm not trying to argue that the data should be kept. I was just explaining why the total size of publicly available RNA-seq data (the sum total of which the parent is attempting to organize) runs into the petabytes.
So, do you or the original poster actually have a materialized petabyte of RNA data? Otherwise, you're just describing a million files spread over a million file servers, not being used for science or processed in any way.