by jiggawatts 1359 days ago
"Just works" and "I've been waiting 15 minutes now for this file to un-gzip" aren't compatible in my book, especially on a computer that should be able to process that file in seconds.

Also, I'd love to see someone open a 75 GB FASTQ file in Excel.

3 comments

VCF at least typically uses bgzip, which is essentially gzipped sections concatenated, so it can be decompressed in parallel and seeked into for random access. CRAM is parallelisable in the same way. Maybe you just don't know the formats and tooling so well? I'm not sure anyone opens a FASTQ directly for viewing anymore, but they will want pileups from a BAM. The problem with bio formats isn't that they're text, it's that they're shit text formats too.
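To make the "gzipped sections concatenated" point concrete, here is a minimal sketch using only the Python standard library. It is not bgzip itself: real BGZF records each block's size in a gzip extra field and relies on a separate index file, whereas this toy keeps the offsets in a plain list. The sample records and the `read_member` helper are made up for illustration.

```python
import gzip
import zlib

# Three independent "sections", each compressed as its own gzip member.
blocks = [b"chr1\t100\tA\tG\n", b"chr1\t200\tC\tT\n", b"chr2\t300\tG\tA\n"]
members = [gzip.compress(b) for b in blocks]

# Concatenated gzip members are still one valid gzip stream.
stream = b"".join(members)
whole = gzip.decompress(stream)

# Toy index: byte offset where each member starts (an assumption of this
# sketch; bgzip/tabix store this information differently).
offsets = []
pos = 0
for m in members:
    offsets.append(pos)
    pos += len(m)

def read_member(data: bytes, offset: int) -> bytes:
    """Decompress only the gzip member that starts at `offset`."""
    d = zlib.decompressobj(wbits=31)  # 31 = expect a gzip header and trailer
    return d.decompress(data[offset:])  # stops at the end of that member

second = read_member(stream, offsets[1])
```

Because every member stands alone, different workers can each take a different offset, which is what makes this layout parallelisable and seekable.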
CRAM is a great example for some of the other people in the thread who say "just get a better format". There's been slow uptake in the larger community despite the benefits. For anyone looking to Solve Bioinformatics File Formats, it's important to understand why this is the case.
Nearly all bioinfo tools operate in streaming mode, which means line-based gzipped formats work great: you can overlap the processing with reading the file. Nobody ever unzips the whole file before starting to process it.
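A rough sketch of what streaming looks like in practice, assuming well-formed four-line FASTQ records. The function names `fastq_records` and `mean_read_length` are invented for this example and do not come from any particular tool. The file is decompressed lazily as lines are consumed, so memory use stays flat regardless of file size.

```python
import gzip
import io
from itertools import islice

def fastq_records(handle):
    """Yield (name, sequence, quality) from an open text handle, 4 lines at a time."""
    while True:
        chunk = list(islice(handle, 4))
        if len(chunk) < 4:  # end of file (or a truncated trailing record)
            return
        header, seq, _plus, qual = (line.rstrip("\n") for line in chunk)
        yield header[1:], seq, qual  # drop the leading '@'

def mean_read_length(path_or_file):
    """Stream a gzipped FASTQ and return the mean read length."""
    n = total = 0
    # gzip.open decompresses incrementally; nothing is unpacked to disk.
    with gzip.open(path_or_file, "rt") as handle:
        for _name, seq, _qual in fastq_records(handle):
            n += 1
            total += len(seq)
    return total / n if n else 0.0

# Tiny in-memory stand-in for a real .fastq.gz file.
demo = io.BytesIO(gzip.compress(b"@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n"))
demo_mean = mean_read_length(demo)
```

The same function accepts a path such as a `.fastq.gz` file on disk, since `gzip.open` takes either a filename or a file object.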
FASTQ is not for Excel, obviously - although you can still explore it in the shell. Nonetheless operating directly on FASTA/FASTQ files is often a "one-time" preprocessing task. You then serialize the preprocessed data and continue on from there.

FASTA (and its various incarnations) is not going anywhere anytime soon.